enterprise-agent-ops

Enterprise Agent Ops

Use this skill for cloud-hosted or continuously running agent systems that need operational controls beyond what a single CLI session provides.

Operational Domains

  1. runtime lifecycle (start, pause, stop, restart)
  2. observability (logs, metrics, traces)
  3. safety controls (scopes, permissions, kill switches)
  4. change management (rollout, rollback, audit)
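As one way to picture the runtime lifecycle domain, the sketch below models start, pause, stop, and restart as a small state machine that rejects invalid commands and keeps a history usable for audit. The state names, the transition table, and the AgentLifecycle class are illustrative assumptions, not part of this skill.

```python
from enum import Enum


class State(Enum):
    STOPPED = "stopped"
    RUNNING = "running"
    PAUSED = "paused"


# Assumed transition policy: (command, current state) -> next state.
TRANSITIONS = {
    ("start", State.STOPPED): State.RUNNING,
    ("start", State.PAUSED): State.RUNNING,      # start doubles as resume
    ("pause", State.RUNNING): State.PAUSED,
    ("stop", State.RUNNING): State.STOPPED,
    ("stop", State.PAUSED): State.STOPPED,
    ("restart", State.RUNNING): State.RUNNING,
    ("restart", State.PAUSED): State.RUNNING,
}


class AgentLifecycle:
    """Hypothetical lifecycle tracker for one agent process."""

    def __init__(self):
        self.state = State.STOPPED
        self.history = []  # (command, old state, new state) tuples

    def apply(self, command: str) -> State:
        key = (command, self.state)
        if key not in TRANSITIONS:
            raise ValueError(f"cannot {command} while {self.state.value}")
        new_state = TRANSITIONS[key]
        self.history.append((command, self.state, new_state))
        self.state = new_state
        return new_state
```

Rejecting invalid commands outright, rather than ignoring them, keeps operator mistakes visible in logs; a real deployment would delegate the actual process control to its supervisor.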

Baseline Controls

  • immutable deployment artifacts
  • least-privilege credentials
  • environment-level secret injection
  • hard timeout and retry budgets
  • audit log for high-risk actions
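A minimal sketch of how a retry budget, a deadline, and an audit trail might fit together. The function name run_with_budget, its parameters, and the audit record shape are assumptions made for illustration. Note that the deadline here is only checked between attempts; a truly hard timeout would need the task to run in a process that can be killed.

```python
import time


class BudgetExceeded(Exception):
    """Raised when attempts or time run out before the task succeeds."""


def run_with_budget(task, max_attempts=3, deadline_seconds=30.0,
                    audit=None, clock=time.monotonic):
    """Call task() until it succeeds, attempts run out, or the deadline passes.

    audit, if given, is a list that receives one record per attempt.
    """
    start = clock()
    last_error = None
    for attempt in range(1, max_attempts + 1):
        # Deadline is checked before each attempt, not during one.
        if clock() - start >= deadline_seconds:
            break
        try:
            result = task()
        except Exception as exc:
            last_error = exc
            if audit is not None:
                audit.append({"attempt": attempt, "outcome": "error",
                              "error": repr(exc)})
            continue
        if audit is not None:
            audit.append({"attempt": attempt, "outcome": "ok"})
        return result
    raise BudgetExceeded(f"retry or time budget exhausted: {last_error!r}")
```

Capping both attempts and elapsed time bounds the worst-case cost of a failing task, which also keeps the mean-retries and cost-per-task numbers meaningful.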

Metrics to Track

  • success rate
  • mean retries per task
  • time to recovery
  • cost per successful task
  • failure class distribution
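The metrics above can be derived from per-task records, as in the sketch below. The TaskRecord fields, the summarize function, and the incident format (pairs of detection and recovery timestamps in seconds) are hypothetical. One assumption worth noting: cost per successful task divides total spend, including failed tasks, by the number of successes.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskRecord:
    succeeded: bool
    retries: int
    cost: float                          # assumed unit: dollars
    failure_class: Optional[str] = None  # e.g. "timeout", set only on failure


def summarize(records, incidents=()):
    """Summarize task records; incidents are (detected_at, recovered_at) pairs."""
    total = len(records)
    successes = sum(1 for r in records if r.succeeded)
    total_cost = sum(r.cost for r in records)
    failures = Counter(r.failure_class for r in records if not r.succeeded)
    recoveries = [end - start for start, end in incidents]
    return {
        "success_rate": successes / total if total else None,
        "mean_retries_per_task":
            sum(r.retries for r in records) / total if total else None,
        "mean_time_to_recovery_s":
            sum(recoveries) / len(recoveries) if recoveries else None,
        # Failed tasks still cost money, so they count toward the numerator.
        "cost_per_successful_task":
            total_cost / successes if successes else None,
        "failure_class_distribution": dict(failures),
    }
```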

Incident Pattern

When failure spikes:
  1. freeze new rollout
  2. capture representative traces
  3. isolate failing route
  4. patch with smallest safe change
  5. run regression + security checks
  6. resume gradually
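The freeze and gradual-resume steps of this pattern could be sketched as a small gate that watches a sliding window of task outcomes. The RolloutGate class, its window size, its threshold, and its resume percentages are all assumptions chosen for illustration. Here traffic_percent stands for the share of traffic sent to the affected route, so freezing also isolates it.

```python
from collections import deque


class RolloutGate:
    """Freezes a route when recent failures spike, then resumes in steps."""

    def __init__(self, window=20, threshold=0.3, resume_steps=(5, 25, 50, 100)):
        self.outcomes = deque(maxlen=window)  # True = success, False = failure
        self.threshold = threshold
        self.resume_steps = resume_steps
        self.frozen = False
        self.traffic_percent = 100
        self._step = len(resume_steps) - 1

    def failure_rate(self) -> float:
        if not self.outcomes:
            return 0.0
        return self.outcomes.count(False) / len(self.outcomes)

    def record(self, succeeded: bool) -> None:
        self.outcomes.append(succeeded)
        # Judge only a full window so one early failure does not trigger a freeze.
        window_full = len(self.outcomes) == self.outcomes.maxlen
        if window_full and self.failure_rate() > self.threshold:
            self.freeze()

    def freeze(self) -> None:
        self.frozen = True
        self.traffic_percent = 0

    def resume_next_step(self, checks_passed: bool) -> int:
        """Advance one step, but only if regression and security checks passed."""
        if not checks_passed:
            return self.traffic_percent
        if self.frozen:
            self.frozen = False
            self.outcomes.clear()  # start judging the patched route afresh
            self._step = 0
        elif self._step < len(self.resume_steps) - 1:
            self._step += 1
        self.traffic_percent = self.resume_steps[self._step]
        return self.traffic_percent
```

Capturing traces and writing the patch remain human or tooling steps outside this sketch; the gate only enforces that traffic returns in increments and never while checks are failing.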

Deployment Integrations

This skill pairs with:
  • PM2 workflows
  • systemd services
  • container orchestrators
  • CI/CD gates