enterprise-agent-ops

Enterprise Agent Ops

Use this skill for cloud-hosted or continuously running agent systems that need operational controls beyond what a single CLI session provides.

Operational Domains

  1. runtime lifecycle (start, pause, stop, restart)
  2. observability (logs, metrics, traces)
  3. safety controls (scopes, permissions, kill switches)
  4. change management (rollout, rollback, audit)
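
The runtime lifecycle domain can be modeled as a small state machine. The sketch below is one possible shape, not a prescribed implementation: the state names, the `AgentLifecycle` class, and the extra `resume` action (needed to leave the paused state without a full restart) are assumptions that do not appear in the list above.

```python
from enum import Enum


class State(Enum):
    STOPPED = "stopped"
    RUNNING = "running"
    PAUSED = "paused"


# Allowed transitions, keyed by action: {current state: next state}.
# "resume" is an assumed action; the others come from the lifecycle list.
TRANSITIONS = {
    "start": {State.STOPPED: State.RUNNING},
    "pause": {State.RUNNING: State.PAUSED},
    "resume": {State.PAUSED: State.RUNNING},
    "stop": {State.RUNNING: State.STOPPED, State.PAUSED: State.STOPPED},
    "restart": {State.RUNNING: State.RUNNING, State.PAUSED: State.RUNNING},
}


class AgentLifecycle:
    """Hypothetical lifecycle tracker that rejects invalid transitions."""

    def __init__(self):
        self.state = State.STOPPED
        # (action, from_state, to_state) tuples, kept for audit purposes.
        self.history = []

    def apply(self, action: str) -> State:
        allowed = TRANSITIONS.get(action, {})
        if self.state not in allowed:
            raise ValueError(f"cannot {action} from {self.state.value}")
        new_state = allowed[self.state]
        self.history.append((action, self.state, new_state))
        self.state = new_state
        return new_state
```

Rejecting invalid transitions explicitly (for example, pausing a stopped agent) keeps the recorded history usable as an audit trail for change management.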

Baseline Controls

  • immutable deployment artifacts
  • least-privilege credentials
  • environment-level secret injection
  • hard timeout and retry budgets
  • audit log for high-risk actions
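
As an illustration of the timeout and retry budget control, here is a minimal sketch. The function name `run_with_budget`, its parameters, and the default values are assumptions. Note that the deadline is only checked between attempts, so this does not interrupt a task that is already running; a truly hard timeout would need process-level enforcement.

```python
import time


class BudgetExceeded(Exception):
    """Raised when the attempt or time budget runs out."""


def run_with_budget(task, max_attempts=3, deadline_seconds=30.0,
                    clock=time.monotonic):
    """Call task() until it succeeds, attempts run out, or the deadline passes.

    Returns (result, attempts_used). The defaults are illustrative only.
    """
    start = clock()
    last_error = None
    for attempt in range(1, max_attempts + 1):
        # Cooperative deadline check: evaluated before each attempt.
        if clock() - start >= deadline_seconds:
            raise BudgetExceeded(
                f"deadline hit before attempt {attempt}") from last_error
        try:
            return task(), attempt
        except Exception as exc:  # a real system would narrow this
            last_error = exc
    raise BudgetExceeded(
        f"gave up after {max_attempts} attempts") from last_error
```

Returning the attempt count alongside the result makes it straightforward to feed a retries-per-task metric.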

Metrics to Track

  • success rate
  • mean retries per task
  • time to recovery
  • cost per successful task
  • failure class distribution
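
The metrics above can be derived from per-task records. The sketch below assumes a hypothetical `TaskRecord` shape and makes one choice the list does not specify: cost per successful task is computed as total spend (including failed tasks) divided by the number of successes. Time to recovery is computed separately from assumed (detected_at, recovered_at) timestamp pairs in seconds.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskRecord:
    """Hypothetical per-task record; field names are assumptions."""
    succeeded: bool
    retries: int
    cost: float  # any consistent currency unit
    failure_class: Optional[str] = None


def summarize(records):
    total = len(records)
    if total == 0:
        raise ValueError("no records to summarize")
    successes = sum(1 for r in records if r.succeeded)
    total_cost = sum(r.cost for r in records)
    failures = Counter(r.failure_class for r in records if not r.succeeded)
    return {
        "success_rate": successes / total,
        "mean_retries_per_task": sum(r.retries for r in records) / total,
        # Assumed definition: failed tasks still count toward the spend.
        "cost_per_successful_task":
            total_cost / successes if successes else None,
        "failure_class_distribution": dict(failures),
    }


def mean_time_to_recovery(incidents):
    """incidents: list of (detected_at, recovered_at) pairs in seconds."""
    if not incidents:
        return None
    return sum(end - start for start, end in incidents) / len(incidents)
```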

Incident Pattern

When failures spike:
  1. freeze new rollout
  2. capture representative traces
  3. isolate failing route
  4. patch with smallest safe change
  5. run regression + security checks
  6. resume gradually
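
The first step of this pattern (freezing rollout when failures spike) can be automated with a sliding-window check. This is a sketch under stated assumptions: the `SpikeGuard` class, the window size, the threshold, and the minimum sample count are all hypothetical and would need tuning per system.

```python
from collections import deque


class SpikeGuard:
    """Hypothetical guard that freezes rollout on a failure-rate spike."""

    def __init__(self, window=50, threshold=0.2, min_samples=10):
        self.outcomes = deque(maxlen=window)  # most recent task outcomes
        self.threshold = threshold            # failure rate that trips the guard
        self.min_samples = min_samples        # avoid tripping on tiny samples
        self.rollout_frozen = False

    def record(self, succeeded: bool) -> bool:
        """Record one outcome and return whether rollout is frozen."""
        self.outcomes.append(succeeded)
        if len(self.outcomes) >= self.min_samples:
            failure_rate = self.outcomes.count(False) / len(self.outcomes)
            if failure_rate > self.threshold:
                # Latches: stays frozen until an operator calls resume().
                self.rollout_frozen = True
        return self.rollout_frozen

    def resume(self):
        """Operator action after the patch passes regression and security checks."""
        self.outcomes.clear()
        self.rollout_frozen = False
```

The freeze is latched on purpose: a brief run of successes should not silently re-enable rollout before someone has captured traces and isolated the failing route.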

Deployment Integrations

This skill pairs with:
  • PM2 workflows
  • systemd services
  • container orchestrators
  • CI/CD gates
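
A CI/CD gate can be as simple as a script that compares recent metrics against thresholds and returns a non-zero exit code on violation. The sketch below is tool-agnostic; the metric names, the threshold values, and the `gate_violations` and `main` functions are assumptions for illustration, not part of any specific CI system.

```python
import sys

# Hypothetical thresholds: ("min", x) means the value must be >= x,
# ("max", x) means the value must be <= x.
GATE_THRESHOLDS = {
    "success_rate": ("min", 0.95),
    "mean_retries_per_task": ("max", 1.5),
}


def gate_violations(metrics, thresholds=GATE_THRESHOLDS):
    """Return a list of human-readable violations (empty means pass)."""
    violations = []
    for name, (kind, limit) in thresholds.items():
        value = metrics.get(name)
        if value is None:
            # Missing data fails the gate rather than passing silently.
            violations.append(f"{name}: missing")
        elif kind == "min" and value < limit:
            violations.append(f"{name}: {value} is below minimum {limit}")
        elif kind == "max" and value > limit:
            violations.append(f"{name}: {value} is above maximum {limit}")
    return violations


def main(metrics) -> int:
    """Return a process exit code: 0 to allow the rollout, 1 to block it."""
    problems = gate_violations(metrics)
    for problem in problems:
        print(problem, file=sys.stderr)
    return 1 if problems else 0
```

A pipeline step would typically load the metrics from wherever they are stored and pass the result of `main(...)` to `sys.exit`.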