enterprise-agent-ops
Enterprise Agent Ops
Use this skill for cloud-hosted or continuously running agent systems that need operational controls beyond single CLI sessions.
Operational Domains
- runtime lifecycle (start, pause, stop, restart)
- observability (logs, metrics, traces)
- safety controls (scopes, permissions, kill switches)
- change management (rollout, rollback, audit)
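As a rough illustration of the lifecycle and kill-switch domains above, here is a minimal sketch of a state machine that records every transition. The names `AgentState`, `TRANSITIONS`, and `AgentRuntime` are hypothetical and chosen for this example only; a real system would likely persist the audit trail rather than keep it in memory.

```python
from enum import Enum


class AgentState(Enum):
    STOPPED = "stopped"
    RUNNING = "running"
    PAUSED = "paused"


# Hypothetical transition table: which lifecycle actions are legal in each state.
TRANSITIONS = {
    (AgentState.STOPPED, "start"): AgentState.RUNNING,
    (AgentState.PAUSED, "start"): AgentState.RUNNING,
    (AgentState.RUNNING, "pause"): AgentState.PAUSED,
    (AgentState.RUNNING, "restart"): AgentState.RUNNING,
    (AgentState.RUNNING, "stop"): AgentState.STOPPED,
    (AgentState.PAUSED, "stop"): AgentState.STOPPED,
}


class AgentRuntime:
    def __init__(self):
        self.state = AgentState.STOPPED
        self.kill_switch = False
        self.audit = []  # (action, old_state, new_state) tuples

    def apply(self, action):
        # Once the kill switch is engaged, only "stop" is accepted.
        if self.kill_switch and action != "stop":
            raise PermissionError(f"kill switch engaged, refusing {action!r}")
        key = (self.state, action)
        if key not in TRANSITIONS:
            raise ValueError(f"cannot {action!r} while {self.state.value}")
        new_state = TRANSITIONS[key]
        self.audit.append((action, self.state.value, new_state.value))
        self.state = new_state
        return new_state

    def kill(self):
        # Engage the kill switch and stop the agent if it is not stopped already.
        self.kill_switch = True
        if self.state is not AgentState.STOPPED:
            self.apply("stop")
```

Keeping the legal transitions in one table makes it easy to review which actions an operator can take in each state, and the audit list gives change management something to inspect.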
Baseline Controls
- immutable deployment artifacts
- least-privilege credentials
- environment-level secret injection
- hard timeout and retry budgets
- audit log for high-risk actions
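The timeout and retry budget control can be sketched as a small wrapper. This is one possible shape, not a prescribed implementation: `run_with_budget` and `BudgetExceeded` are hypothetical names, and the deadline here is only checked between attempts, so it does not preempt a call that is already running. A truly hard timeout would need process-level or orchestrator-level enforcement.

```python
import time


class BudgetExceeded(Exception):
    """Raised when the retry budget or the overall deadline is used up."""


def run_with_budget(task, max_attempts=3, deadline_s=30.0, clock=time.monotonic):
    """Call task() until it succeeds or a budget runs out.

    Returns (result, attempts_used). The clock is injectable so the
    deadline logic can be tested without sleeping.
    """
    start = clock()
    last_error = None
    for attempt in range(1, max_attempts + 1):
        # Check the deadline before each attempt (assumption: no preemption).
        if clock() - start >= deadline_s:
            raise BudgetExceeded(
                f"deadline of {deadline_s}s exhausted before attempt {attempt}"
            ) from last_error
        try:
            return task(), attempt
        except Exception as exc:  # broad on purpose: any failure consumes budget
            last_error = exc
    raise BudgetExceeded(
        f"retry budget of {max_attempts} attempts exhausted"
    ) from last_error
```

Returning the attempt count alongside the result makes it straightforward to feed a retries-per-task metric.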
Metrics to Track
- success rate
- mean retries per task
- time to recovery
- cost per successful task
- failure class distribution
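Most of the metrics above can be derived from per-task records. The sketch below assumes a hypothetical `TaskRecord` shape and a `summarize` helper; neither comes from the skill itself. It also assumes that cost per successful task divides total spend, including the cost of failed tasks, by the number of successes. Time to recovery is left out because it needs incident timestamps rather than task records.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskRecord:
    succeeded: bool
    retries: int
    cost_usd: float
    failure_class: Optional[str] = None  # e.g. "timeout"; None on success


def summarize(records):
    """Aggregate a non-empty list of TaskRecord into the tracked metrics."""
    total = len(records)
    if total == 0:
        raise ValueError("no task records to summarize")
    successes = sum(1 for r in records if r.succeeded)
    total_cost = sum(r.cost_usd for r in records)
    failure_classes = Counter(
        r.failure_class or "unknown" for r in records if not r.succeeded
    )
    return {
        "success_rate": successes / total,
        "mean_retries_per_task": sum(r.retries for r in records) / total,
        # Failed tasks still cost money, so they count toward the numerator.
        "cost_per_successful_task": total_cost / successes if successes else None,
        "failure_class_distribution": dict(failure_classes),
    }
```

For example, two successes and two failures with a total spend of 2.00 give a success rate of 0.5 and a cost per successful task of 1.00.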
Incident Pattern
When failure spikes:
- freeze new rollouts
- capture representative traces
- isolate the failing route
- patch with smallest safe change
- run regression + security checks
- resume gradually
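The freeze and gradual-resume steps above could be automated along these lines. This is a sketch under stated assumptions: `RolloutGate`, the sliding-window size, the failure threshold, and the traffic ramp percentages are all hypothetical values picked for illustration. Capturing traces, isolating the route, and patching remain manual steps outside the code.

```python
from collections import deque


class RolloutGate:
    def __init__(self, window=20, threshold=0.5, ramp=(5, 25, 50, 100)):
        self.outcomes = deque(maxlen=window)  # most recent task outcomes
        self.threshold = threshold            # failure rate that triggers a freeze
        self.ramp = ramp                      # traffic percentages used on resume
        self.step = len(ramp) - 1             # start at full traffic
        self.frozen = False

    def record(self, succeeded):
        """Record one outcome; freeze once a full window is too unhealthy."""
        self.outcomes.append(succeeded)
        window_full = len(self.outcomes) == self.outcomes.maxlen
        failure_rate = self.outcomes.count(False) / len(self.outcomes)
        if window_full and failure_rate >= self.threshold:
            self.frozen = True
        return self.frozen

    def resume(self):
        """Call after the patch has passed regression and security checks."""
        self.frozen = False
        self.outcomes.clear()
        self.step = 0  # restart at the smallest traffic share

    def advance(self):
        """Move one step up the ramp; refuse while frozen."""
        if self.frozen:
            raise RuntimeError("rollout is frozen")
        self.step = min(self.step + 1, len(self.ramp) - 1)
        return self.ramp[self.step]

    @property
    def traffic_percent(self):
        return 0 if self.frozen else self.ramp[self.step]
```

Requiring a full window before freezing avoids tripping on the first few failures after a restart; whether that trade-off is right depends on traffic volume.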
Deployment Integrations
This skill pairs with:
- PM2 workflows
- systemd services
- container orchestrators
- CI/CD gates
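For the CI/CD gate integration, one possible approach is a small check that a pipeline step runs before promoting a build. The function names, metric keys, and thresholds below are hypothetical and only show the shape of such a gate; how the exit code is consumed depends on the CI system in use.

```python
def evaluate_gate(metrics, min_success_rate=0.95, max_mean_retries=1.0):
    """Return a list of human-readable reasons the gate fails (empty if it passes)."""
    failures = []
    if metrics["success_rate"] < min_success_rate:
        failures.append(
            f"success_rate {metrics['success_rate']:.2f} is below {min_success_rate:.2f}"
        )
    if metrics["mean_retries_per_task"] > max_mean_retries:
        failures.append(
            f"mean_retries_per_task {metrics['mean_retries_per_task']:.2f} "
            f"is above {max_mean_retries:.2f}"
        )
    return failures


def gate_exit_code(metrics, **thresholds):
    """Map the gate result to a process exit code: 0 to promote, 1 to block."""
    return 1 if evaluate_gate(metrics, **thresholds) else 0
```

A pipeline step could pass `gate_exit_code(...)` to `sys.exit` so that a non-zero status blocks the rollout.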