ai-lead-ops
AI Lead Ops
When to Use
- Standing up AI platform operations and production service reliability
- Defining SLAs/SLOs for LLM-powered features
- Running AI incident reviews and post-mortems
- Governing model, prompt, and index rollouts with tiered gates
- Tracking AI unit economics (cost per session, tokens per feature)
- Coordinating red-team and evaluation gates before releases
- Building team rituals and cadence across engineering, research, risk, and product
- Managing AI vendor relationships, contracts, and bake-offs
When NOT to Use
- Implementing memory stores or context packing code → ai-memory-developer / ai-context-engineer
- Building RAG pipelines or agent tools → ai-engineer
- Designing corporate AI policy or regulatory mapping → ai-risk-governance
- General network penetration testing or enterprise security programs → cybersecurity
- Structured token/cost improvement roadmaps with backlog → ai-token-improvement-plan-engineer
- Commercial/enterprise AI solution architecture → applied-ai-architect-commercial-enterprise
- Vertical AI product engineering managers and squad roadmaps → engineering-manager-vertical-ai-products
Related skills
| Need | Skill |
|---|---|
| Build RAG, agents, eval harnesses | |
| Memory and context implementation | |
| Risk tiering and policies | |
| Adversarial testing execution | |
| CI/CD and platform incidents | |
| Pipeline security | |
| Token optimization roadmap and initiative backlog | |
| Commercial/enterprise AI architecture | |
| Skills portfolio governance | |
| Safeguard inference platform | |
| Safety classifier research | |
Core Workflows
1. Operating model and cadence
| Ritual | Frequency | Outcomes |
|---|---|---|
| AI ops standup | Daily | Blockers, incidents, deploys |
| Model/prompt change review | Per release | Approvers, eval delta |
| Cost review | Weekly | Spend vs budget, top features |
| Risk & safety sync | Bi-weekly | Incidents, policy gaps |
| Quarterly capacity | Quarterly | Model roadmap, vendor contracts |
Define RACI: who owns model, prompt, index, eval suite, on-call.
See references/operating_model.md for roles and escalation.
2. Release governance
Production promotion checklist:
- Eval regression passed on golden + safety set
- Red-team sign-off for tier-2+ use cases
- Model card / change log updated
- Canary with error and cost monitors
- Rollback procedure tested (previous prompt + model version pinned)
- Comms plan for customer-visible behavior change
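
The checklist above can be encoded as an automated gate that runs before promotion. The following is a minimal sketch, not a prescribed implementation: the `ReleaseCandidate` type, its field names, and `promotion_blockers` are hypothetical, and treating the comms plan as required only when a change is customer-visible is an assumed reading of the last item.

```python
from dataclasses import dataclass


@dataclass
class ReleaseCandidate:
    """Hypothetical record of checklist state; field names are illustrative."""
    risk_tier: int                 # tier number of the use case
    eval_regression_passed: bool   # golden + safety set
    red_team_signed_off: bool
    change_log_updated: bool       # model card / change log
    canary_monitors_ready: bool    # error and cost monitors
    rollback_tested: bool          # previous prompt + model version pinned
    customer_visible_change: bool
    comms_plan_ready: bool


def promotion_blockers(rc: ReleaseCandidate) -> list[str]:
    """Return unmet checklist items; an empty list means ready to promote."""
    blockers = []
    if not rc.eval_regression_passed:
        blockers.append("eval regression on golden + safety set")
    # Red-team sign-off applies to tier-2+ use cases only.
    if rc.risk_tier >= 2 and not rc.red_team_signed_off:
        blockers.append("red-team sign-off")
    if not rc.change_log_updated:
        blockers.append("model card / change log")
    if not rc.canary_monitors_ready:
        blockers.append("canary error and cost monitors")
    if not rc.rollback_tested:
        blockers.append("rollback procedure")
    # Assumption: a comms plan is needed only for customer-visible changes.
    if rc.customer_visible_change and not rc.comms_plan_ready:
        blockers.append("comms plan")
    return blockers
```

Returning the list of blockers rather than a single boolean lets the change review show exactly which items are still open.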
See references/release_governance.md for tiered gates and canary metrics.
3. SLOs, incidents, and observability
Example SLIs:
| SLI | Notes |
|---|---|
| Availability | Successful completion / total requests |
| Latency | p95 end-to-end |
| Quality proxy | Thumbs-down rate, escalation rate |
| Safety | Policy violation rate post-deploy |
| Cost | USD per successful session |
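
As an illustration of how several of these SLIs might be computed from request logs, here is a rough sketch. The `RequestRecord` type and its fields are hypothetical, p95 uses the nearest-rank method (one of several common percentile definitions), and each record is assumed to represent one session for the cost SLI.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestRecord:
    """Hypothetical per-request log record; field names are assumptions."""
    succeeded: bool
    latency_ms: float
    cost_usd: float
    thumbs_down: bool = False


def compute_slis(records: list[RequestRecord]) -> dict[str, float]:
    """Compute availability, p95 latency, thumbs-down rate, and cost SLIs."""
    if not records:
        raise ValueError("need at least one request record")
    total = len(records)
    successes = sum(r.succeeded for r in records)
    # Nearest-rank p95: smallest latency with at least 95% of requests
    # at or below it. Integer ceil avoids floating-point rounding.
    latencies = sorted(r.latency_ms for r in records)
    rank = (95 * total + 99) // 100
    total_cost = sum(r.cost_usd for r in records)
    return {
        "availability": successes / total,
        "latency_p95_ms": latencies[rank - 1],
        "thumbs_down_rate": sum(r.thumbs_down for r in records) / total,
        # Failed requests still cost money, so all spend is divided
        # by the number of successful sessions.
        "cost_per_success_usd": total_cost / successes if successes else math.inf,
    }
```

Dividing total spend by successful sessions only means that retries and failures raise the cost SLI, which is usually the behavior a cost review wants to surface.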
AI incident types: toxic output, PII leak in logs, retrieval cross-tenant leak, runaway agent loop, vendor outage.
See references/incidents_slos.md for severity matrix and post-incident template.
4. Cost and capacity
- Track tokens by model, feature, tenant
- Set budgets and alerts at 80/100/110%
- Optimize via routing, caching, context engineering (partner with ai-context-engineer)
- Forecast from usage growth + model price changes
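
The token tracking and 80/100/110% alert rule above could be sketched as follows. The usage-event dictionaries, their keys, and both function names are hypothetical; real billing exports will have a different shape.

```python
from collections import defaultdict

# Alert thresholds as integer percentages of budget (80/100/110%).
ALERT_THRESHOLDS_PCT = (80, 100, 110)


def tokens_by_key(events, key):
    """Sum tokens per model, feature, or tenant.

    `events` is a list of hypothetical usage dicts such as
    {"model": ..., "feature": ..., "tenant": ..., "tokens": ...}.
    """
    totals = defaultdict(int)
    for event in events:
        totals[event[key]] += event["tokens"]
    return dict(totals)


def crossed_thresholds(spend_usd, budget_usd):
    """Return every alert threshold (in percent) that spend has reached."""
    if budget_usd <= 0:
        raise ValueError("budget must be positive")
    # Compare spend * 100 against pct * budget to avoid a float division.
    return [
        pct for pct in ALERT_THRESHOLDS_PCT
        if spend_usd * 100 >= pct * budget_usd
    ]
```

Calling `tokens_by_key(events, "feature")` gives the per-feature totals used in the weekly cost review, and `crossed_thresholds` can drive one alert per threshold as spend climbs.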
See references/cost_capacity.md for unit economics worksheet.
5. Vendor and eval program
- Maintain scorecard: quality, latency, safety, price, data terms
- Run structured bake-offs before annual renewals
- Own the central eval harness and dataset hygiene
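
One way to make the scorecard comparable across vendors during a bake-off is a weighted average over the five criteria. This is only a sketch: the weighting scheme, the score scale, and the `weighted_score` function are assumptions, not part of the scorecard as described.

```python
# The five scorecard criteria listed above.
CRITERIA = ("quality", "latency", "safety", "price", "data_terms")


def weighted_score(scores, weights):
    """Weighted average of per-criterion scores.

    Both arguments are dicts keyed by criterion. The score scale
    (for example 0-5) and the weights are assumptions chosen by the team.
    """
    missing = [c for c in CRITERIA if c not in scores or c not in weights]
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    total_weight = sum(weights[c] for c in CRITERIA)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[c] * weights[c] for c in CRITERIA) / total_weight
```

Fixing the weights before the bake-off starts keeps the comparison from being tuned toward a preferred vendor after results are in.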
See references/vendor_eval_program.md for RFP topics and eval program maturity.
When to load references
- Team cadence and RACI → references/operating_model.md
- Releases and canaries → references/release_governance.md
- SLOs and incidents → references/incidents_slos.md
- Cost and capacity → references/cost_capacity.md
- Vendors and eval ops → references/vendor_eval_program.md