ai-lead-ops

AI Lead Ops

When to Use

Standing up AI platform operations and production service reliability
Defining SLAs/SLOs for LLM-powered features
Running AI incident reviews and post-mortems
Governing model, prompt, and index rollouts with tiered gates
Tracking AI unit economics (cost per session, tokens per feature)
Coordinating red-team and evaluation gates before releases
Building team rituals and cadence across engineering, research, risk, and product
Managing AI vendor relationships, contracts, and bake-offs

When NOT to Use

Implementing memory stores or context packing code → `ai-memory-developer` / `ai-context-engineer`
Building RAG pipelines or agent tools → `ai-engineer`
Designing corporate AI policy or regulatory mapping → `ai-risk-governance`
General network penetration testing or enterprise security programs → `cybersecurity`
Structured token/cost improvement roadmaps with backlog → `ai-token-improvement-plan-engineer`
Commercial/enterprise AI solution architecture → `applied-ai-architect-commercial-enterprise`
Vertical AI product engineering managers and squad roadmaps → `engineering-manager-vertical-ai-products`

Related skills

Need	Skill
Build RAG, agents, eval harnesses	`ai-engineer`
Memory and context implementation	`ai-memory-developer`, `ai-context-engineer`
Risk tiering and policies	`ai-risk-governance`
Adversarial testing execution	`ai-redteam`
CI/CD and platform incidents	`devops`
Pipeline security	`devsecops`
Token optimization roadmap and initiative backlog	`ai-token-improvement-plan-engineer`
Commercial/enterprise AI architecture	`applied-ai-architect-commercial-enterprise`
Skills portfolio governance	`ai-skill-manager`
Safeguard inference platform	`ml-infrastructure-engineer-safeguards`
Safety classifier research	`ml-research-engineer-safeguards`

Core Workflows

1. Operating model and cadence

Ritual	Frequency	Outcomes
AI ops standup	Daily	Blockers, incidents, deploys
Model/prompt change review	Per release	Approvers, eval delta
Cost review	Weekly	Spend vs budget, top features
Risk & safety sync	Bi-weekly	Incidents, policy gaps
Quarterly capacity	Quarterly	Model roadmap, vendor contracts

Define a RACI: who owns the model, prompt, index, eval suite, and on-call rotation.

See `references/operating_model.md` for roles and escalation.

2. Release governance

Production promotion checklist:

Eval regression passed on golden + safety set
Red-team sign-off for tier-2+ use cases
Model card / change log updated
Canary with error and cost monitors
Rollback procedure tested (previous prompt + model version pinned)
Comms plan for customer-visible behavior change
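
The checklist above can be expressed as an automated promotion gate. The following is a minimal sketch, not the contents of `references/release_governance.md`: the `ReleaseCandidate` fields, the `promotion_blockers` function, and the numeric `risk_tier` encoding are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ReleaseCandidate:
    # All field names are illustrative assumptions, not a defined schema.
    risk_tier: int                  # tier 2 and above needs red-team sign-off
    eval_regression_passed: bool    # golden + safety sets
    redteam_signed_off: bool
    change_log_updated: bool        # model card / change log
    canary_monitors_ready: bool     # error and cost monitors on the canary
    rollback_tested: bool           # previous prompt + model version pinned
    customer_visible_change: bool
    comms_plan_ready: bool


def promotion_blockers(rc: ReleaseCandidate) -> list[str]:
    """Return the checklist items that still block promotion to production."""
    blockers = []
    if not rc.eval_regression_passed:
        blockers.append("eval regression on golden + safety set")
    if rc.risk_tier >= 2 and not rc.redteam_signed_off:
        blockers.append("red-team sign-off")
    if not rc.change_log_updated:
        blockers.append("model card / change log")
    if not rc.canary_monitors_ready:
        blockers.append("canary error and cost monitors")
    if not rc.rollback_tested:
        blockers.append("rollback procedure")
    if rc.customer_visible_change and not rc.comms_plan_ready:
        blockers.append("comms plan")
    return blockers
```

An empty list means the candidate may be promoted; a CI job could fail the release step and print the returned items otherwise.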

See `references/release_governance.md` for tiered gates and canary metrics.

3. SLOs, incidents, and observability

Example SLIs:

SLI	Notes
Availability	Successful completion / total requests
Latency	p95 end-to-end
Quality proxy	Thumbs-down rate, escalation rate
Safety	Policy violation rate post-deploy
Cost	USD per successful session
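
To make the SLI definitions concrete, here is a rough sketch of how several of them might be computed over one reporting window. It assumes a hypothetical `RequestRecord` log shape in which each record is one session, and it uses the nearest-rank method for p95; a real pipeline would more likely pull these from a metrics store.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestRecord:
    # Hypothetical log shape: one record per session.
    succeeded: bool
    latency_ms: float       # end-to-end latency
    cost_usd: float
    thumbs_down: bool = False


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(len(ordered) * 95 / 100)
    return ordered[rank - 1]


def compute_slis(records):
    """Compute availability, latency, quality-proxy, and cost SLIs."""
    total = len(records)
    if total == 0:
        raise ValueError("no requests in window")
    successes = sum(r.succeeded for r in records)
    total_cost = sum(r.cost_usd for r in records)
    return {
        "availability": successes / total,
        "latency_p95_ms": p95([r.latency_ms for r in records]),
        "thumbs_down_rate": sum(r.thumbs_down for r in records) / total,
        # Failed sessions still cost money, so total spend is divided
        # by successful sessions only.
        "usd_per_successful_session": total_cost / successes if successes else None,
    }
```

Dividing total spend by successful sessions means retries and failures push the cost SLI up, which is usually the intended signal.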

AI incident types: toxic output, PII leak in logs, retrieval cross-tenant leak, runaway agent loop, vendor outage.

See `references/incidents_slos.md` for severity matrix and post-incident template.

4. Cost and capacity

Track tokens by model, feature, tenant
Set budgets, with alerts at 80%, 100%, and 110% of budget
Optimize via routing, caching, context engineering (partner with `ai-context-engineer`)
Forecast from usage growth + model price changes
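
As an illustration of the tracking and alerting items above, the sketch below aggregates spend by (model, feature, tenant) and reports which of the 80/100/110% thresholds have been newly crossed. The event tuple shape, the single blended price per 1,000 tokens, and both function names are simplifying assumptions; real pricing usually differs for input and output tokens.

```python
from collections import defaultdict

# Fractions of budget at which an alert fires (80%, 100%, 110%).
ALERT_THRESHOLDS = (0.80, 1.00, 1.10)


def spend_by_key(usage_events, price_per_1k_tokens):
    """Sum USD spend per (model, feature, tenant).

    usage_events: iterable of (model, feature, tenant, tokens) tuples.
    price_per_1k_tokens: dict of model name to blended USD price (assumed).
    """
    spend = defaultdict(float)
    for model, feature, tenant, tokens in usage_events:
        spend[(model, feature, tenant)] += tokens / 1000 * price_per_1k_tokens[model]
    return dict(spend)


def crossed_thresholds(spend_usd, budget_usd, already_alerted=()):
    """Return thresholds reached by current spend that have not alerted yet."""
    ratio = spend_usd / budget_usd
    return [t for t in ALERT_THRESHOLDS if ratio >= t and t not in already_alerted]
```

Passing the thresholds that already fired in `already_alerted` keeps a periodic job from re-sending the same alert on every run.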

See `references/cost_capacity.md` for unit economics worksheet.

5. Vendor and eval program

Maintain scorecard: quality, latency, safety, price, data terms
Run structured bake-offs before annual renewals
Own the central eval harness and dataset hygiene
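
One simple way to turn the scorecard into a bake-off ranking is a weighted average over the five criteria. The weights and the 1 to 5 scoring scale below are purely illustrative assumptions, not values taken from `references/vendor_eval_program.md`; each team would set its own.

```python
# Hypothetical weights; tune to the organization's priorities.
WEIGHTS = {
    "quality": 0.35,
    "latency": 0.15,
    "safety": 0.25,
    "price": 0.15,
    "data_terms": 0.10,
}


def weighted_score(scores, weights=WEIGHTS):
    """Weighted average of per-criterion scores (higher is better)."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    return sum(weights[c] * scores[c] for c in weights) / sum(weights.values())


def rank_vendors(scorecards, weights=WEIGHTS):
    """Return vendor names sorted from best to worst weighted score."""
    return sorted(
        scorecards,
        key=lambda vendor: weighted_score(scorecards[vendor], weights),
        reverse=True,
    )
```

Rejecting scorecards with missing criteria keeps a vendor from ranking well simply because a weak dimension was never scored.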

See `references/vendor_eval_program.md` for RFP topics and eval program maturity.

When to load references

Team cadence and RACI → `references/operating_model.md`
Releases and canaries → `references/release_governance.md`
SLOs and incidents → `references/incidents_slos.md`
Cost and capacity → `references/cost_capacity.md`
Vendors and eval ops → `references/vendor_eval_program.md`
