ai-lead-ops

AI Lead Ops

When to Use

  • Standing up AI platform operations and production service reliability
  • Defining SLAs/SLOs for LLM-powered features
  • Running AI incident reviews and post-mortems
  • Governing model, prompt, and index rollouts with tiered gates
  • Tracking AI unit economics (cost per session, tokens per feature)
  • Coordinating red-team and evaluation gates before releases
  • Building team rituals and cadence across engineering, research, risk, and product
  • Managing AI vendor relationships, contracts, and bake-offs

When NOT to Use

  • Implementing memory stores or context packing code → ai-memory-developer / ai-context-engineer
  • Building RAG pipelines or agent tools → ai-engineer
  • Designing corporate AI policy or regulatory mapping → ai-risk-governance
  • General network penetration testing or enterprise security programs → cybersecurity
  • Structured token/cost improvement roadmaps with backlog → ai-token-improvement-plan-engineer
  • Commercial/enterprise AI solution architecture → applied-ai-architect-commercial-enterprise
  • Vertical AI product engineering managers and squad roadmaps → engineering-manager-vertical-ai-products

Related skills

| Need | Skill |
| --- | --- |
| Build RAG, agents, eval harnesses | ai-engineer |
| Memory and context implementation | ai-memory-developer, ai-context-engineer |
| Risk tiering and policies | ai-risk-governance |
| Adversarial testing execution | ai-redteam |
| CI/CD and platform incidents | devops |
| Pipeline security | devsecops |
| Token optimization roadmap and initiative backlog | ai-token-improvement-plan-engineer |
| Commercial/enterprise AI architecture | applied-ai-architect-commercial-enterprise |
| Skills portfolio governance | ai-skill-manager |
| Safeguard inference platform | ml-infrastructure-engineer-safeguards |
| Safety classifier research | ml-research-engineer-safeguards |

Core Workflows

1. Operating model and cadence

| Ritual | Frequency | Outcomes |
| --- | --- | --- |
| AI ops standup | Daily | Blockers, incidents, deploys |
| Model/prompt change review | Per release | Approvers, eval delta |
| Cost review | Weekly | Spend vs budget, top features |
| Risk & safety sync | Bi-weekly | Incidents, policy gaps |
| Quarterly capacity | Quarterly | Model roadmap, vendor contracts |

Define RACI: who owns model, prompt, index, eval suite, on-call.
See references/operating_model.md for roles and escalation.
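
One way to keep the RACI honest is to store it as data and check it for gaps. The sketch below is illustrative only: the asset keys mirror the list above, while the role names and the `unowned_assets` helper are hypothetical and not taken from references/operating_model.md.

```python
# Assets named in the text above; role names are hypothetical placeholders.
ASSETS = ("model", "prompt", "index", "eval_suite", "on_call")

raci = {
    "model": {"accountable": "ml-platform-lead", "responsible": ["ml-engineer"]},
    "prompt": {"accountable": "product-eng-lead", "responsible": ["prompt-author"]},
    "index": {"accountable": "search-lead", "responsible": ["data-engineer"]},
    "eval_suite": {"accountable": "ai-ops-lead", "responsible": ["eval-engineer"]},
    "on_call": {"accountable": "ai-ops-lead", "responsible": ["on-call-rotation"]},
}


def unowned_assets(raci_map, assets=ASSETS):
    """Return the assets that have no accountable owner recorded."""
    return [a for a in assets if not raci_map.get(a, {}).get("accountable")]


print(unowned_assets(raci))
```

A check like this could run whenever the RACI file changes, so a new asset cannot ship without a named owner.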

2. Release governance

Production promotion checklist:
  • Eval regression passed on golden + safety set
  • Red-team sign-off for tier-2+ use cases
  • Model card / change log updated
  • Canary with error and cost monitors
  • Rollback procedure tested (previous prompt + model version pinned)
  • Comms plan for customer-visible behavior change
See references/release_governance.md for tiered gates and canary metrics.
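
The checklist above can be expressed as an automated promotion gate. This is a minimal sketch under stated assumptions: the `ReleaseCandidate` fields and `blocking_items` function are hypothetical names, tiers are assumed to be integers where "tier-2+" means `risk_tier >= 2`, and the comms plan is assumed to be required only for customer-visible changes.

```python
from dataclasses import dataclass


@dataclass
class ReleaseCandidate:
    """Hypothetical record of gate results for one model/prompt/index change."""
    risk_tier: int                 # assumption: "tier-2+" means risk_tier >= 2
    eval_regression_passed: bool   # golden + safety set
    redteam_signed_off: bool
    change_log_updated: bool       # model card / change log
    canary_monitors_ready: bool    # error and cost monitors
    rollback_tested: bool          # previous prompt + model version pinned
    customer_visible: bool
    comms_plan_ready: bool


def blocking_items(rc: ReleaseCandidate) -> list[str]:
    """Return the checklist items that still block promotion."""
    checks = [
        (rc.eval_regression_passed, "eval regression on golden + safety set"),
        (rc.risk_tier < 2 or rc.redteam_signed_off, "red-team sign-off (tier-2+)"),
        (rc.change_log_updated, "model card / change log updated"),
        (rc.canary_monitors_ready, "canary with error and cost monitors"),
        (rc.rollback_tested, "rollback procedure tested"),
        (not rc.customer_visible or rc.comms_plan_ready, "comms plan"),
    ]
    return [label for ok, label in checks if not ok]


candidate = ReleaseCandidate(
    risk_tier=2,
    eval_regression_passed=True,
    redteam_signed_off=False,
    change_log_updated=True,
    canary_monitors_ready=True,
    rollback_tested=True,
    customer_visible=False,
    comms_plan_ready=False,
)
print(blocking_items(candidate))
```

Returning the list of blockers rather than a single boolean lets the change review show exactly which gate is outstanding.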

3. SLOs, incidents, and observability

Example SLIs:
| SLI | Notes |
| --- | --- |
| Availability | Successful completion / total requests |
| Latency | p95 end-to-end |
| Quality proxy | Thumbs-down rate, escalation rate |
| Safety | Policy violation rate post-deploy |
| Cost | USD per successful session |

AI incident types: toxic output, PII leak in logs, retrieval cross-tenant leak, runaway agent loop, vendor outage.
See references/incidents_slos.md for severity matrix and post-incident template.
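
To make the SLI definitions concrete, here is a rough sketch of computing them from request logs. The `RequestRecord` shape is an assumption (one record per session, with feedback and policy flags already joined in), and the p95 uses the nearest-rank method, which is one of several common percentile definitions.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    """Hypothetical log row; one record is assumed to represent one session."""
    succeeded: bool
    latency_ms: float
    thumbs_down: bool
    escalated: bool
    policy_violation: bool
    cost_usd: float


def p95(values):
    """Nearest-rank 95th percentile, using integer math to avoid float rounding."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def compute_slis(records):
    """Compute the example SLIs over a non-empty window of records."""
    total = len(records)
    if total == 0:
        raise ValueError("no records in window")
    successes = sum(r.succeeded for r in records)
    total_cost = sum(r.cost_usd for r in records)
    return {
        "availability": successes / total,
        "latency_p95_ms": p95([r.latency_ms for r in records]),
        "thumbs_down_rate": sum(r.thumbs_down for r in records) / total,
        "escalation_rate": sum(r.escalated for r in records) / total,
        "policy_violation_rate": sum(r.policy_violation for r in records) / total,
        # All spend is divided by successful sessions only, so failures raise unit cost.
        "usd_per_successful_session": total_cost / successes if successes else None,
    }
```

Dividing total spend by successful sessions means failed requests show up as a higher unit cost instead of disappearing from the metric.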

4. Cost and capacity

  • Track tokens by model, feature, tenant
  • Set budgets and alerts at 80/100/110%
  • Optimize via routing, caching, context engineering (partner with ai-context-engineer)
  • Forecast from usage growth + model price changes
See references/cost_capacity.md for unit economics worksheet.
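
The token tracking and the 80/100/110% alert levels above could be sketched as follows. The row fields (`model`, `feature`, `tenant`, `tokens`) and both helper names are assumptions for illustration; real usage data would come from whatever metering the platform already emits.

```python
from collections import defaultdict

# Alert levels from the text above, as fractions of the budget.
ALERT_THRESHOLDS = (0.80, 1.00, 1.10)


def tokens_by(rows, key):
    """Sum token counts grouped by one dimension: 'model', 'feature', or 'tenant'."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row["tokens"]
    return dict(totals)


def crossed_thresholds(spend_usd, budget_usd, thresholds=ALERT_THRESHOLDS):
    """Return every alert threshold that current spend has reached."""
    if budget_usd <= 0:
        raise ValueError("budget_usd must be positive")
    ratio = spend_usd / budget_usd
    return [t for t in thresholds if ratio >= t]


# Hypothetical usage rows for illustration only.
usage = [
    {"model": "model-a", "feature": "search", "tenant": "t1", "tokens": 1200},
    {"model": "model-a", "feature": "chat", "tenant": "t2", "tokens": 800},
    {"model": "model-b", "feature": "chat", "tenant": "t1", "tokens": 500},
]
print(tokens_by(usage, "feature"))
print(crossed_thresholds(spend_usd=850.0, budget_usd=1000.0))
```

Returning all crossed thresholds, not just the highest, lets the caller decide whether to page at 110% while only notifying at 80%.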

5. Vendor and eval program

  • Maintain scorecard: quality, latency, safety, price, data terms
  • Run structured bake-offs before annual renewals
  • Own the central eval harness and dataset hygiene
See references/vendor_eval_program.md for RFP topics and eval program maturity.
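
A scorecard across the five criteria above can be reduced to a weighted score for bake-off comparisons. This is a sketch, not a prescribed method: the 1-5 scale, the weights, and the vendor names are all made up for illustration, and the criteria are assumed to be scored so that higher is better (for example, a cheaper vendor gets a higher `price` score).

```python
# Criteria taken from the scorecard bullet above.
CRITERIA = ("quality", "latency", "safety", "price", "data_terms")


def weighted_score(scores, weights):
    """Weighted average of per-criterion scores; higher is assumed to be better."""
    missing = [c for c in CRITERIA if c not in scores or c not in weights]
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    total_weight = sum(weights[c] for c in CRITERIA)
    return sum(scores[c] * weights[c] for c in CRITERIA) / total_weight


def rank_vendors(scorecards, weights):
    """Return vendor names ordered from best to worst weighted score."""
    return sorted(
        scorecards,
        key=lambda name: weighted_score(scorecards[name], weights),
        reverse=True,
    )


# Hypothetical weights and 1-5 scores for illustration only.
weights = {"quality": 4, "latency": 2, "safety": 3, "price": 2, "data_terms": 1}
scorecards = {
    "vendor-x": {"quality": 3, "latency": 3, "safety": 3, "price": 3, "data_terms": 3},
    "vendor-y": {"quality": 4, "latency": 4, "safety": 4, "price": 4, "data_terms": 4},
}
print(rank_vendors(scorecards, weights))
```

Agreeing on the weights before the bake-off runs helps keep the ranking from being tuned to a preferred vendor after the fact.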

When to load references

  • Team cadence and RACI: references/operating_model.md
  • Releases and canaries: references/release_governance.md
  • SLOs and incidents: references/incidents_slos.md
  • Cost and capacity: references/cost_capacity.md
  • Vendors and eval ops: references/vendor_eval_program.md