operating-production-services
Operating Production Services
Production reliability patterns: measure what matters, learn from failures, improve systematically.
Quick Reference
| Need | Go To |
|---|---|
| Define reliability targets | SLOs & Error Budgets |
| Write incident report | Postmortem Templates |
| Set up SLO alerting | references/slo-alerting.md |
SLOs & Error Budgets
The Hierarchy
SLA (Contract) → SLO (Target) → SLI (Measurement)
Common SLIs
```promql
# Availability: successful requests / total requests
sum(rate(http_requests_total{status!~"5.."}[28d]))
/
sum(rate(http_requests_total[28d]))

# Latency: requests below threshold / total requests
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
/
sum(rate(http_request_duration_seconds_count[28d]))
```
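Both SLIs above are the same shape: good events divided by total events. As a minimal sketch, assuming you already have the two counts for the window, the ratio can be computed like this. The function name `ratio_sli`, the sample counts, and the choice to treat a window with no traffic as fully compliant are illustrative assumptions, not part of the original queries.

```python
def ratio_sli(good: float, total: float) -> float:
    """Return good/total as a fraction between 0 and 1.

    Assumption: a window with zero traffic counts as fully compliant (1.0).
    """
    if good < 0 or total < 0 or good > total:
        raise ValueError("expected 0 <= good <= total")
    if total == 0:
        return 1.0
    return good / total


# Hypothetical counts for a 28-day window.
availability = ratio_sli(good=999_100, total=1_000_000)  # non-5xx / all requests
latency = ratio_sli(good=987_500, total=1_000_000)       # <= 0.5 s / all requests
print(f"availability={availability:.4%} latency={latency:.4%}")
```

How to handle a zero-traffic window is a policy choice; some teams prefer to report no data rather than 100%.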
SLO Targets Reality Check
| SLO % | Downtime/Month | Downtime/Year |
|---|---|---|
| 99% | 7.2 hours | 3.65 days |
| 99.9% | 43 minutes | 8.76 hours |
| 99.95% | 22 minutes | 4.38 hours |
| 99.99% | 4.3 minutes | 52 minutes |
Don't aim for 100%. Each nine costs exponentially more.
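The downtime figures in the table follow directly from the SLO percentage. A small sketch, assuming a 30-day month and a 365-day year (the function name `allowed_downtime_minutes` is hypothetical):

```python
def allowed_downtime_minutes(slo_percent: float, window_days: float = 30) -> float:
    """Minutes of downtime an SLO permits over a window of `window_days` days."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    return (1 - slo_percent / 100) * window_days * 24 * 60


for slo in (99.0, 99.9, 99.95, 99.99):
    per_month = allowed_downtime_minutes(slo)                    # 30-day month
    per_year = allowed_downtime_minutes(slo, window_days=365)    # 365-day year
    print(f"{slo}% -> {per_month:.1f} min/month, {per_year / 60:.2f} h/year")
```

Running it for your own target makes the cost of an extra nine concrete before you commit to it.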
Error Budget
Error Budget = 1 - SLO Target
Example: 99.9% SLO = 0.1% error budget = 43 minutes/month
Policy:
| Budget Remaining | Action |
|---|---|
| > 50% | Normal velocity |
| 10-50% | Postpone risky changes |
| < 10% | Freeze non-critical changes |
| 0% | Feature freeze, fix reliability |
See references/slo-alerting.md for Prometheus recording rules and multi-window burn rate alerts.
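The policy table can be applied mechanically once you know how much budget is left. The sketch below is one possible encoding, not the contents of references/slo-alerting.md. The function names are hypothetical, and treating exactly 50% as "Postpone risky changes" is an assumption, since the table does not say which row owns that boundary.

```python
def budget_remaining(slo_target: float, good: int, total: int) -> float:
    """Fraction of the error budget still unspent; negative means overspent.

    `slo_target` is a fraction, e.g. 0.999 for a 99.9% SLO.
    """
    allowed_bad = (1 - slo_target) * total
    if allowed_bad <= 0:
        raise ValueError("need slo_target < 1 and total > 0")
    actual_bad = total - good
    return 1 - actual_bad / allowed_bad


def policy_action(remaining: float) -> str:
    """Map the remaining budget fraction to the action in the policy table."""
    if remaining <= 0:
        return "Feature freeze, fix reliability"
    if remaining < 0.10:
        return "Freeze non-critical changes"
    if remaining <= 0.50:  # assumption: the 50% boundary belongs to this row
        return "Postpone risky changes"
    return "Normal velocity"


# Hypothetical month: 1,000,000 requests, 300 of them failed, 99.9% SLO.
remaining = budget_remaining(0.999, good=999_700, total=1_000_000)
print(f"{remaining:.0%} of budget left -> {policy_action(remaining)}")
```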
Postmortem Templates
The Blameless Principle
| Blame-Focused | Blameless |
|---|---|
| "Who caused this?" | "What conditions allowed this?" |
| Punish individuals | Improve systems |
| Hide information | Share learnings |
When to Write Postmortems
- SEV1/SEV2 incidents
- Customer-facing outages > 15 minutes
- Data loss or security incidents
- Near-misses that could have been severe
- Novel failure modes
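If incidents are tracked in a tool, the criteria above could be encoded as a simple check. This is only a sketch: the `Incident` fields are hypothetical, and judgement calls such as "could have been severe" still have to be made by a person who sets the flag.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    severity: int                        # 1 for SEV1, 2 for SEV2, ...
    customer_facing_outage_min: float = 0
    data_loss: bool = False
    security_incident: bool = False
    severe_near_miss: bool = False       # set by a human reviewer
    novel_failure_mode: bool = False     # set by a human reviewer


def needs_postmortem(incident: Incident) -> bool:
    """True if any of the listed criteria applies."""
    return (
        incident.severity <= 2
        or incident.customer_facing_outage_min > 15
        or incident.data_loss
        or incident.security_incident
        or incident.severe_near_miss
        or incident.novel_failure_mode
    )


print(needs_postmortem(Incident(severity=3, customer_facing_outage_min=20)))
```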
Standard Template
```markdown
# Postmortem: [Incident Title]

Date: YYYY-MM-DD | Duration: X min | Severity: SEVX

## Executive Summary

One paragraph: what happened, impact, root cause, resolution.

## Timeline (UTC)

| Time | Event |
|---|---|
| HH:MM | First alert fired |
| HH:MM | On-call acknowledged |
| HH:MM | Root cause identified |
| HH:MM | Fix deployed |
| HH:MM | Service recovered |

## Root Cause Analysis

### 5 Whys

1. Why did service fail? → [Answer]
2. Why did [1] happen? → [Answer]
3. Why did [2] happen? → [Answer]
4. Why did [3] happen? → [Answer]
5. Why did [4] happen? → [Root cause]

## Impact

- Customers affected: X
- Duration: X minutes
- Revenue impact: $X
- Support tickets: X

## Action Items

| Priority | Action | Owner | Due | Ticket |
|---|---|---|---|---|
| P0 | [Immediate fix] | @name | Date | XXX-123 |
| P1 | [Prevent recurrence] | @name | Date | XXX-124 |
| P2 | [Improve detection] | @name | Date | XXX-125 |
```
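The timeline rows in the template can be turned into the durations that usually matter in review: time to acknowledge, time to diagnose, and time to recover. A minimal sketch, assuming `HH:MM` UTC strings and an incident that crosses midnight at most once (the helper name and the sample times are made up for illustration):

```python
from datetime import datetime, timedelta


def minutes_between(start: str, end: str) -> int:
    """Whole minutes from `start` to `end`, both given as HH:MM."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    if delta < timedelta(0):
        delta += timedelta(days=1)  # assumption: the incident crossed midnight once
    return int(delta.total_seconds() // 60)


# Hypothetical timeline, keyed by the event names from the template.
timeline = {
    "First alert fired": "14:02",
    "On-call acknowledged": "14:07",
    "Root cause identified": "14:31",
    "Fix deployed": "14:40",
    "Service recovered": "14:46",
}

start = timeline["First alert fired"]
print("Time to acknowledge:", minutes_between(start, timeline["On-call acknowledged"]), "min")
print("Time to diagnose:", minutes_between(start, timeline["Root cause identified"]), "min")
print("Time to recover:", minutes_between(start, timeline["Service recovered"]), "min")
```

For incidents lasting longer than a day, full timestamps would be needed instead of `HH:MM`.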
Quick Template (Minor Incidents)
```markdown
# Quick Postmortem: [Title]

Date: YYYY-MM-DD | Duration: X min | Severity: SEV3

## What Happened

One sentence description.

## Timeline

- HH:MM - Trigger
- HH:MM - Detection
- HH:MM - Resolution

## Root Cause

One sentence.

## Fix

- Immediate: [What was done]
- Long-term: [Ticket XXX-123]
```
Postmortem Meeting Guide
Structure (60 min)
- Opening (5 min) - Remind: "We're here to learn, not blame"
- Timeline (15 min) - Walk through events chronologically
- Analysis (20 min) - What failed? Why? What allowed it?
- Action Items (15 min) - Prioritize, assign owners, set dates
- Closing (5 min) - Summarize learnings, confirm owners
Facilitation Tips
- Redirect blame to systems: "What made this mistake possible?"
- Time-box tangents
- Document dissenting views
- Encourage quiet participants
Anti-Patterns
| Don't | Do Instead |
|---|---|
| Aim for 100% SLO | Accept error budget exists |
| Skip small incidents | Small incidents reveal patterns |
| Orphan action items | Every item needs owner + date + ticket |
| Blame individuals | Ask "what conditions allowed this?" |
| Create busywork actions | Actions should prevent recurrence |
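The "orphan action items" rule is easy to check automatically before a postmortem is closed. The sketch below is an illustration only and is not the contents of scripts/verify.py. The `ActionItem` type and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ActionItem:
    priority: str
    action: str
    owner: str = ""
    due: str = ""
    ticket: str = ""


def missing_fields(item: ActionItem) -> list[str]:
    """Names of the required fields (owner, due, ticket) that are blank."""
    required = ("owner", "due", "ticket")
    return [name for name in required if not getattr(item, name).strip()]


items = [
    ActionItem("P0", "Roll back config change", "@alice", "2024-01-15", "OPS-101"),
    ActionItem("P1", "Add canary stage", owner="@bob"),  # no date or ticket yet
]

for item in items:
    gaps = missing_fields(item)
    if gaps:
        print(f"Orphan action item '{item.action}': missing {', '.join(gaps)}")
```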
Verification
Run:
python scripts/verify.py
References
- references/slo-alerting.md - Prometheus rules, burn rate alerts, Grafana dashboards