Operating Production Services

Production reliability patterns: measure what matters, learn from failures, improve systematically.

Quick Reference

| Need | Go To |
| --- | --- |
| Define reliability targets | SLOs & Error Budgets |
| Write incident report | Postmortem Templates |
| Set up SLO alerting | references/slo-alerting.md |

SLOs & Error Budgets

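The Hierarchy

SLA (Contract) → SLO (Target) → SLI (Measurement)

The SLA is the external commitment to customers, the SLO is the internal target that keeps the service inside that commitment, and the SLI is the metric measured against the SLO.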

Common SLIs

Both expressions are PromQL ratios evaluated over a 28-day window.

Availability: successful requests / total requests

    sum(rate(http_requests_total{status!~"5.."}[28d])) / sum(rate(http_requests_total[28d]))

Latency: requests at or below the threshold (0.5 s here) / total requests

    sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d])) / sum(rate(http_request_duration_seconds_count[28d]))

SLO Targets Reality Check

| SLO % | Downtime/Month | Downtime/Year |
| --- | --- | --- |
| 99% | 7.2 hours | 3.65 days |
| 99.9% | 43 minutes | 8.76 hours |
| 99.95% | 22 minutes | 4.38 hours |
| 99.99% | 4.3 minutes | 52 minutes |

Don't aim for 100%. Each nine costs exponentially more.
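
The downtime figures for each SLO follow from (1 - SLO) multiplied by the length of the period. The sketch below reproduces them, assuming a 30-day month and a 365-day year; the function name `allowed_downtime_minutes` is illustrative and not part of any referenced script.

```python
def allowed_downtime_minutes(slo_percent: float, period_days: float) -> float:
    """Minutes of downtime permitted over `period_days` at the given SLO."""
    if not 0 <= slo_percent <= 100:
        raise ValueError("slo_percent must be between 0 and 100")
    period_minutes = period_days * 24 * 60
    return (1 - slo_percent / 100) * period_minutes


for slo in (99.0, 99.9, 99.95, 99.99):
    month = allowed_downtime_minutes(slo, 30)   # assumption: 30-day month
    year = allowed_downtime_minutes(slo, 365)   # assumption: 365-day year
    print(f"{slo}%: {month:.1f} min/month, {year / 60:.2f} h/year")
```

Running the same function with a different period, for example a 28-day rolling window, gives the budget that matches the window used in the SLI queries.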

Error Budget

Error Budget = 1 - SLO Target

Example: 99.9% SLO = 0.1% error budget = 43 minutes/month

Policy:

| Budget Remaining | Action |
| --- | --- |
| > 50% | Normal velocity |
| 10-50% | Postpone risky changes |
| < 10% | Freeze non-critical changes |
| 0% | Feature freeze, fix reliability |

See references/slo-alerting.md for Prometheus recording rules and multi-window burn rate alerts.
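
The error budget policy can be expressed as a small lookup. This is a minimal sketch, assuming a request-based budget (failed requests against the failures the SLO allows) rather than a time-based one. The function names are hypothetical, and the handling of the exact 10% and 50% boundaries is an assumption, since the policy does not specify it.

```python
def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent, clamped to [0, 1]."""
    allowed_failures = (1 - slo_target) * total_requests
    if allowed_failures <= 0:
        raise ValueError("SLO target must be below 1.0 and traffic must be non-zero")
    return min(1.0, max(0.0, 1 - failed_requests / allowed_failures))


def policy_action(remaining: float) -> str:
    """Map the remaining budget fraction to the action in the policy."""
    if remaining <= 0:
        return "Feature freeze, fix reliability"
    if remaining < 0.10:
        return "Freeze non-critical changes"
    if remaining <= 0.50:   # assumption: exactly 10% and 50% fall in this band
        return "Postpone risky changes"
    return "Normal velocity"


# Example: 99.9% SLO, 1,000,000 requests, 700 failures (illustrative numbers)
remaining = budget_remaining(0.999, 1_000_000, 700)
print(f"{remaining:.0%} of budget left: {policy_action(remaining)}")
```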

Postmortem Templates

The Blameless Principle

| Blame-Focused | Blameless |
| --- | --- |
| "Who caused this?" | "What conditions allowed this?" |
| Punish individuals | Improve systems |
| Hide information | Share learnings |

When to Write Postmortems

  • SEV1/SEV2 incidents
  • Customer-facing outages > 15 minutes
  • Data loss or security incidents
  • Near-misses that could have been severe
  • Novel failure modes
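
The criteria above can be checked mechanically when triaging a closed incident. The sketch below is one possible encoding; the `Incident` fields and the `postmortem_reasons` helper are hypothetical names, and whether a near-miss "could have been severe" remains a human judgment recorded as a flag.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Incident:
    severity: int                        # 1 for SEV1, 2 for SEV2, and so on
    customer_outage_minutes: float = 0
    data_loss: bool = False
    security_incident: bool = False
    severe_near_miss: bool = False       # judged by the responder
    novel_failure_mode: bool = False


def postmortem_reasons(incident: Incident) -> List[str]:
    """Return every criterion the incident meets; an empty list means none apply."""
    reasons = []
    if incident.severity in (1, 2):
        reasons.append("SEV1/SEV2 incident")
    if incident.customer_outage_minutes > 15:
        reasons.append("Customer-facing outage > 15 minutes")
    if incident.data_loss or incident.security_incident:
        reasons.append("Data loss or security incident")
    if incident.severe_near_miss:
        reasons.append("Near-miss that could have been severe")
    if incident.novel_failure_mode:
        reasons.append("Novel failure mode")
    return reasons


print(postmortem_reasons(Incident(severity=3, customer_outage_minutes=20)))
```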

Standard Template

    # Postmortem: [Incident Title]

    Date: YYYY-MM-DD | Duration: X min | Severity: SEVX

    ## Executive Summary

    One paragraph: what happened, impact, root cause, resolution.

    ## Timeline (UTC)

    | Time | Event |
    | --- | --- |
    | HH:MM | First alert fired |
    | HH:MM | On-call acknowledged |
    | HH:MM | Root cause identified |
    | HH:MM | Fix deployed |
    | HH:MM | Service recovered |

    ## Root Cause Analysis

    ### 5 Whys

    1. Why did the service fail? → [Answer]
    2. Why did [1] happen? → [Answer]
    3. Why did [2] happen? → [Answer]
    4. Why did [3] happen? → [Answer]
    5. Why did [4] happen? → [Root cause]

    ## Impact

    - Customers affected: X
    - Duration: X minutes
    - Revenue impact: $X
    - Support tickets: X

    ## Action Items

    | Priority | Action | Owner | Due | Ticket |
    | --- | --- | --- | --- | --- |
    | P0 | [Immediate fix] | @name | Date | XXX-123 |
    | P1 | [Prevent recurrence] | @name | Date | XXX-124 |
    | P2 | [Improve detection] | @name | Date | XXX-125 |
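
The Duration field and the response intervals can be derived from the timeline entries instead of being estimated by hand. The following is a minimal sketch, assuming HH:MM timestamps in UTC and an incident shorter than 24 hours; the timeline values and the `minutes_between` helper are hypothetical.

```python
from datetime import datetime, timedelta


def minutes_between(start: str, end: str) -> int:
    """Minutes from `start` to `end`, both "HH:MM"; rolls over midnight once."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    if delta < timedelta(0):
        delta += timedelta(days=1)   # the end time falls on the next day
    return int(delta.total_seconds() // 60)


# Hypothetical timeline, keyed by the event names used in the template
timeline = {
    "First alert fired": "14:02",
    "On-call acknowledged": "14:07",
    "Root cause identified": "14:31",
    "Fix deployed": "14:40",
    "Service recovered": "14:48",
}

time_to_acknowledge = minutes_between(timeline["First alert fired"], timeline["On-call acknowledged"])
time_to_recover = minutes_between(timeline["First alert fired"], timeline["Service recovered"])
print(f"Time to acknowledge: {time_to_acknowledge} min")
print(f"Time to recover: {time_to_recover} min")
```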

Quick Template (Minor Incidents)

    # Quick Postmortem: [Title]

    Date: YYYY-MM-DD | Duration: X min | Severity: SEV3

    ## What Happened

    One-sentence description.

    ## Timeline

    - HH:MM - Trigger
    - HH:MM - Detection
    - HH:MM - Resolution

    ## Root Cause

    One sentence.

    ## Fix

    - Immediate: [What was done]
    - Long-term: [Ticket XXX-123]

---

Postmortem Meeting Guide

Structure (60 min)

  1. Opening (5 min) - Remind: "We're here to learn, not blame"
  2. Timeline (15 min) - Walk through events chronologically
  3. Analysis (20 min) - What failed? Why? What allowed it?
  4. Action Items (15 min) - Prioritize, assign owners, set dates
  5. Closing (5 min) - Summarize learnings, confirm owners

Facilitation Tips

  • Redirect blame to systems: "What made this mistake possible?"
  • Time-box tangents
  • Document dissenting views
  • Encourage quiet participants

Anti-Patterns

| Don't | Do Instead |
| --- | --- |
| Aim for 100% SLO | Accept error budget exists |
| Skip small incidents | Small incidents reveal patterns |
| Orphan action items | Every item needs owner + date + ticket |
| Blame individuals | Ask "what conditions allowed this?" |
| Create busywork actions | Actions should prevent recurrence |
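
The rule that every action item needs an owner, a date, and a ticket is easy to check automatically before a postmortem is closed. This is a sketch under assumed conventions: action items are plain dictionaries with `owner`, `due`, and `ticket` keys, and the sample items are invented for illustration.

```python
from typing import Dict, List

REQUIRED_FIELDS = ("owner", "due", "ticket")


def orphaned_fields(item: Dict[str, str]) -> List[str]:
    """Return the required fields that are missing or blank for one action item."""
    return [field for field in REQUIRED_FIELDS if not item.get(field, "").strip()]


# Hypothetical action items
items = [
    {"action": "Add connection pool alert", "owner": "@alice", "due": "2025-01-15", "ticket": "XXX-123"},
    {"action": "Document failover steps", "owner": "", "due": "2025-01-20"},
]

for item in items:
    missing = orphaned_fields(item)
    if missing:
        print(f"{item['action']}: missing {', '.join(missing)}")
```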

Verification

Run:

    python scripts/verify.py

References

  • references/slo-alerting.md - Prometheus rules, burn rate alerts, Grafana dashboards