operating-production-services

Operating Production Services

Production reliability patterns: measure what matters, learn from failures, improve systematically.

Quick Reference

Need	Go To
Define reliability targets	SLOs & Error Budgets
Write incident report	Postmortem Templates
Set up SLO alerting	references/slo-alerting.md

SLOs & Error Budgets

The Hierarchy

SLA (Contract) → SLO (Target) → SLI (Measurement)

Common SLIs

```promql
# Availability: successful requests / total requests
sum(rate(http_requests_total{status!~"5.."}[28d])) / sum(rate(http_requests_total[28d]))

# Latency: requests below threshold / total requests
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d])) / sum(rate(http_request_duration_seconds_count[28d]))
```

SLO Targets Reality Check

SLO %	Downtime/Month	Downtime/Year
99%	7.2 hours	3.65 days
99.9%	43 minutes	8.76 hours
99.95%	22 minutes	4.38 hours
99.99%	4.3 minutes	52 minutes

Don't aim for 100%. Each additional nine leaves a tenth of the allowed downtime and costs far more to achieve than the last.
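
The figures in the table can be reproduced with a few lines of arithmetic. The sketch below assumes a 30-day month and a 365-day year, which matches the table's rounding. The names `allowed_downtime`, `MONTH`, and `YEAR` are illustrative and not part of this skill's scripts.

```python
from datetime import timedelta

# Assumed window lengths; the table's figures are consistent with these.
MONTH = timedelta(days=30)
YEAR = timedelta(days=365)


def allowed_downtime(slo_percent: float, window: timedelta) -> timedelta:
    """Return the total downtime an SLO permits over the given window."""
    if not 0 < slo_percent < 100:
        raise ValueError("slo_percent must be between 0 and 100, exclusive")
    error_budget = 1 - slo_percent / 100  # e.g. 99.9 leaves roughly 0.001
    return window * error_budget


if __name__ == "__main__":
    for slo in (99.0, 99.9, 99.95, 99.99):
        per_month = allowed_downtime(slo, MONTH)
        per_year = allowed_downtime(slo, YEAR)
        print(f"{slo}%: {per_month} per month, {per_year} per year")
```

Running the script prints one line per SLO target, which can be compared against the table when choosing a different window such as 28 days.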

Error Budget

错误预算

Error Budget = 1 - SLO Target

Example: a 99.9% SLO leaves a 0.1% error budget, which is about 43 minutes of downtime in a 30-day month.

Policy:

Budget Remaining	Action
> 50%	Normal velocity
10-50%	Postpone risky changes
< 10%	Freeze non-critical changes
0%	Feature freeze, fix reliability

See references/slo-alerting.md for Prometheus recording rules and multi-window burn rate alerts.
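
As a rough illustration of the policy table, the sketch below computes the remaining budget from request counts and maps it to an action. The function names are hypothetical. The boundary handling (exactly 10% and exactly 50% both fall under "Postpone risky changes") is an assumption, since the table does not say which side the boundaries belong to.

```python
def budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget left: 1.0 is untouched, <= 0.0 is exhausted."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be a fraction between 0 and 1, exclusive")
    if total_events == 0:
        return 1.0  # no traffic means nothing has been spent
    allowed_bad = (1 - slo_target) * total_events
    actual_bad = total_events - good_events
    return 1 - actual_bad / allowed_bad


def policy_action(remaining: float) -> str:
    """Map the remaining budget fraction to the action in the policy table."""
    if remaining <= 0:
        return "Feature freeze, fix reliability"
    if remaining < 0.10:
        return "Freeze non-critical changes"
    if remaining <= 0.50:  # assumed: boundaries belong to the 10-50% row
        return "Postpone risky changes"
    return "Normal velocity"


if __name__ == "__main__":
    # 600 failed requests out of 1,000,000 against a 99.9% SLO (budget: ~1,000)
    remaining = budget_remaining(0.999, good_events=999_400, total_events=1_000_000)
    print(f"{remaining:.0%} of budget left: {policy_action(remaining)}")
```

Counting events rather than minutes keeps the calculation aligned with the request-based SLIs defined earlier; a time-based variant would follow the same shape.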

Postmortem Templates

The Blameless Principle

Blame-Focused	Blameless
"Who caused this?"	"What conditions allowed this?"
Punish individuals	Improve systems
Hide information	Share learnings

When to Write Postmortems

SEV1/SEV2 incidents
Customer-facing outages > 15 minutes
Data loss or security incidents
Near-misses that could have been severe
Novel failure modes
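
The criteria above could be encoded as a simple check, for example in an incident-tracking script. This is a minimal sketch: the `Incident` fields and the `postmortem_reasons` function are hypothetical names chosen for illustration, and only the 15-minute threshold and the SEV1/SEV2 rule come from the list.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative."""
    severity: int  # 1 means SEV1, 2 means SEV2, and so on
    customer_facing_outage_minutes: float = 0.0
    data_loss: bool = False
    security_incident: bool = False
    severe_near_miss: bool = False
    novel_failure_mode: bool = False


def postmortem_reasons(incident: Incident) -> list[str]:
    """Return every criterion the incident meets; an empty list means none apply."""
    reasons = []
    if incident.severity in (1, 2):
        reasons.append("SEV1/SEV2 incident")
    if incident.customer_facing_outage_minutes > 15:
        reasons.append("Customer-facing outage > 15 minutes")
    if incident.data_loss or incident.security_incident:
        reasons.append("Data loss or security incident")
    if incident.severe_near_miss:
        reasons.append("Near-miss that could have been severe")
    if incident.novel_failure_mode:
        reasons.append("Novel failure mode")
    return reasons


if __name__ == "__main__":
    incident = Incident(severity=3, customer_facing_outage_minutes=20)
    print(postmortem_reasons(incident))
```

Returning the list of reasons rather than a bare boolean lets the postmortem record why it was opened.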

Standard Template

Postmortem: [Incident Title]

Date: YYYY-MM-DD | Duration: X min | Severity: SEVX

Executive Summary

One paragraph: what happened, impact, root cause, resolution.

Timeline (UTC)

Time	Event
HH:MM	First alert fired
HH:MM	On-call acknowledged
HH:MM	Root cause identified
HH:MM	Fix deployed
HH:MM	Service recovered

Root Cause Analysis

5 Whys

1. Why did the service fail? → [Answer]
2. Why did [1] happen? → [Answer]
3. Why did [2] happen? → [Answer]
4. Why did [3] happen? → [Answer]
5. Why did [4] happen? → [Root cause]

Impact

Customers affected: X
Duration: X minutes
Revenue impact: $X
Support tickets: X

Action Items

Priority	Action	Owner	Due	Ticket
P0	[Immediate fix]	@name	Date	XXX-123
P1	[Prevent recurrence]	@name	Date	XXX-124
P2	[Improve detection]	@name	Date	XXX-125

Quick Template (Minor Incidents)

Quick Postmortem: [Title]

Date: YYYY-MM-DD | Duration: X min | Severity: SEV3

What Happened

One sentence description.

Timeline

HH:MM - Trigger
HH:MM - Detection
HH:MM - Resolution

Root Cause

One sentence.

Fix

Immediate: [What was done]
Long-term: [Ticket XXX-123]

---

Postmortem Meeting Guide

Structure (60 min)

Opening (5 min) - Remind: "We're here to learn, not blame"
Timeline (15 min) - Walk through events chronologically
Analysis (20 min) - What failed? Why? What allowed it?
Action Items (15 min) - Prioritize, assign owners, set dates
Closing (5 min) - Summarize learnings, confirm owners

Facilitation Tips

Redirect blame to systems: "What made this mistake possible?"
Time-box tangents
Document dissenting views
Encourage quiet participants

Anti-Patterns

Don't	Do Instead
Aim for 100% SLO	Accept error budget exists
Skip small incidents	Small incidents reveal patterns
Orphan action items	Every item needs owner + date + ticket
Blame individuals	Ask "what conditions allowed this?"
Create busywork actions	Actions should prevent recurrence
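
The rule that every action item needs an owner, a date, and a ticket is easy to check mechanically before a postmortem is closed. The sketch below is one possible shape for such a check. `ActionItem` and `missing_fields` are hypothetical names, and the fields mirror the columns of the Action Items table in the standard template.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    """Hypothetical action-item record mirroring the template's table columns."""
    priority: str
    action: str
    owner: Optional[str] = None
    due: Optional[date] = None
    ticket: Optional[str] = None


def missing_fields(item: ActionItem) -> list[str]:
    """List which of owner, due, and ticket are absent; empty means not orphaned."""
    missing = []
    if not item.owner:
        missing.append("owner")
    if item.due is None:
        missing.append("due")
    if not item.ticket:
        missing.append("ticket")
    return missing


if __name__ == "__main__":
    items = [
        ActionItem("P0", "[Immediate fix]", "@name", date(2025, 1, 31), "XXX-123"),
        ActionItem("P1", "[Prevent recurrence]", owner="@name"),
    ]
    for item in items:
        gaps = missing_fields(item)
        status = "ok" if not gaps else "missing " + ", ".join(gaps)
        print(f"{item.priority} {item.action}: {status}")
```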

Verification

Run:

python scripts/verify.py

References

参考资料

references/slo-alerting.md - Prometheus rules, burn rate alerts, Grafana dashboards
