create-postmortem
Create Postmortem
Overview
Generate a blameless postmortem following Google SRE principles, combining 5 Whys root cause analysis with the Swiss cheese model for safety barrier analysis. Postmortems focus on systemic contributing factors rather than a single root cause, and refer to roles rather than individuals to maintain a blameless culture.
Workflow
- Read incident context: Scan `.chalk/docs/engineering/` for incident reports matching the described incident. If an incident report exists, use its timeline, impact data, and preliminary root cause as input. Also check for previous postmortems to identify recurring patterns.
- Parse the incident: From `$ARGUMENTS` and any linked incident report, extract what happened, when, what was affected, and how it was resolved. If details are insufficient, ask one round of clarifying questions covering the timeline, impact scope, and resolution steps.
- Verify blamelessness: Before writing, review all input for named individuals. Replace names with roles (e.g., "the on-call engineer" instead of "Alice"). The postmortem describes systems, processes, and decisions, not people.
- Identify contributing factors: Avoid naming a single "root cause." Most incidents result from multiple contributing factors that align (Swiss cheese model). Identify each layer that failed: process, tooling, monitoring, testing, communication, architecture.
- Run 5 Whys per contributing factor: For each contributing factor, ask "Why?" five times to trace from the symptom to the systemic issue. Stop when you reach a factor that is actionable and systemic, not when you reach a person's decision.
- Analyze safety barriers: Using the Swiss cheese model, document which safety barriers existed and which failed. Categories: Detection (monitoring, alerting), Prevention (code review, testing, feature flags), Mitigation (rollback, circuit breakers, graceful degradation), Communication (status pages, incident channels, escalation paths).
- Check for recurrence: Search `.chalk/docs/engineering/` for previous postmortems or incident reports with similar contributing factors. If this is a recurring theme, flag it explicitly and reference previous documents.
- Write action items: Every action item must be categorized as Detect, Prevent, or Mitigate. Each must have an owner (role or team, not individual name) and a due date. Prioritize actions that address the deepest systemic issues, not just the immediate trigger.
- Document what went well: Identify aspects of the response that worked: fast detection, effective communication, successful rollback, team coordination. This section is mandatory.
- Determine the next file number: List the files in `.chalk/docs/engineering/` to find the highest-numbered file. Increment by 1.
- Write the file: Save to `.chalk/docs/engineering/<n>_postmortem_<incident>.md`.
- Confirm: Present the postmortem with a summary of contributing factors, the number of action items, and any recurrence warnings.
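The "determine the next file number" and "write the file" steps can be sketched in Python. This is a minimal sketch under stated assumptions: existing files are assumed to start with a decimal number followed by an underscore, numbering is assumed to start at 1 when the directory is empty or missing, and the names `next_file_number`, `postmortem_path`, and the slug rule are hypothetical, not part of the skill.

```python
import re
from pathlib import Path

DOCS_DIR = Path(".chalk/docs/engineering")  # directory named in the workflow

# Assumption: numbered files look like "<n>_<anything>",
# e.g. "7_postmortem_db-outage.md".
NUMBERED = re.compile(r"^(\d+)_")


def next_file_number(docs_dir: Path = DOCS_DIR) -> int:
    """Return the highest leading file number in docs_dir plus 1 (1 if none)."""
    if not docs_dir.is_dir():
        return 1
    numbers = [
        int(m.group(1))
        for p in docs_dir.iterdir()
        if p.is_file() and (m := NUMBERED.match(p.name))
    ]
    return max(numbers, default=0) + 1


def postmortem_path(incident: str, docs_dir: Path = DOCS_DIR) -> Path:
    """Build <n>_postmortem_<incident>.md; the slug rule is a hypothetical choice."""
    slug = re.sub(r"[^a-z0-9]+", "-", incident.lower()).strip("-")
    return docs_dir / f"{next_file_number(docs_dir)}_postmortem_{slug}.md"
```

Comparing the numbers as integers rather than as strings matters once the directory holds ten or more files, since `"9_..."` sorts after `"12_..."` lexicographically.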
Postmortem Structure
Postmortem: <Incident Title>
Date of Incident: <YYYY-MM-DD>
Postmortem Date: <YYYY-MM-DD>
Status: Draft | Reviewed | Final
Severity: SEV-1 | SEV-2 | SEV-3 | SEV-4
Related Incident Report: <link or filename, if available>
Summary
<2-3 sentences: What happened, what was the impact, and how was it resolved. Written for someone encountering this postmortem without prior context.>
Impact
| Dimension | Measurement |
|---|---|
| Users Affected | <number or percentage> |
| Duration (user-facing) | <total time users experienced the issue> |
| Revenue Impact | <estimated amount or "not measurable"> |
| Data Impact | <records affected or "no data impact"> |
| SLA Breach | <Yes — details / No> |
Timeline
All times in <timezone>.
| Time | Event |
|---|---|
| HH:MM | <event description> |
| HH:MM | <event description> |
Contributing Factors
There is rarely a single root cause. The following factors combined to produce this incident.
Factor 1: <Name>
<Description of this contributing factor and how it contributed to the incident.>
5 Whys
- Why <symptom>? → <answer>
- Why <answer>? → <deeper answer>
- Why <deeper answer>? → <systemic issue>
- Why <systemic issue>? → <organizational gap>
- Why <organizational gap>? → <actionable root>
Factor 2: <Name>
<Description and 5 Whys for this factor.>
What Safety Barriers Failed
Using the Swiss cheese model: each barrier is a layer of defense. When holes in multiple layers align, incidents occur.
Detection
- <What monitoring or alerting should have caught this? Did alerts fire? Were they actionable?>
Prevention
- <What process or tooling should have prevented this from reaching production? Code review, testing, feature flags, validation?>
Mitigation
- <What mechanisms should have limited the blast radius? Rollback, circuit breakers, rate limiting, graceful degradation?>
Communication
- <Was the right information communicated to the right people at the right time? Status pages, incident channels, customer communication?>
Action Items
Detect
| ID | Action | Owner | Due Date |
|---|---|---|---|
| D-1 | <action> | <team or role> | <YYYY-MM-DD> |
Prevent
| ID | Action | Owner | Due Date |
|---|---|---|---|
| P-1 | <action> | <team or role> | <YYYY-MM-DD> |
Mitigate
| ID | Action | Owner | Due Date |
|---|---|---|---|
| M-1 | <action> | <team or role> | <YYYY-MM-DD> |
What Went Well
- <Positive aspects of the incident response>
- <Practices that should be reinforced>
Recurrence Check
<Have similar incidents occurred before? If yes, reference the previous postmortem/incident report and explain why prior action items did not prevent recurrence. If no prior incidents, state that explicitly.>
Output
- File: `.chalk/docs/engineering/<n>_postmortem_<incident>.md`
- Format: Plain markdown, no YAML frontmatter
- First line: `# Postmortem: <Incident Title>`
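These output rules are simple enough to check mechanically before confirming. The following is a minimal sketch, assuming YAML frontmatter is recognized by a leading `---` line and that any non-empty text is acceptable in the `<incident>` part of the filename; `check_output` and its messages are hypothetical, not part of the skill.

```python
import re
from pathlib import Path

# Assumption: <n> is a decimal number and <incident> is any non-empty text.
FILENAME = re.compile(r"\d+_postmortem_.+\.md")


def check_output(path: Path, text: str) -> list[str]:
    """Return the output rules that the file violates (empty list if none)."""
    problems = []
    if not FILENAME.fullmatch(path.name):
        problems.append("filename is not <n>_postmortem_<incident>.md")
    lines = text.splitlines()
    first = lines[0] if lines else ""
    # Assumption: YAML frontmatter is detected by a leading '---' line.
    if first.strip() == "---":
        problems.append("file starts with YAML frontmatter")
    if not first.startswith("# Postmortem: "):
        problems.append("first line is not '# Postmortem: <Incident Title>'")
    return problems
```

An empty list means the file passes all three checks; the function only inspects the name and the text, so it can run before anything is written to disk.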
Anti-patterns
- Blame language — "The deploy engineer pushed broken code" assigns personal blame. "A deployment introduced a regression that was not caught by existing test coverage" describes the system failure. Refer to roles, not names. Focus on what the system allowed, not what a person did.
- Single root cause — Declaring one root cause oversimplifies complex system failures. Incidents almost always have multiple contributing factors. If you find yourself writing "Root Cause:" (singular), stop and look for what else had to be true for this incident to occur.
- Action items without owners or due dates — "We should improve monitoring" is a wish. "Add latency alerting for payment service P99 > 2s — Owner: Platform Team — Due: 2024-12-20" is an action item. Every action item must have both an owner and a due date.
- No "what went well" section — Postmortems that only catalog failures demoralize teams and discourage incident reporting. Always acknowledge what worked: fast detection, effective rollback, clear communication, team coordination.
- Not checking for recurrence — If a similar incident happened before, the most important question is: why didn't the previous action items prevent this one? Failing to check previous postmortems misses the systemic pattern.
- Shallow 5 Whys — Stopping at "because the test was missing" instead of asking why the testing process didn't require that test. Keep asking until you reach a systemic or organizational factor that can be addressed with a process or tooling change.
- Postmortem without follow-through — Writing a thorough postmortem and never completing the action items is worse than not writing one at all. Action items must be tracked to completion.
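A few of these anti-patterns can be caught mechanically before human review. The sketch below is a rough lint, assuming the report follows the template in Postmortem Structure (action-item rows shaped like `| D-1 | action | owner | YYYY-MM-DD |`); `lint_postmortem` and its messages are hypothetical. Blame language and shallow 5 Whys still need a human reader.

```python
import re
from datetime import date

# Assumption: action-item rows look like "| D-1 | action | owner | YYYY-MM-DD |".
ROW = re.compile(r"^\|\s*([DPM]-\d+)\s*\|([^|]*)\|([^|]*)\|([^|]*)\|\s*$")


def _is_iso_date(value: str) -> bool:
    """True if value parses as an ISO calendar date such as 2024-12-20."""
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False


def lint_postmortem(text: str) -> list[str]:
    """Flag a few mechanically detectable anti-patterns in a postmortem."""
    findings = []
    # A line that starts with a singular "Root Cause:" label.
    if re.search(r"^\W*Root Cause\W*:", text, flags=re.IGNORECASE | re.MULTILINE):
        findings.append("singular 'Root Cause:' found; list contributing factors")
    if "what went well" not in text.lower():
        findings.append("missing 'What Went Well' section")
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if not m:
            continue  # not an action-item row (header, separator, prose)
        item_id, _action, owner, due = (g.strip() for g in m.groups())
        # Unfilled template placeholders such as "<team or role>" count as missing.
        if not owner or owner.startswith("<"):
            findings.append(f"{item_id}: no owner")
        if not _is_iso_date(due):
            findings.append(f"{item_id}: no valid due date")
    return findings
```

The lint only checks that an owner and a due date are present; whether action items are tracked to completion has to be handled by whatever tracker the team uses.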