incident-postmortem
Incident Postmortem Skill
This skill produces a complete, blameless incident postmortem document following industry-standard format. Output enforces blameless framing throughout — system gaps over individual failures — and drives toward specific, closeable action items rather than vague process commitments.
Required Inputs
Ask the user for these if not provided:
- Incident title / ID
- Severity (P1 / P2 / P3 or SEV1 / SEV2 / SEV3)
- Date and duration of the incident
- What happened (rough notes are fine — the skill will structure them)
- Services or systems affected
- Customer impact (how many users, what was degraded)
- How it was detected
- How it was resolved
- Initial thoughts on root cause
- Action items already identified (optional)
- Responders (who was on-call or responded — names or roles; used for the timeline, not for blame)
- Customer or external communications sent (optional — any status page updates, emails, or support messages with timestamps)
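A minimal sketch of how the input checklist above could be tracked, assuming the inputs are collected as free-text strings. The class name `PostmortemInputs`, its field names, and `missing_inputs` are hypothetical and only mirror the list; the skill itself does not define a schema.

```python
from dataclasses import dataclass, fields
from typing import Optional


# Hypothetical field names that mirror the input list; not a schema defined by the skill.
@dataclass
class PostmortemInputs:
    title: Optional[str] = None
    severity: Optional[str] = None
    date_and_duration: Optional[str] = None
    what_happened: Optional[str] = None
    services_affected: Optional[str] = None
    customer_impact: Optional[str] = None
    detection: Optional[str] = None
    resolution: Optional[str] = None
    root_cause_thoughts: Optional[str] = None
    responders: Optional[str] = None
    # The two inputs the list marks as optional.
    action_items: Optional[str] = None
    communications: Optional[str] = None


OPTIONAL_FIELDS = {"action_items", "communications"}


def missing_inputs(inputs: PostmortemInputs) -> list[str]:
    """Return the names of required inputs that are absent or blank."""
    return [
        f.name
        for f in fields(inputs)
        if f.name not in OPTIONAL_FIELDS
        and not (getattr(inputs, f.name) or "").strip()
    ]
```

Anything returned by `missing_inputs` is what the user would be asked for before drafting begins; the optional inputs are never prompted for.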
Output Format
Incident Postmortem: [Incident Title]
Incident ID: [ID]
Severity: [P1/P2/P3 or SEV1/SEV2/SEV3]
Date: [Date]
Duration: [Start time → Resolution time — total duration]
Status: [Resolved / Monitoring / Ongoing]
Author: [Leave blank for user to fill]
Last updated: [Date]
Executive Summary
[3–5 sentences. Describe what happened, who was affected, and what was done to resolve it. Written for a non-technical stakeholder. No jargon. No blame.]
Impact
| Dimension | Details |
|---|---|
| Users affected | [Number or percentage] |
| Services degraded | [List affected services] |
| Business impact | [Revenue, SLA breach, support tickets, etc. if known] |
| Duration | [Total time from first detection to full resolution] |
Timeline
List events in chronological order. Each entry:
[HH:MM UTC] - [What happened. Who did what. What changed.]
Rules for timeline entries:
- Use passive or system-focused language — avoid "X made a mistake"
- Include: first symptom, detection, escalation, hypothesis tested, fix applied, confirmation of resolution
- Note time between key events (e.g. "22 minutes between detection and escalation")
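The gap notes asked for in the last rule are simple time arithmetic. The sketch below assumes timeline entries are available as `(HH:MM, label)` pairs in UTC, already in chronological order, with less than 24 hours between consecutive entries; the function name `gaps_between` is hypothetical.

```python
from datetime import datetime, timedelta


def gaps_between(events: list[tuple[str, str]]) -> list[str]:
    """Describe the elapsed minutes between each pair of consecutive events.

    events: (HH:MM, label) pairs in chronological order, all in UTC.
    Assumes consecutive entries are less than 24 hours apart.
    """
    notes = []
    for (t1, first), (t2, second) in zip(events, events[1:]):
        start = datetime.strptime(t1, "%H:%M")
        end = datetime.strptime(t2, "%H:%M")
        if end < start:
            end += timedelta(days=1)  # the later entry is past midnight
        minutes = int((end - start).total_seconds() // 60)
        notes.append(f"{minutes} minutes between {first} and {second}")
    return notes
```

For example, `gaps_between([("14:10", "detection"), ("14:32", "escalation")])` returns `["22 minutes between detection and escalation"]`.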
Root Cause
Primary root cause: [One clear sentence. Technical but plain. "A misconfigured deployment config caused..."]
Contributing factors:
- [Factor 1 — e.g. lack of canary deployment meant change hit 100% of traffic immediately]
- [Factor 2 — e.g. alert threshold was set too high to catch the initial degradation]
- [Factor 3 — add as many as are relevant]
Why did our existing safeguards not prevent this?
[Honest paragraph explaining why monitoring, tests, or processes didn't catch this earlier. This is where blameless analysis matters most — focus on system gaps, not individual failures.]
Detection
- How was it first detected? [Customer report / automated alert / internal monitoring / manual observation]
- Time from incident start to detection: [X minutes]
- Should we have detected this faster? [Yes / No — and why]
Resolution
What fixed it? [Clear description of the actual fix — one paragraph]
Why did this work? [Brief technical explanation]
Was there a temporary mitigation before full resolution? [Yes/No — describe if yes]
Action Items
| # | Action | Owner | Due Date | Priority |
|---|---|---|---|---|
| 1 | [Specific, testable action] | [Team or person] | [Date] | P1/P2/P3 |
Rules for action items:
- Each action must be specific enough to close as "done" or "not done" — no vague items like "improve monitoring"
- Distinguish between: Prevent recurrence (fix the root cause), Improve detection (catch it faster next time), Improve response (resolve it faster next time)
- Assign a real owner — not "team" or "TBD" if avoidable
- Flag P1 actions as items that block the incident from being marked fully closed
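The rules above lend themselves to a mechanical first pass before human review. This is a rough sketch, not part of the skill: `ActionItem`, `lint_action_item`, and `blocks_closure` are hypothetical names, and the phrase list is an illustrative starting point. A substring match is a crude heuristic, so a specific item that happens to contain a listed phrase would still be flagged for a second look.

```python
from dataclasses import dataclass

# Illustrative phrase list; extend it with whatever vague wording your team tends to use.
VAGUE_PHRASES = ("improve monitoring", "increase resilience", "better testing")
# Owner values treated as "no real owner assigned".
PLACEHOLDER_OWNERS = {"", "team", "tbd"}


@dataclass
class ActionItem:
    action: str
    owner: str
    due_date: str
    priority: str  # "P1", "P2", or "P3"


def lint_action_item(item: ActionItem) -> list[str]:
    """Return a list of rule violations for one action item (empty if none)."""
    problems = []
    lowered = item.action.lower()
    for phrase in VAGUE_PHRASES:
        if phrase in lowered:
            problems.append(f'vague language: "{phrase}"')
    if item.owner.strip().lower() in PLACEHOLDER_OWNERS:
        problems.append("owner is missing or a placeholder")
    if not item.due_date.strip():
        problems.append("due date is missing")
    return problems


def blocks_closure(items: list[ActionItem]) -> list[ActionItem]:
    """Return the P1 items, which keep the incident from being marked fully closed."""
    return [item for item in items if item.priority == "P1"]
```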
What Went Well
[3–5 honest observations about the response. Include: fast collaboration, good runbooks used, effective escalation, clear communication. This section builds team confidence and reinforces good habits.]
Lessons Learned
[3–5 key insights from this incident that are worth sharing beyond this team. Write these as transferable lessons — e.g. "Our runbook for database failover didn't account for read-replica lag. All runbooks involving database failover should be reviewed."]
Communication Log
[Optional — list external communications sent: status page updates, customer emails, support responses. Include timestamps.]
Quality Checks
- Timeline has no blame-focused language
- Root cause is specific (not "human error")
- Root cause answers "why did this happen?" not just "what happened?" — it names a system or process gap, not a symptom
- Contributing factors explain the systemic gaps
- Every action item has an owner and due date
- "What went well" section is genuine, not token
- No action item contains vague language like "improve monitoring", "increase resilience", or "better testing" — each must name a specific change
- Executive summary is readable by non-technical leadership
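The first two checks can be partly automated with a text scan. The sketch below is one possible approach under loose assumptions: the two patterns are the only blame phrases this skill names explicitly ("made a mistake" and "human error"), and `flag_blame_language` is a hypothetical helper. A clean result does not prove the document is blameless, so a human read-through is still needed.

```python
import re

# Only the two phrases named in this skill; a real list would need to be longer.
BLAME_PATTERNS = [
    re.compile(r"\bmade a mistake\b", re.IGNORECASE),
    re.compile(r"\bhuman error\b", re.IGNORECASE),
]


def flag_blame_language(lines: list[str]) -> list[tuple[int, str]]:
    """Return (1-based line number, line) for every line matching a blame pattern."""
    return [
        (number, line)
        for number, line in enumerate(lines, start=1)
        if any(pattern.search(line) for pattern in BLAME_PATTERNS)
    ]
```

Running it over the timeline and root cause sections of a draft gives a short list of lines to rephrase in system-focused terms.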
Usage Examples
- "Write a postmortem for the [incident name] outage"
- "Help me write a P1 incident report"
- "Generate an RCA document for [service] going down on [date]"
- "Draft a blameless postmortem from these notes: [paste notes]"