incident-postmortem

Incident Postmortem Skill

This skill produces a complete, blameless incident postmortem document in an industry-standard format. The output keeps blameless framing throughout (system gaps over individual failures) and drives toward specific, closeable action items rather than vague process commitments.

Required Inputs

Ask the user for these if not provided:
  • Incident title / ID
  • Severity (P1 / P2 / P3 or SEV1 / SEV2 / SEV3)
  • Date and duration of the incident
  • What happened (rough notes are fine — the skill will structure them)
  • Services or systems affected
  • Customer impact (how many users, what was degraded)
  • How it was detected
  • How it was resolved
  • Initial thoughts on root cause
  • Action items already identified (optional)
  • Responders (who was on-call or responded — names or roles; used for the timeline, not for blame)
  • Customer or external communications sent (optional — any status page updates, emails, or support messages with timestamps)
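One way to decide which questions still need to be asked is to model the inputs as a record and report the required fields that are empty. The sketch below is only an illustration: the `IncidentInputs` class, its field names, and `missing_required` are hypothetical names, not part of the skill itself.

```python
from dataclasses import dataclass, fields
from typing import List, Optional

# Hypothetical field names mirroring the input list above.
@dataclass
class IncidentInputs:
    title: Optional[str] = None
    severity: Optional[str] = None
    date_and_duration: Optional[str] = None
    what_happened: Optional[str] = None
    affected_services: Optional[str] = None
    customer_impact: Optional[str] = None
    detection: Optional[str] = None
    resolution: Optional[str] = None
    root_cause_thoughts: Optional[str] = None
    responders: Optional[str] = None
    action_items: Optional[str] = None      # marked optional in the list
    external_comms: Optional[str] = None    # marked optional in the list

# The two inputs the list labels as optional.
OPTIONAL_FIELDS = {"action_items", "external_comms"}

def missing_required(inputs: IncidentInputs) -> List[str]:
    """Return the names of required inputs that are absent or blank."""
    return [
        f.name
        for f in fields(inputs)
        if f.name not in OPTIONAL_FIELDS
        and not (getattr(inputs, f.name) or "").strip()
    ]
```

Calling `missing_required(IncidentInputs(title="Checkout outage", severity="P1"))` would list the eight remaining required inputs, which can then be turned into follow-up questions.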

Output Format

Incident Postmortem: [Incident Title]

Incident ID: [ID]
Severity: [P1/P2/P3]
Date: [Date]
Duration: [Start time → Resolution time, total duration]
Status: [Resolved / Monitoring / Ongoing]
Author: [Leave blank for user to fill]
Last updated: [Date]

Executive Summary

[3–5 sentences. Describe what happened, who was affected, and what was done to resolve it. Written for a non-technical stakeholder. No jargon. No blame.]

Impact

Dimension | Details
Users affected | [Number or percentage]
Services degraded | [List affected services]
Business impact | [Revenue, SLA breach, support tickets, etc. if known]
Duration | [Total time from first detection to full resolution]

Timeline

List events in chronological order. Each entry:
[HH:MM UTC] — [What happened. Who did what. What changed.]
Rules for timeline entries:
  • Use passive or system-focused language — avoid "X made a mistake"
  • Include: first symptom, detection, escalation, hypothesis tested, fix applied, confirmation of resolution
  • Note time between key events (e.g. "22 minutes between detection and escalation")
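The "time between key events" notes can be derived mechanically from the timestamps. The following is a minimal sketch, assuming each entry is an `(HH:MM, label)` pair in UTC, already in chronological order, with no gap of 24 hours or more between consecutive entries; `gaps_between_events` is a hypothetical helper name.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

def gaps_between_events(events: List[Tuple[str, str]]) -> List[str]:
    """Describe the minutes elapsed between consecutive timeline entries."""
    notes = []
    prev_time, prev_label = None, None
    day_offset = timedelta(0)  # grows when the timeline crosses midnight
    for hhmm, label in events:
        t = datetime.strptime(hhmm, "%H:%M") + day_offset
        if prev_time is not None:
            if t < prev_time:
                # A clock time earlier than the previous entry means
                # the incident rolled over into the next UTC day.
                t += timedelta(days=1)
                day_offset += timedelta(days=1)
            minutes = int((t - prev_time).total_seconds() // 60)
            notes.append(f"{minutes} minutes between {prev_label} and {label}")
        prev_time, prev_label = t, label
    return notes
```

For example, entries at 14:02 (first symptom), 14:10 (detection), and 14:32 (escalation) would yield "8 minutes between first symptom and detection" and "22 minutes between detection and escalation".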

Root Cause

Primary root cause: [One clear sentence. Technical but plain. "A misconfigured deployment config caused..."]
Contributing factors:
  • [Factor 1 — e.g. lack of canary deployment meant change hit 100% of traffic immediately]
  • [Factor 2 — e.g. alert threshold was set too high to catch the initial degradation]
  • [Factor 3 — add as many as are relevant]
Why did our existing safeguards not prevent this? [Honest paragraph explaining why monitoring, tests, or processes didn't catch this earlier. This is where blameless analysis matters most — focus on system gaps, not individual failures.]

Detection

  • How was it first detected? [Customer report / automated alert / internal monitoring / manual observation]
  • Time from incident start to detection: [X minutes]
  • Should we have detected this faster? [Yes / No — and why]

Resolution

What fixed it? [Clear description of the actual fix, one paragraph]
Why did this work? [Brief technical explanation]
Was there a temporary mitigation before full resolution? [Yes/No, describe if yes]

Action Items

# | Action | Owner | Due Date | Priority
1 | [Specific, testable action] | [Team or person] | [Date] | P1/P2/P3
Rules for action items:
  • Each action must be specific enough to close as "done" or "not done" — no vague items like "improve monitoring"
  • Distinguish between: Prevent recurrence (fix the root cause), Improve detection (catch it faster next time), Improve response (resolve it faster next time)
  • Assign a real owner — not "team" or "TBD" if avoidable
  • Flag P1 actions as items that block the incident from being marked fully closed
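These rules lend themselves to a simple automated lint. The sketch below is one possible shape, not a definitive implementation: `ActionItem`, `lint_action_item`, and `blocks_closure` are hypothetical names, the category labels are shorthand for the three types listed above, and the vague-phrase list only contains phrases this document itself names as vague.

```python
from dataclasses import dataclass
from typing import List

# Phrases this document calls out as too vague to close.
VAGUE_PHRASES = ("improve monitoring", "increase resilience", "better testing")
# Owners that do not identify anyone accountable.
PLACEHOLDER_OWNERS = {"", "team", "tbd"}
# Shorthand (an assumption) for: prevent recurrence, improve detection,
# improve response.
CATEGORIES = {"prevent", "detect", "respond"}

@dataclass
class ActionItem:
    action: str
    owner: str
    due_date: str
    priority: str   # "P1", "P2", or "P3"
    category: str   # one of CATEGORIES

def lint_action_item(item: ActionItem) -> List[str]:
    """Return a list of rule violations; an empty list means the item passes."""
    problems = []
    text = item.action.lower()
    for phrase in VAGUE_PHRASES:
        if phrase in text:
            problems.append(f'vague wording: "{phrase}"')
    if item.owner.strip().lower() in PLACEHOLDER_OWNERS:
        problems.append("owner is missing or a placeholder")
    if not item.due_date.strip():
        problems.append("due date is missing")
    if item.category not in CATEGORIES:
        problems.append("category is not one of prevent/detect/respond")
    return problems

def blocks_closure(items: List[ActionItem]) -> List[ActionItem]:
    """P1 items keep the incident from being marked fully closed."""
    return [item for item in items if item.priority == "P1"]
```

A phrase list cannot prove an item is specific; it only catches the known vague wordings, so a human review of each action is still needed.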

What Went Well

[3–5 honest observations about the response. Include: fast collaboration, good runbooks used, effective escalation, clear communication. This section builds team confidence and reinforces good habits.]

Lessons Learned

[3–5 key insights from this incident that are worth sharing beyond this team. Write these as transferable lessons — e.g. "Our runbook for database failover didn't account for read-replica lag. All runbooks involving database failover should be reviewed."]

Communication Log

[Optional — list external communications sent: status page updates, customer emails, support responses. Include timestamps.]

Quality Checks

  • Timeline has no blame-focused language
  • Root cause is specific (not "human error")
  • Root cause answers "why did this happen?" not just "what happened?" — it names a system or process gap, not a symptom
  • Contributing factors explain the systemic gaps
  • Every action item has an owner and due date
  • "What went well" section is genuine, not token
  • No action item contains vague language like "improve monitoring", "increase resilience", or "better testing" — each must name a specific change
  • Executive summary is readable by non-technical leadership
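The first two checks can be partly automated with a phrase scan over the timeline and root cause text. This is a rough sketch under stated assumptions: `find_blame_language` is a hypothetical helper, and only "made a mistake" and "human error" come from this document; the other patterns are illustrative guesses at blame-focused wording.

```python
import re
from typing import List, Tuple

BLAME_PATTERNS = [
    r"\bmade a mistake\b",   # named in the timeline rules
    r"\bhuman error\b",      # named in the quality checks
    r"\bforgot to\b",        # assumption: illustrative only
    r"\bcareless(ly)?\b",    # assumption: illustrative only
]

def find_blame_language(text: str) -> List[Tuple[int, str]]:
    """Return (line number, matched phrase) for each blame-focused phrase."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for pattern in BLAME_PATTERNS:
            match = re.search(pattern, line, flags=re.IGNORECASE)
            if match:
                hits.append((line_no, match.group(0)))
    return hits
```

An empty result does not mean the document is blameless; it only means none of the listed phrases appear, so the scan supplements rather than replaces a careful read.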

Usage Examples

  • "Write a postmortem for the [incident name] outage"
  • "Help me write a P1 incident report"
  • "Generate an RCA document for [service] going down on [date]"
  • "Draft a blameless postmortem from these notes: [paste notes]"