incident-postmortem

Incident Postmortem Skill

This skill produces a complete, blameless incident postmortem document in an industry-standard format. The output keeps a blameless framing throughout, focusing on system gaps over individual failures, and drives toward specific, closeable action items rather than vague process commitments.

Required Inputs

Ask the user for these if not provided:

Incident title / ID
Severity (P1 / P2 / P3 or SEV1 / SEV2 / SEV3)
Date and duration of the incident
What happened (rough notes are fine — the skill will structure them)
Services or systems affected
Customer impact (how many users, what was degraded)
How it was detected
How it was resolved
Initial thoughts on root cause
Action items already identified (optional)
Responders (who was on-call or responded — names or roles; used for the timeline, not for blame)
Customer or external communications sent (optional — any status page updates, emails, or support messages with timestamps)
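
As a rough illustration of the "ask if not provided" step, the sketch below tracks which required inputs are still missing before drafting begins. It is not part of the skill itself: the `IncidentInputs` class, its field names, and the `missing_inputs` helper are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, fields
from typing import Optional

# Hypothetical field names mirroring the input list above; not defined by the skill.
OPTIONAL_FIELDS = {"action_items", "communications"}


@dataclass
class IncidentInputs:
    title_or_id: Optional[str] = None
    severity: Optional[str] = None
    date_and_duration: Optional[str] = None
    what_happened: Optional[str] = None
    affected_services: Optional[str] = None
    customer_impact: Optional[str] = None
    detection: Optional[str] = None
    resolution: Optional[str] = None
    root_cause_thoughts: Optional[str] = None
    responders: Optional[str] = None
    action_items: Optional[str] = None    # optional input
    communications: Optional[str] = None  # optional input


def missing_inputs(inputs: IncidentInputs) -> list[str]:
    """Return the names of required fields that are empty, in declaration order."""
    return [
        f.name
        for f in fields(inputs)
        if f.name not in OPTIONAL_FIELDS
        and not (getattr(inputs, f.name) or "").strip()
    ]
```

Calling `missing_inputs(IncidentInputs(title_or_id="INC-1", severity="SEV2"))` would list the eight required fields that still need to be requested from the user, while the two optional fields are never reported.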

Output Format

Incident Postmortem: [Incident Title]

Incident ID: [ID]
Severity: [P1/P2/P3 or SEV1/SEV2/SEV3]
Date: [Date]
Duration: [Start time → Resolution time, total duration]
Status: [Resolved / Monitoring / Ongoing]
Author: [Leave blank for user to fill]
Last updated: [Date]

Executive Summary

[3–5 sentences. Describe what happened, who was affected, and what was done to resolve it. Written for a non-technical stakeholder. No jargon. No blame.]

Impact

Dimension	Details
Users affected	[Number or percentage]
Services degraded	[List affected services]
Business impact	[Revenue, SLA breach, support tickets, etc. if known]
Duration	[Total time from start of impact to full resolution]

Timeline

List events in chronological order. Each entry:

[HH:MM UTC] — [What happened. Who did what. What changed.]

Rules for timeline entries:

Use passive or system-focused language — avoid "X made a mistake"
Include: first symptom, detection, escalation, hypothesis tested, fix applied, confirmation of resolution
Note time between key events (e.g. "22 minutes between detection and escalation")
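
The interval notes required by the last rule can be derived mechanically from the entry timestamps. The sketch below is a minimal illustration under stated assumptions: the sample entries are hypothetical, the `gaps_in_minutes` helper is a name invented for this example, and the timeline is assumed to be chronological and shorter than 24 hours.

```python
from datetime import datetime, timedelta

# Hypothetical sample entries in the "[HH:MM UTC]" style; not from a real incident.
timeline = [
    ("14:02", "first symptom"),
    ("14:10", "detection"),
    ("14:32", "escalation"),
    ("15:05", "fix applied"),
]


def gaps_in_minutes(entries):
    """Return (from_label, to_label, minutes) for each consecutive pair of entries.

    Assumes entries are chronological and span less than 24 hours; a time
    earlier than the previous one is treated as crossing midnight.
    """
    result = []
    for (t1, label1), (t2, label2) in zip(entries, entries[1:]):
        start = datetime.strptime(t1, "%H:%M")
        end = datetime.strptime(t2, "%H:%M")
        if end < start:
            end += timedelta(days=1)
        minutes = int((end - start).total_seconds() // 60)
        result.append((label1, label2, minutes))
    return result


for a, b, minutes in gaps_in_minutes(timeline):
    print(f"{minutes} minutes between {a} and {b}")
```

For the sample data this prints three lines, including "22 minutes between detection and escalation". Timelines that run longer than a day would need full dates rather than bare HH:MM values.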

Root Cause

Primary root cause: [One clear sentence, technical but plain, e.g. "A misconfigured deployment setting caused..."]

Contributing factors:

[Factor 1 — e.g. lack of canary deployment meant change hit 100% of traffic immediately]
[Factor 2 — e.g. alert threshold was set too high to catch the initial degradation]
[Factor 3 — add as many as are relevant]

Why did our existing safeguards not prevent this? [Honest paragraph explaining why monitoring, tests, or processes didn't catch this earlier. This is where blameless analysis matters most — focus on system gaps, not individual failures.]

Detection

How was it first detected? [Customer report / automated alert / internal monitoring / manual observation]
Time from incident start to detection: [X minutes]
Should we have detected this faster? [Yes / No — and why]

Resolution

What fixed it? [Clear description of the actual fix, one paragraph]
Why did this work? [Brief technical explanation]
Was there a temporary mitigation before full resolution? [Yes/No, describe if yes]

Action Items

#	Action	Owner	Due Date	Priority
1	[Specific, testable action]	[Team or person]	[Date]	P1/P2/P3

Rules for action items:

Each action must be specific enough to close as "done" or "not done" — no vague items like "improve monitoring"
Distinguish between: Prevent recurrence (fix the root cause), Improve detection (catch it faster next time), Improve response (resolve it faster next time)
Assign a real owner (a named person or a specific team), not a generic "team" or "TBD", if avoidable
Flag P1 actions as items that block the incident from being marked fully closed
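
The rules above lend themselves to a simple automated check. The following is a sketch only: the `ActionItem` shape, the list of vague phrases, and the placeholder owner names are assumptions made for this example rather than anything the skill defines, and a keyword heuristic cannot replace human judgment about whether an action is truly closeable.

```python
from dataclasses import dataclass

# Hypothetical heuristics for this sketch: the phrase list and placeholder
# owners are assumptions, not rules defined by the skill.
VAGUE_PREFIXES = ("improve", "investigate", "look into", "be more careful")
PLACEHOLDER_OWNERS = {"", "team", "tbd"}


@dataclass
class ActionItem:
    action: str
    owner: str
    due_date: str
    priority: str  # "P1", "P2", or "P3"
    done: bool = False


def lint(item: ActionItem) -> list[str]:
    """Return a list of rule violations for one action item (empty if none)."""
    problems = []
    if item.action.strip().lower().startswith(VAGUE_PREFIXES):
        problems.append("action may be too vague to close as done / not done")
    if item.owner.strip().lower() in PLACEHOLDER_OWNERS:
        problems.append("owner is a placeholder")
    if not item.due_date.strip():
        problems.append("missing due date")
    return problems


def can_close_incident(items: list[ActionItem]) -> bool:
    """P1 action items block closure until they are marked done."""
    return all(item.done for item in items if item.priority == "P1")
```

An item such as `ActionItem("Improve monitoring", "TBD", "", "P1")` would trip all three checks and, while open, would keep `can_close_incident` returning `False`.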

What Went Well

[3–5 honest observations about the response, such as fast collaboration, effective use of runbooks, effective escalation, or clear communication. This section builds team confidence and reinforces good habits.]

Lessons Learned

[3–5 key insights from this incident that are worth sharing beyond this team. Write these as transferable lessons — e.g. "Our runbook for database failover didn't account for read-replica lag. All runbooks involving database failover should be reviewed."]

Communication Log

[Optional — list external communications sent: status page updates, customer emails, support responses. Include timestamps.]
