Incident Postmortem Skill
This skill produces a complete, blameless incident postmortem document following industry-standard format. Output enforces blameless framing throughout — system gaps over individual failures — and drives toward specific, closeable action items rather than vague process commitments.
Required Inputs
Ask the user for these if not provided:
- Incident title / ID
- Severity (P1 / P2 / P3 or SEV1 / SEV2 / SEV3)
- Date and duration of the incident
- What happened (rough notes are fine — the skill will structure them)
- Services or systems affected
- Customer impact (how many users, what was degraded)
- How it was detected
- How it was resolved
- Initial thoughts on root cause
- Action items already identified (optional)
- Responders (who was on-call or responded — names or roles; used for the timeline, not for blame)
- Customer or external communications sent (optional — any status page updates, emails, or support messages with timestamps)
Output Format
Incident Postmortem: [Incident Title]
Incident ID: [ID]
Severity: [P1/P2/P3]
Date: [Date]
Duration: [Start time → Resolution time — total duration]
Status: [Resolved / Monitoring / Ongoing]
Author: [Leave blank for user to fill]
Last updated: [Date]
Executive Summary
[3–5 sentences. Describe what happened, who was affected, and what was done to resolve it. Written for a non-technical stakeholder. No jargon. No blame.]
Impact
| Dimension | Details |
|---|
| Users affected | [Number or percentage] |
| Services degraded | [List affected services] |
| Business impact | [Revenue, SLA breach, support tickets, etc. if known] |
| Duration | [Total time from first detection to full resolution] |
Timeline
List events in chronological order. Each entry:
[HH:MM UTC] — [What happened. Who did what. What changed.]
Rules for timeline entries:
- Use passive or system-focused language — avoid "X made a mistake"
- Include: first symptom, detection, escalation, hypothesis tested, fix applied, confirmation of resolution
- Note time between key events (e.g. "22 minutes between detection and escalation")
Root Cause
Primary root cause: [One clear sentence. Technical but plain. "A misconfigured deployment config caused..."]
Contributing factors:
- [Factor 1 — e.g. lack of canary deployment meant change hit 100% of traffic immediately]
- [Factor 2 — e.g. alert threshold was set too high to catch the initial degradation]
- [Factor 3 — add as many as are relevant]
Why did our existing safeguards not prevent this?
[Honest paragraph explaining why monitoring, tests, or processes didn't catch this earlier. This is where blameless analysis matters most — focus on system gaps, not individual failures.]
Detection
- How was it first detected? [Customer report / automated alert / internal monitoring / manual observation]
- Time from incident start to detection: [X minutes]
- Should we have detected this faster? [Yes / No — and why]
Resolution
What fixed it? [Clear description of the actual fix — one paragraph]
Why did this work? [Brief technical explanation]
Was there a temporary mitigation before full resolution? [Yes/No — describe if yes]
Action Items
| # | Action | Owner | Due Date | Priority |
|---|
| 1 | [Specific, testable action] | [Team or person] | [Date] | P1/P2/P3 |
Rules for action items:
- Each action must be specific enough to close as "done" or "not done" — no vague items like "improve monitoring"
- Distinguish between: Prevent recurrence (fix the root cause), Improve detection (catch it faster next time), Improve response (resolve it faster next time)
- Assign a real owner — not "team" or "TBD" if avoidable
- Flag P1 actions as items that block the incident from being marked fully closed
What Went Well
[3–5 honest observations about the response. Include: fast collaboration, good runbooks used, effective escalation, clear communication. This section builds team confidence and reinforces good habits.]
Lessons Learned
[3–5 key insights from this incident that are worth sharing beyond this team. Write these as transferable lessons — e.g. "Our runbook for database failover didn't account for read-replica lag. All runbooks involving database failover should be reviewed."]
Communication Log
[Optional — list external communications sent: status page updates, customer emails, support responses. Include timestamps.]
Quality Checks
Usage Examples
- "Write a postmortem for the [incident name] outage"
- "Help me write a P1 incident report"
- "Generate an RCA document for [service] going down on [date]"
- "Draft a blameless postmortem from these notes: [paste notes]"