Create a blameless postmortem when the user asks to write a postmortem, document what went wrong, analyze an incident, or run a 5 Whys analysis.
npx skill4agent add generaljerel/chalk-skills create-postmortem

Arguments: `$ARGUMENTS`

Output file: `.chalk/docs/engineering/<n>_postmortem_<incident>.md`

# Postmortem: <Incident Title>
**Date of Incident**: <YYYY-MM-DD>
**Postmortem Date**: <YYYY-MM-DD>
**Status**: Draft | Reviewed | Final
**Severity**: SEV-1 | SEV-2 | SEV-3 | SEV-4
**Related Incident Report**: <link or filename, if available>
## Summary
<2-3 sentences: What happened, what was the impact, and how was it resolved. Written for someone encountering this postmortem without prior context.>
## Impact
| Dimension | Measurement |
|-----------|-------------|
| Users Affected | <number or percentage> |
| Duration (user-facing) | <total time users experienced the issue> |
| Revenue Impact | <estimated amount or "not measurable"> |
| Data Impact | <records affected or "no data impact"> |
| SLA Breach | <Yes — details / No> |
## Timeline
All times in <timezone>.
| Time | Event |
|------|-------|
| <HH:MM> | <event description> |
| <HH:MM> | <event description> |
## Contributing Factors
> There is rarely a single root cause. The following factors combined to produce this incident.
### Factor 1: <Name>
<Description of this contributing factor and how it contributed to the incident.>
#### 5 Whys
1. **Why** <symptom>? → <answer>
2. **Why** <answer>? → <deeper answer>
3. **Why** <deeper answer>? → <systemic issue>
4. **Why** <systemic issue>? → <organizational gap>
5. **Why** <organizational gap>? → <actionable root>
### Factor 2: <Name>
<Description and 5 Whys for this factor.>
## What Safety Barriers Failed
Using the Swiss cheese model: each barrier is a layer of defense. When holes in multiple layers align, incidents occur.
### Detection
- <What monitoring or alerting should have caught this? Did alerts fire? Were they actionable?>
### Prevention
- <What process or tooling should have prevented this from reaching production? Code review, testing, feature flags, validation?>
### Mitigation
- <What mechanisms should have limited the blast radius? Rollback, circuit breakers, rate limiting, graceful degradation?>
### Communication
- <Was the right information communicated to the right people at the right time? Status pages, incident channels, customer communication?>
## Action Items
### Detect
| ID | Action | Owner | Due Date |
|----|--------|-------|----------|
| D-1 | <action> | <team or role> | <YYYY-MM-DD> |
### Prevent
| ID | Action | Owner | Due Date |
|----|--------|-------|----------|
| P-1 | <action> | <team or role> | <YYYY-MM-DD> |
### Mitigate
| ID | Action | Owner | Due Date |
|----|--------|-------|----------|
| M-1 | <action> | <team or role> | <YYYY-MM-DD> |
## What Went Well
- <Positive aspects of the incident response>
- <Practices that should be reinforced>
## Recurrence Check
<Have similar incidents occurred before? If yes, reference the previous postmortem/incident report and explain why prior action items did not prevent recurrence. If no prior incidents, state that explicitly.>
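
The output filename pattern `.chalk/docs/engineering/<n>_postmortem_<incident>.md` can be filled in mechanically. The following is a minimal Python sketch, not part of the skill itself. It assumes that `<n>` is one more than the highest numeric prefix already present in the directory and that `<incident>` is a lowercase, underscore-separated slug of the incident title. Both conventions, and the helper names `slugify` and `next_postmortem_path`, are hypothetical.

```python
import re
from pathlib import Path


def slugify(title: str) -> str:
    """Lowercase the title and join its alphanumeric runs with underscores.

    Assumed convention: the source does not specify how <incident> is formed.
    """
    return "_".join(re.findall(r"[a-z0-9]+", title.lower()))


def next_postmortem_path(docs_dir: Path, incident_title: str) -> Path:
    """Build docs_dir/<n>_postmortem_<incident>.md.

    Assumed numbering scheme: <n> is one more than the highest numeric
    prefix of any existing entry in docs_dir, or 1 if none is found.
    """
    highest = 0
    if docs_dir.is_dir():
        for entry in docs_dir.iterdir():
            match = re.match(r"(\d+)_", entry.name)
            if match:
                highest = max(highest, int(match.group(1)))
    slug = slugify(incident_title)
    return docs_dir / f"{highest + 1}_postmortem_{slug}.md"


if __name__ == "__main__":
    # Hypothetical incident title, used only for illustration.
    print(next_postmortem_path(Path(".chalk/docs/engineering"), "Checkout API outage"))
```

Scanning every numbered file, rather than only earlier postmortems, keeps the numbering unique if the directory also holds other numbered documents. If the project numbers postmortems separately, narrow the pattern accordingly.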