incident-response
Incident Response
When to Use
Activate this skill when:
- Production service is down or returning errors to users
- Error rate has spiked beyond normal thresholds
- Performance has degraded significantly (latency increase, timeouts)
- An alert has fired from the monitoring system
- Users are reporting issues that indicate a systemic problem
- A failed deployment needs investigation and remediation
- Conducting a post-mortem or root cause analysis after an incident
Output: Write runbooks to `docs/runbooks/<service>-runbook.md` and post-mortems to `postmortem-YYYY-MM-DD.md`.

Do NOT use this skill for:
- Setting up monitoring or alerting rules (use `monitoring-setup`)
- Performing routine deployments (use `deployment-pipeline`)
- Docker image or infrastructure issues (use `docker-best-practices`)
- Feature development or code changes (use `python-backend-expert` or `react-frontend-expert`)
Instructions
Severity Classification
Classify every incident immediately. Severity determines response urgency, communication cadence, and escalation path.
| Severity | Impact | Examples | Response Time | Update Cadence |
|---|---|---|---|---|
| SEV1 (P1) | Complete outage, all users affected | Service down, data loss, security breach | Immediate (< 5 min) | Every 15 min |
| SEV2 (P2) | Major degradation, most users affected | Core feature broken, severe latency | < 15 min | Every 30 min |
| SEV3 (P3) | Partial degradation, some users affected | Non-critical feature broken, intermittent errors | < 1 hour | Every 2 hours |
| SEV4 (P4) | Minor issue, few users affected | Cosmetic bug, edge case error | < 4 hours | Daily |
Escalation rules:
- SEV1: Page on-call engineer + engineering manager immediately
- SEV2: Page on-call engineer, notify engineering manager
- SEV3: Notify on-call engineer via Slack
- SEV4: Create ticket, address during normal working hours
See `references/escalation-contacts.md` for the contact matrix.
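The escalation rules above can be sketched as a small dispatch helper. This is an illustrative sketch only: the actual paging and ticketing commands depend on your tooling (PagerDuty, Slack CLI, ticket tracker) and are represented here by echoed descriptions.

```shell
#!/bin/sh
# Return the escalation action for a given severity (SEV1-SEV4),
# mirroring the rules listed above.
escalation_action() {
  case "$1" in
    SEV1) echo "page on-call engineer + engineering manager immediately" ;;
    SEV2) echo "page on-call engineer, notify engineering manager" ;;
    SEV3) echo "notify on-call engineer via Slack" ;;
    SEV4) echo "create ticket, address during working hours" ;;
    *)    echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

escalation_action SEV3   # -> notify on-call engineer via Slack
```

In practice each branch would invoke your paging tool; keeping the mapping in one place makes the rules easy to audit.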
5-Minute Triage Workflow
When an incident is detected, follow this triage workflow within the first 5 minutes.
┌─────────────────────────────────────────────────────────┐
│ MINUTE 0-1: Acknowledge and Classify │
│ • Acknowledge the alert or report │
│ • Assign severity (SEV1-SEV4) │
│ • Designate incident commander │
├─────────────────────────────────────────────────────────┤
│ MINUTE 1-2: Assess Scope │
│ • Check health endpoints for all services │
│ • Check error rate and latency dashboards │
│ • Determine: which services are affected? │
├─────────────────────────────────────────────────────────┤
│ MINUTE 2-3: Identify Recent Changes │
│ • Check: was there a recent deployment? │
│ • Check: any infrastructure changes? │
│ • Check: any external dependency issues? │
├─────────────────────────────────────────────────────────┤
│ MINUTE 3-4: Initial Communication │
│ • Post in #incidents channel │
│ • Update status page if SEV1/SEV2 │
│ • Page additional responders if needed │
├─────────────────────────────────────────────────────────┤
│ MINUTE 4-5: Begin Investigation or Mitigate │
│ • If recent deploy: consider immediate rollback │
│ • If not deploy-related: begin diagnostic commands │
│ • Start incident timeline log │
└─────────────────────────────────────────────────────────┘

Quick health check command:
```bash
./skills/incident-response/scripts/health-check-all-services.sh \
  --output-dir ./incident-triage/
```
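Step MINUTE 4-5 above says to start an incident timeline log; a minimal way to do that from the shell is a small append helper. The file name `incident-timeline.log` and the sample entries are illustrative choices, not conventions mandated by this runbook.

```shell
# Append a UTC-timestamped entry to the incident timeline log.
log_timeline() {
  printf '%s  %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" >> incident-timeline.log
}

log_timeline "SEV2 declared; IC assigned"
log_timeline "rollback of latest deploy started"
cat incident-timeline.log
```

Timestamping every significant action as it happens makes the post-mortem timeline trivial to assemble later.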
Incident Commander Role
The incident commander (IC) coordinates the response. They do NOT investigate directly.
IC responsibilities:
- Coordinate -- Assign tasks to responders, prevent duplicate work
- Communicate -- Post regular updates to stakeholders
- Decide -- Make go/no-go decisions on rollback, escalation, communication
- Track -- Maintain the incident timeline
- Close -- Declare the incident resolved and schedule the post-mortem
IC communication template (initial):
INCIDENT DECLARED: [Title]
Severity: [SEV1/SEV2/SEV3/SEV4]
Commander: [Name]
Start time: [UTC timestamp]
Impact: [What users are experiencing]
Status: Investigating
Next update: [Time]

IC communication template (update):
INCIDENT UPDATE: [Title]
Severity: [SEV level]
Duration: [Time since start]
Status: [Investigating/Identified/Mitigating/Resolved]
Current findings: [What we know]
Actions in progress: [What we are doing]
Next update: [Time]
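One way to keep announcements consistent is to fill the template from shell variables; the title, names, and impact below are placeholder values for illustration.

```shell
# Render the initial incident announcement from variables (placeholder values).
TITLE="Checkout errors"
SEVERITY="SEV2"
COMMANDER="J. Doe"
IMPACT="elevated 500s on checkout"

cat <<EOF
INCIDENT DECLARED: $TITLE
Severity: $SEVERITY
Commander: $COMMANDER
Start time: $(date -u +%Y-%m-%dT%H:%M:%SZ)
Impact: $IMPACT
Status: Investigating
Next update: in 30 minutes
EOF
```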
Investigation Steps
Follow these diagnostic steps based on the type of issue.
Application Errors (FastAPI)
```bash
# 1. Check application logs for errors
./skills/incident-response/scripts/fetch-logs.sh \
  --service backend \
  --since "15 minutes ago" \
  --output-dir ./incident-logs/

# 2. Check error rate from logs
docker logs app-backend --since 15m 2>&1 | grep -c "ERROR"

# 3. Check active connections and request patterns
curl -s http://localhost:8000/health/ready | jq .

# 4. Check if the issue is in a specific endpoint
docker logs app-backend --since 15m 2>&1 |
  grep "ERROR" |
  grep -oP '"path":"[^"]*"' | sort | uniq -c | sort -rn

# 5. Check Python process status
docker exec app-backend ps aux
docker exec app-backend python -c "import sys; print(sys.version)"
```
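The `grep -c` count in step 2 is more actionable as a rate. The sketch below computes an error percentage from log lines; the inline `sample_logs` function stands in for real `docker logs app-backend --since 15m` output.

```shell
# Compute error rate as a percentage of total log lines.
# sample_logs is a stand-in for piped docker logs output.
sample_logs() {
  cat <<'EOF'
INFO GET /health 200
ERROR POST /orders 500 db timeout
INFO GET /orders 200
ERROR POST /orders 500 db timeout
EOF
}

total=$(sample_logs | wc -l)
errors=$(sample_logs | grep -c "ERROR")
echo "error rate: $((100 * errors / total))%"   # -> error rate: 50%
```

Comparing this percentage against your normal baseline is what tells you whether a spike crosses the alerting threshold.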
Database Issues (PostgreSQL)
```bash
# 1. Check database connectivity
docker exec app-db pg_isready -U postgres

# 2. Check active connections (connection pool exhaustion?)
docker exec app-db psql -U postgres -d app_prod -c "
  SELECT count(*), state FROM pg_stat_activity
  GROUP BY state ORDER BY count DESC;
"

# 3. Check for long-running queries (locks, deadlocks?)
docker exec app-db psql -U postgres -d app_prod -c "
  SELECT pid, now() - pg_stat_activity.query_start AS duration,
         query, state
  FROM pg_stat_activity
  WHERE (now() - pg_stat_activity.query_start) > interval '30 seconds'
    AND state != 'idle'
  ORDER BY duration DESC;
"

# 4. Check for lock contention
docker exec app-db psql -U postgres -d app_prod -c "
  SELECT blocked_locks.pid AS blocked_pid,
         blocking_locks.pid AS blocking_pid,
         blocked_activity.query AS blocked_query
  FROM pg_catalog.pg_locks blocked_locks
  JOIN pg_catalog.pg_stat_activity blocked_activity
    ON blocked_activity.pid = blocked_locks.pid
  JOIN pg_catalog.pg_locks blocking_locks
    ON blocking_locks.locktype = blocked_locks.locktype
   AND blocking_locks.relation = blocked_locks.relation
   AND blocking_locks.pid != blocked_locks.pid
  JOIN pg_catalog.pg_stat_activity blocking_activity
    ON blocking_activity.pid = blocking_locks.pid
  WHERE NOT blocked_locks.granted;
"

# 5. Check disk space
docker exec app-db df -h /var/lib/postgresql/data
```
Redis Issues
```bash
# 1. Check Redis connectivity
docker exec app-redis redis-cli ping

# 2. Check memory usage
docker exec app-redis redis-cli info memory | grep used_memory_human

# 3. Check connected clients
docker exec app-redis redis-cli info clients | grep connected_clients

# 4. Check slow log
docker exec app-redis redis-cli slowlog get 10

# 5. Check keyspace
docker exec app-redis redis-cli info keyspace
```
Network and Infrastructure
```bash
# 1. Check DNS resolution
nslookup api.example.com

# 2. Check SSL certificate expiry
echo | openssl s_client -servername api.example.com -connect api.example.com:443 2>/dev/null |
  openssl x509 -noout -dates

# 3. Check container resource usage
docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.NetIO}}"

# 4. Check disk space on host
df -h /

# 5. Check if dependent services are reachable
curl -sf https://external-api.example.com/health || echo "External API unreachable"
```
Remediation Actions
Immediate Mitigations (apply within minutes)
| Issue | Mitigation | Command |
|---|---|---|
| Bad deployment | Rollback | |
| Connection pool exhausted | Restart backend | |
| Long-running query | Kill query | |
| Memory leak | Restart service | |
| Redis full | Flush non-critical keys | |
| SSL expired | Apply new cert | Update cert in load balancer |
| Disk full | Clean logs/temp files | |
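The Command column in this copy of the table is blank. The helper below pairs each row with an illustrative first-response command, using the container names from the diagnostics sections (`app-backend`, `app-db`, `app-redis`); the `<pid>` and key `<prefix>` are placeholders to fill in during the incident, and the exact rollback command depends on your deployment pipeline.

```shell
# Map each mitigation row to an illustrative command string
# (echoed rather than executed, since these are destructive).
mitigation_for() {
  case "$1" in
    bad-deployment) echo "roll back to the previous release via the deployment pipeline" ;;
    pool-exhausted) echo "docker restart app-backend" ;;
    long-query)     echo "docker exec app-db psql -U postgres -d app_prod -c 'SELECT pg_terminate_backend(<pid>);'" ;;
    memory-leak)    echo "docker restart app-backend" ;;
    redis-full)     echo "docker exec app-redis redis-cli --scan --pattern '<prefix>:*' | xargs -r -n1 docker exec app-redis redis-cli del" ;;
    disk-full)      echo "docker exec app-backend sh -c 'rm -rf /tmp/*' && docker system prune -f" ;;
    *)              echo "no canned mitigation" ;;
  esac
}

mitigation_for pool-exhausted   # -> docker restart app-backend
```

Echoing instead of executing lets a responder confirm the command before running it under pressure.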
Longer-Term Fixes (apply after stabilization)
- Fix the root cause in code -- Create a branch, fix, test, deploy through normal pipeline
- Add monitoring -- If the issue was not caught by existing alerts, add new alert rules
- Add tests -- Write regression tests for the failure scenario
- Update runbooks -- Document the new failure mode and remediation steps
Communication Protocol
Internal Communication
Channels:
- `#incidents` -- Active incident coordination (SEV1/SEV2)
- `#incidents-low` -- SEV3/SEV4 tracking
- `#engineering` -- Post-incident summaries
Rules:
- All communication happens in the designated incident channel
- Use threads for investigation details, keep main channel for status updates
- IC posts updates at the defined cadence (see severity table)
- Tag relevant people explicitly, do not assume they are watching
- Timestamp all significant findings and actions
External Communication (SEV1/SEV2)
Status page update template:

[Investigating] We are investigating reports of [issue description].
Users may experience [user-visible impact].
We will provide an update within [time].

[Identified] The issue has been identified as [brief description].
We are working on a fix. Estimated resolution: [time estimate].

[Resolved] The issue affecting [service] has been resolved.
The root cause was [brief description].
We apologize for the disruption and will publish a detailed post-mortem.

Post-Mortem / RCA Framework
Conduct a blameless post-mortem within 48 hours of every SEV1/SEV2 incident. SEV3 incidents receive a lightweight review.
See `references/post-mortem-template.md` for the full template.

Post-mortem principles:
- Blameless -- Focus on systems and processes, not individuals
- Thorough -- Identify all contributing factors, not just the trigger
- Actionable -- Every finding must produce a concrete action item with an owner
- Timely -- Conduct within 48 hours while details are fresh
- Shared -- Publish to the entire engineering team
Post-mortem structure:
- Summary -- What happened, when, and what was the impact
- Timeline -- Minute-by-minute account of detection, investigation, mitigation
- Root cause -- The fundamental reason the incident occurred
- Contributing factors -- Other conditions that made the incident worse
- What went well -- Effective parts of the response
- What could be improved -- Gaps in detection, response, or tooling
- Action items -- Specific tasks with owners and due dates
Five Whys technique for root cause analysis:
Why did users see 500 errors?
-> Because the backend service returned errors to the load balancer.
Why did the backend service return errors?
-> Because database connections timed out.
Why did database connections time out?
-> Because the connection pool was exhausted.
Why was the connection pool exhausted?
-> Because a new endpoint opened connections without releasing them.
Why were connections not released?
-> Because the endpoint was missing the async context manager for sessions.
Root cause: Missing async context manager for database sessions in new endpoint.

Generate a structured incident report:
```bash
python skills/incident-response/scripts/generate-incident-report.py \
  --title "Database connection pool exhaustion" \
  --severity SEV2 \
  --start-time "2024-01-15T14:30:00Z" \
  --end-time "2024-01-15T15:15:00Z" \
  --output-dir ./post-mortems/
```
Incident Response Scripts
| Script | Purpose | Usage |
|---|---|---|
| `fetch-logs.sh` | Fetch recent logs from services | `scripts/fetch-logs.sh --service <name> --since "<window>" --output-dir <dir>` |
| `health-check-all-services.sh` | Check health of all services | `scripts/health-check-all-services.sh --output-dir <dir>` |
| `generate-incident-report.py` | Generate structured incident report | `scripts/generate-incident-report.py --title <title> --severity <sev> --output-dir <dir>` |
Quick Reference: Common Incident Patterns
| Pattern | Symptom | Likely Cause | First Action |
|---|---|---|---|
| 502/503 errors | Users see error page | Backend crashed or overloaded | Check backend container status and logs |
| Slow responses | High latency, timeouts | DB queries, external API | Check slow query log, DB connections |
| Partial failures | Some endpoints fail | Single dependency down | Check individual service health |
| Memory growth | OOM kills, restarts | Memory leak | Check container memory via `docker stats` |
| Error spike after deploy | Errors start exactly at deploy time | Bug in new code | Rollback immediately |
| Gradual degradation | Slowly worsening metrics | Resource exhaustion, connection leak | Check resource usage trends |
Output Files
**Runbooks:** Write to `docs/runbooks/<service>-runbook.md`:

```markdown
# Runbook: [Service Name]

## Service Overview
- Purpose, dependencies, critical paths

## Common Issues

### Issue 1: [Description]
- Symptoms: [What you see]
- Diagnosis: [Commands to run]
- Resolution: [Steps to fix]

## Escalation
- On-call: #ops-oncall
- Service owner: @team-name
```
**Post-mortems:** Write to `postmortem-YYYY-MM-DD.md`:
```markdown
# Post-Mortem: [Incident Title]

## Summary
- Date: YYYY-MM-DD
- Severity: SEV1-4
- Duration: X hours
- Impact: [Users/revenue affected]

## Timeline
- HH:MM - [Event]

## Root Cause
[Technical explanation]

## Action Items
- [Preventive measure] - Owner: @name - Due: YYYY-MM-DD
```