incident

Incident Response

Handle production incidents systematically.

When to Use

  • Production is down or degraded
  • Critical errors affecting users
  • Security incidents
  • Data issues
  • Performance emergencies

Incident Workflow

DETECT → TRIAGE → MITIGATE → RESOLVE → REVIEW

1. Detect & Triage

```bash
# Quick health checks
curl -s https://api.example.com/health | jq .
kubectl get pods -n production | grep -v Running

# Check recent deployments
git log --oneline -5
kubectl rollout history deployment/app

# Error rates
grep -c "ERROR" /var/log/app.log
```
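A raw `grep -c` count is easier to act on when compared against a threshold. The following is a minimal sketch, not part of the original runbook: the function name `error_count_status` and the default threshold of 50 are assumptions chosen for illustration.

```bash
# Hypothetical helper: classify a log file by how many ERROR lines it has.
# The default threshold (50) is an arbitrary assumption; tune it per service.
error_count_status() {
  log_file="$1"
  threshold="${2:-50}"

  if [ ! -r "$log_file" ]; then
    echo "UNKNOWN: cannot read $log_file"
    return 2
  fi

  # grep -c prints the number of matching lines; it exits non-zero when
  # there are no matches, so "|| true" keeps "set -e" scripts alive.
  count=$(grep -c "ERROR" "$log_file" || true)

  if [ "$count" -ge "$threshold" ]; then
    echo "ALERT: $count ERROR lines (threshold $threshold)"
  else
    echo "OK: $count ERROR lines (threshold $threshold)"
  fi
}
```

For example, `error_count_status /var/log/app.log 100` reports whether the log has reached 100 error lines.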

2. Mitigate First

Priority: Stop the bleeding before finding the root cause.

```bash
# Rollback deployment
kubectl rollout undo deployment/app

# Scale up if overloaded
kubectl scale deployment/app --replicas=10

# Feature flag disable (send the body as JSON)
curl -X POST https://api.example.com/admin/flags \
  -H 'Content-Type: application/json' \
  -d '{"feature": false}'

# Circuit breaker
# Block problematic endpoint or dependency
```
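During an incident it can help to preview a mitigation before running it. The sketch below wraps the rollback in a dry-run switch and then waits for the rollout to settle. The names `run` and `rollback_and_wait`, the `DRY_RUN` variable, the `production` default namespace, and the 120s timeout are all assumptions for illustration; only `kubectl rollout undo` and `kubectl rollout status` come from kubectl itself.

```bash
# Hypothetical wrapper: print the command instead of running it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Roll back a deployment, then wait for the rollout to finish.
# Arguments: deployment name, namespace (default "production"),
# timeout (default 120s, an arbitrary assumption).
rollback_and_wait() {
  deployment="$1"
  namespace="${2:-production}"
  timeout="${3:-120s}"

  run kubectl rollout undo "deployment/$deployment" -n "$namespace" &&
    run kubectl rollout status "deployment/$deployment" -n "$namespace" --timeout="$timeout"
}
```

Running it once with `DRY_RUN=1` set shows exactly which commands would be issued; unsetting the variable executes them.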

3. Investigate

```bash
# Recent logs
kubectl logs -l app=myapp --since=30m | grep -i error

# Resource usage
kubectl top pods -n production

# Database connections (assumes psql picks up connection settings from the environment)
psql -c "SELECT count(*) FROM pg_stat_activity WHERE state = 'active';"

# Network issues (expects a curl-format.txt file in the current directory)
curl -w "@curl-format.txt" -o /dev/null -s https://api.example.com
```
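The network check reads its output format from `curl-format.txt`, which the runbook does not define. One possible version is sketched below; the selection of fields is an assumption, not the original file, though each `%{...}` variable is a standard curl `--write-out` variable. The function name `write_curl_format` is hypothetical.

```bash
# Write a curl --write-out format file.
# Argument: target path (defaults to curl-format.txt in the current directory).
write_curl_format() {
  target="${1:-curl-format.txt}"
  # Quoted 'EOF' keeps %{...} and \n literal; curl interprets them itself.
  cat > "$target" <<'EOF'
dns_lookup: %{time_namelookup}s\n
tcp_connect: %{time_connect}s\n
tls_handshake: %{time_appconnect}s\n
first_byte: %{time_starttransfer}s\n
total: %{time_total}s\n
http_code: %{http_code}\n
EOF
}
```

After calling `write_curl_format`, the `curl -w "@curl-format.txt"` command prints a per-phase timing breakdown, which helps separate DNS, TLS, and server-side latency.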

Severity Levels

| Level | Impact | Response Time | Example |
|-------|--------|---------------|---------|
| P1 | Complete outage | Immediate | Site down |
| P2 | Major feature broken | 15 min | Payments failing |
| P3 | Minor feature broken | 1 hour | Search slow |
| P4 | Low impact | Next day | UI glitch |
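If alerting or paging scripts need the same mapping, the severity table can be encoded as a small lookup. This is only a sketch; the function name `response_time_for` is hypothetical, and the values are copied from the table.

```bash
# Map a severity level to its target response time.
# Prints an error and returns non-zero for unknown levels.
response_time_for() {
  case "$1" in
    P1) echo "Immediate" ;;
    P2) echo "15 min" ;;
    P3) echo "1 hour" ;;
    P4) echo "Next day" ;;
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```

Keeping the mapping in one function avoids the table and the tooling drifting apart.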

Communication Template

```markdown
## Incident Update

**Status:** Investigating | Identified | Mitigated | Resolved
**Severity:** P1/P2/P3
**Started:** YYYY-MM-DD HH:MM UTC
**Duration:** X hours

### Summary
[1-2 sentences on what's happening]

### Impact
[Who is affected and how]

### Current Actions
- [Action 1]
- [Action 2]

### Next Update
[Time of next update]
```
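Filling the template by hand under pressure invites mistakes, so some teams script it. The sketch below renders a plain-text subset of the update from positional arguments; the function name `render_incident_update` and the argument order are assumptions, not part of the original template.

```bash
# Hypothetical helper: fill in a short incident update.
# Arguments: status, severity, start time, summary, next update time.
render_incident_update() {
  cat <<EOF
Incident Update
Status: $1
Severity: $2
Started: $3

Summary
$4

Next Update
$5
EOF
}
```

For example: `render_incident_update Investigating P1 "2024-01-01 10:00 UTC" "API returning 500s" "10:30 UTC"`.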

Post-Mortem Template

```markdown
## Incident Post-Mortem

**Date:** YYYY-MM-DD
**Duration:** X hours
**Severity:** P1

### Summary
[What happened in 2-3 sentences]

### Timeline
- HH:MM - [Event]
- HH:MM - [Event]

### Root Cause
[Technical explanation]

### Impact
- Users affected: X
- Revenue impact: $Y
- Data loss: None/Describe

### Action Items
| Action | Owner | Due Date |
|--------|-------|----------|
| Add monitoring for X | @name | YYYY-MM-DD |
| Improve circuit breaker | @name | YYYY-MM-DD |

### Lessons Learned
- [What we learned]
```
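The Duration field can be derived from the first and last Timeline entries. The sketch below computes the minutes between two `HH:MM` timestamps; the function name `minutes_between` is hypothetical, and it assumes both times fall on the same day.

```bash
# Minutes between two same-day HH:MM timestamps (start first, end second).
# Assumes the incident does not cross midnight.
minutes_between() {
  start_h="${1%%:*}"; start_m="${1##*:}"
  end_h="${2%%:*}";   end_m="${2##*:}"

  # Strip one leading zero so values like "09" are not parsed as octal.
  start_h="${start_h#0}"; start_m="${start_m#0}"
  end_h="${end_h#0}";     end_m="${end_m#0}"

  echo $(( (end_h * 60 + end_m) - (start_h * 60 + start_m) ))
}
```

For example, `minutes_between 09:15 11:45` prints `150`, which corresponds to a duration of 2.5 hours.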

Examples

Input: "API is returning 500 errors"
Action: Check logs, identify the failing component, roll back if there was a recent deploy, then fix.

Input: "Database is overloaded"
Action: Kill long-running queries, scale read replicas, optimize or cache hot queries.