Incident Response
Handle production incidents systematically.
When to Use
- Production is down or degraded
- Critical errors affecting users
- Security incidents
- Data issues
- Performance emergencies
Incident Workflow
DETECT → TRIAGE → MITIGATE → RESOLVE → REVIEW
1. Detect & Triage
```bash
# Quick health checks
curl -s https://api.example.com/health | jq .
kubectl get pods -n production | grep -v Running

# Check recent deployments
git log --oneline -5
kubectl rollout history deployment/app

# Error rates
grep -c "ERROR" /var/log/app.log
```
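A raw count from `grep -c` is hard to judge without knowing how much traffic the log covers. As a rough sketch, the helper below turns the count into a percentage of all log lines. The function name `error_rate` is hypothetical, and it assumes one log event per line with the literal string `ERROR` marking failures.

```bash
# error_rate FILE: print the percentage of lines in FILE that contain "ERROR".
# Assumes one log event per line; prints 0.00 for an empty file.
error_rate() {
  awk '/ERROR/ { errors++ }
       END {
         if (NR == 0) print "0.00"
         else printf "%.2f\n", 100 * errors / NR
       }' "$1"
}
```

Usage would look like `error_rate /var/log/app.log`. Comparing the result against the same figure from before the incident gives a quick sense of whether errors are actually elevated.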
2. Mitigate First
Priority: Stop the bleeding before looking for the root cause.
```bash
# Rollback deployment
kubectl rollout undo deployment/app

# Scale up if overloaded
kubectl scale deployment/app --replicas=10

# Feature flag disable (placeholder admin endpoint; https to match the other calls)
curl -X POST https://api.example.com/admin/flags -d '{"feature": false}'

# Circuit breaker
# Block problematic endpoint or dependency
```
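Mitigation commands change production state, so it can help to preview them before running them for real. The sketch below wraps the rollback in a dry-run switch and then waits for the rollout to settle. The names `run`, `rollback_and_wait`, and `DRY_RUN` are hypothetical, and the 120-second timeout is an arbitrary assumption; `kubectl rollout undo` and `kubectl rollout status --timeout` are standard kubectl subcommands.

```bash
# run CMD...: execute the command, or only print it when DRY_RUN=1.
# The printed form is for human review only (quoting is not preserved).
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# rollback_and_wait: undo the last rollout, then block until it settles
# or the (assumed) 120s timeout expires.
rollback_and_wait() {
  run kubectl rollout undo deployment/app &&
    run kubectl rollout status deployment/app --timeout=120s
}
```

Calling `DRY_RUN=1 rollback_and_wait` prints the two commands without touching the cluster; calling `rollback_and_wait` on its own runs them.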
3. Investigate
```bash
# Recent logs
kubectl logs -l app=myapp --since=30m | grep -i error

# Resource usage
kubectl top pods -n production

# Database connections (run through psql; connection settings are assumed
# to come from the PG* environment variables)
psql -c "SELECT count(*) FROM pg_stat_activity WHERE state = 'active';"

# Network issues (needs a curl-format.txt file of -w variables in the current directory)
curl -w "@curl-format.txt" -o /dev/null -s https://api.example.com
```
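The network check reads its output format from `curl-format.txt`, which the source does not show. One possible version is sketched below; the labels on the left are arbitrary choices, while each `%{...}` name is a standard curl write-out variable.

```bash
# Create a format file for curl -w. Each line ends with a literal \n,
# which curl turns into a newline when it prints the timings.
cat > curl-format.txt <<'EOF'
    http_code: %{http_code}\n
   dns_lookup: %{time_namelookup}s\n
  tcp_connect: %{time_connect}s\n
tls_handshake: %{time_appconnect}s\n
   first_byte: %{time_starttransfer}s\n
        total: %{time_total}s\n
EOF
```

A large gap between `tcp_connect` and `first_byte` points at the server or its dependencies, while slow `dns_lookup` or `tcp_connect` values point at the network path.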
Severity Levels
| Level | Impact | Response Time | Example |
|---|---|---|---|
| P1 | Complete outage | Immediate | Site down |
| P2 | Major feature broken | 15 min | Payments failing |
| P3 | Minor feature broken | 1 hour | Search slow |
| P4 | Low impact | Next day | UI glitch |
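If paging or reminder scripts need these targets, the table can be encoded as a small lookup. This is only a sketch: `response_time` is a hypothetical helper name, and the values are copied from the table above.

```bash
# response_time LEVEL: print the response target for a severity level.
# Returns 1 and writes to stderr for an unknown level.
response_time() {
  case "$1" in
    P1) echo "Immediate" ;;
    P2) echo "15 min" ;;
    P3) echo "1 hour" ;;
    P4) echo "Next day" ;;
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```

For example, `response_time P2` prints `15 min`.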
Communication Template
```markdown
## Incident Update

Status: Investigating | Identified | Mitigated | Resolved
Severity: P1/P2/P3
Started: YYYY-MM-DD HH:MM UTC
Duration: X hours

### Summary
[1-2 sentences on what's happening]

### Impact
[Who is affected and how]

### Current Actions
- [Action 1]
- [Action 2]

### Next Update
[Time of next update]
```
Post-Mortem Template
```markdown
## Incident Post-Mortem

Date: YYYY-MM-DD
Duration: X hours
Severity: P1

### Summary
[What happened in 2-3 sentences]

### Timeline
- HH:MM - [Event]
- HH:MM - [Event]

### Root Cause
[Technical explanation]

### Impact
- Users affected: X
- Revenue impact: $Y
- Data loss: None/Describe

### Action Items
| Action | Owner | Due Date |
|---|---|---|
| Add monitoring for X | @name | YYYY-MM-DD |
| Improve circuit breaker | @name | YYYY-MM-DD |

### Lessons Learned
- [What we learned]
```
Examples
Input: "API is returning 500 errors"
Action: Check logs, identify the failing component, roll back if a recent deploy is implicated, then fix the underlying bug
Input: "Database is overloaded"
Action: Kill long-running queries, scale read replicas, optimize or cache hot queries
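For the overloaded-database case, the first step is finding which queries are long-running. A minimal sketch, assuming PostgreSQL (as in the `pg_stat_activity` check earlier) and a hypothetical helper name `long_queries_sql`, builds a query that lists active statements older than a given number of minutes:

```bash
# long_queries_sql [MINUTES]: print a PostgreSQL query listing active statements
# that have been running longer than MINUTES (default 5).
long_queries_sql() {
  minutes=${1:-5}
  # Accept digits only, so the value is safe to embed in the SQL text.
  case "$minutes" in
    ''|*[!0-9]*) echo "minutes must be a whole number" >&2; return 1 ;;
  esac
  cat <<EOF
SELECT pid, now() - query_start AS runtime, left(query, 80) AS query
FROM pg_stat_activity
WHERE state = 'active'
  AND pid <> pg_backend_pid()
  AND now() - query_start > interval '${minutes} minutes'
ORDER BY runtime DESC;
EOF
}
```

It could be run as `psql -c "$(long_queries_sql 5)"`. Once a pid is reviewed, `SELECT pg_cancel_backend(pid);` cancels just that query, and `pg_terminate_backend(pid)` closes the whole connection if cancelling is not enough. Listing first and killing by hand is a deliberate choice here, since terminating the wrong backend during an incident can make things worse.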