incident-runbook-templates
Incident Runbook Templates
Production-ready templates for incident response runbooks covering detection, triage, mitigation, resolution, and communication.
When to Use This Skill
- Creating incident response procedures
- Building service-specific runbooks
- Establishing escalation paths
- Documenting recovery procedures
- Responding to active incidents
- Onboarding on-call engineers
Core Concepts
1. Incident Severity Levels
| Severity | Impact | Response Time | Example |
|---|---|---|---|
| SEV1 | Complete outage, data loss | 15 min | Production down |
| SEV2 | Major degradation | 30 min | Critical feature broken |
| SEV3 | Minor impact | 2 hours | Non-critical bug |
| SEV4 | Minimal impact | Next business day | Cosmetic issue |
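The table above can also be encoded as a small lookup so that tooling (for example a paging script) uses the same targets. This is a minimal sketch: the function name `sev_response_minutes` is hypothetical, and SEV4 returns a label rather than a number because the table gives no minute target for it.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map a severity label to its response-time target in minutes.
# Values mirror the severity table; SEV4 has no minute target ("next business day").
sev_response_minutes() {
  case "$1" in
    SEV1) echo 15 ;;
    SEV2) echo 30 ;;
    SEV3) echo 120 ;;
    SEV4) echo "next-business-day" ;;
    *)    echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```

For example, `sev_response_minutes SEV2` prints `30`, and an unrecognized label returns a non-zero exit status.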
2. Runbook Structure
1. Overview & Impact
2. Detection & Alerts
3. Initial Triage
4. Mitigation Steps
5. Root Cause Investigation
6. Resolution Procedures
7. Verification & Rollback
8. Communication Templates
9. Escalation Matrix
Runbook Templates
Template 1: Service Outage Runbook
[Service Name] Outage Runbook
Overview
Service: Payment Processing Service
Owner: Platform Team
Slack: #payments-incidents
PagerDuty: payments-oncall
Impact Assessment
- Which customers are affected?
- What percentage of traffic is impacted?
- Are there financial implications?
- What's the blast radius?
Detection
Alerts
- `payment_error_rate > 5%` (PagerDuty)
- `payment_latency_p99 > 2s` (Slack)
- `payment_success_rate < 95%` (PagerDuty)
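To make the thresholds concrete, the sketch below reports which of the three alert conditions a set of readings would trip. The function name `firing_alerts` and its argument order are assumptions for illustration; in practice these rules would live in the alerting system rather than in a script.

```bash
#!/usr/bin/env bash
# Hypothetical helper: report which of the three alert conditions would fire.
# Arguments: error rate (%), p99 latency (seconds), success rate (%).
# awk is used because bash arithmetic does not handle decimals.
firing_alerts() {
  awk -v err="$1" -v p99="$2" -v ok="$3" 'BEGIN {
    if (err + 0 > 5) print "payment_error_rate > 5% (PagerDuty)"
    if (p99 + 0 > 2) print "payment_latency_p99 > 2s (Slack)"
    if (ok + 0 < 95) print "payment_success_rate < 95% (PagerDuty)"
  }'
}
```

Calling `firing_alerts 7.5 0.5 99` prints only the error-rate line; readings within all thresholds print nothing.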
Dashboards
Initial Triage (First 5 Minutes)
1. Assess Scope
```bash
# Check service health
kubectl get pods -n payments -l app=payment-service

# Check recent deployments
kubectl rollout history deployment/payment-service -n payments

# Check error rates
# -G with --data-urlencode keeps curl from treating {} and [] in the query as URL globs
curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=sum(rate(http_requests_total{status=~'5..'}[5m]))"
```
2. Quick Health Checks
- Can you reach the service? `curl -I https://api.company.com/payments/health`
- Database connectivity? Check connection pool metrics
- External dependencies? Check Stripe, bank API status
- Recent changes? Check deploy history
3. Initial Classification
| Symptom | Likely Cause | Go To Section |
|---|---|---|
| All requests failing | Service down | Section 4.1 |
| High latency | Database/dependency | Section 4.2 |
| Partial failures | Code bug | Section 4.3 |
| Spike in errors | Traffic surge | Section 4.4 |
Mitigation Procedures
4.1 Service Completely Down
```bash
# Step 1: Check pod status
kubectl get pods -n payments

# Step 2: If pods are crash-looping, check logs
# (add --previous to read the logs of the container that just crashed)
kubectl logs -n payments -l app=payment-service --tail=100

# Step 3: Check recent deployments
kubectl rollout history deployment/payment-service -n payments

# Step 4: ROLLBACK if recent deploy is suspect
kubectl rollout undo deployment/payment-service -n payments

# Step 5: Scale up if resource constrained
kubectl scale deployment/payment-service -n payments --replicas=10

# Step 6: Verify recovery
kubectl rollout status deployment/payment-service -n payments
```
4.2 High Latency
```bash
# Step 1: Check database connections
kubectl exec -n payments deploy/payment-service -- \
  curl -s localhost:8080/metrics | grep db_pool

# Step 2: Check slow queries (if DB issue)
# The WHERE clause repeats the expression: PostgreSQL does not allow
# the "duration" alias to be referenced there.
psql -h $DB_HOST -U $DB_USER -c "
SELECT pid, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active' AND now() - query_start > interval '5 seconds'
ORDER BY duration DESC;"

# Step 3: Kill long-running queries if needed
# Set PID to a pid returned by Step 2 before running this.
psql -h $DB_HOST -U $DB_USER -c "SELECT pg_terminate_backend($PID);"

# Step 4: Check external dependency latency
# (expects a curl-format.txt write-out template in the current directory)
curl -w "@curl-format.txt" -o /dev/null -s https://api.stripe.com/v1/health

# Step 5: Enable circuit breaker if dependency is slow
kubectl set env deployment/payment-service \
  STRIPE_CIRCUIT_BREAKER_ENABLED=true -n payments
```
4.3 Partial Failures (Specific Errors)
```bash
# Step 1: Identify error pattern
kubectl logs -n payments -l app=payment-service --tail=500 | \
  grep -i error | sort | uniq -c | sort -rn | head -20

# Step 2: Check error tracking
# Go to Sentry: https://sentry.io/payments

# Step 3: If specific endpoint, enable feature flag to disable
# (the Content-Type header assumes the internal API expects a JSON body)
curl -X POST https://api.company.com/internal/feature-flags \
  -H 'Content-Type: application/json' \
  -d '{"flag": "DISABLE_PROBLEMATIC_FEATURE", "enabled": true}'

# Step 4: If data issue, check recent data changes
psql -h $DB_HOST -c "
SELECT * FROM audit_log
WHERE table_name = 'payment_methods'
AND created_at > now() - interval '1 hour';"
```
4.4 Traffic Surge
```bash
# Step 1: Check current load
# (kubectl top reports CPU and memory per pod, a proxy for request rate)
kubectl top pods -n payments

# Step 2: Scale horizontally
kubectl scale deployment/payment-service -n payments --replicas=20

# Step 3: Enable rate limiting
kubectl set env deployment/payment-service \
  RATE_LIMIT_ENABLED=true \
  RATE_LIMIT_RPS=1000 -n payments

# Step 4: If attack, block suspicious IPs
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-suspicious
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payment-service
  ingress:
  - from:
    - ipBlock:
        cidr: 0.0.0.0/0
        except:
        - 192.168.1.0/24  # Suspicious range
EOF
```
Verification Steps
```bash
# Verify service is healthy
curl -s https://api.company.com/payments/health | jq

# Verify error rate is back to normal
# -G with --data-urlencode keeps curl from treating {} and [] in the query as URL globs
curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=sum(rate(http_requests_total{status=~'5..'}[5m]))" \
  | jq '.data.result[0].value[1]'

# Verify latency is acceptable (p99 back under the 2s alert threshold)

# Smoke test critical flows
./scripts/smoke-test-payments.sh
```
Rollback Procedures
```bash
# Rollback Kubernetes deployment
kubectl rollout undo deployment/payment-service -n payments

# Rollback database migration (if applicable)
./scripts/db-rollback.sh $MIGRATION_VERSION

# Rollback feature flag
# (the Content-Type header assumes the internal API expects a JSON body)
curl -X POST https://api.company.com/internal/feature-flags \
  -H 'Content-Type: application/json' \
  -d '{"flag": "NEW_PAYMENT_FLOW", "enabled": false}'
```
Escalation Matrix
| Condition | Escalate To | Contact |
|---|---|---|
| > 15 min unresolved SEV1 | Engineering Manager | @manager (Slack) |
| Data breach suspected | Security Team | #security-incidents |
| Financial impact > $10k | Finance + Legal | @finance-oncall |
| Customer communication needed | Support Lead | @support-lead |
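The matrix can be read as four independent checks, any number of which may apply at once. The sketch below expresses that reading; the function name `escalation_targets`, the yes/no flags, and the whole-number inputs are illustrative assumptions, not part of the runbook itself.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list who to escalate to, following the escalation matrix.
# Arguments: severity, minutes unresolved (integer), breach suspected (yes/no),
#            estimated financial impact in USD (integer), customer comms needed (yes/no).
escalation_targets() {
  local sev="$1" minutes="$2" breach="$3" impact_usd="$4" customer_comms="$5"
  if [ "$sev" = "SEV1" ] && [ "$minutes" -gt 15 ]; then
    echo "Engineering Manager: @manager (Slack)"
  fi
  if [ "$breach" = "yes" ]; then
    echo "Security Team: #security-incidents"
  fi
  if [ "$impact_usd" -gt 10000 ]; then
    echo "Finance + Legal: @finance-oncall"
  fi
  if [ "$customer_comms" = "yes" ]; then
    echo "Support Lead: @support-lead"
  fi
}
```

For example, `escalation_targets SEV1 20 no 0 no` prints only the Engineering Manager line, while a SEV2 with the same inputs prints nothing.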
Communication Templates
Initial Notification (Internal)
🚨 INCIDENT: Payment Service Degradation
Severity: SEV2
Status: Investigating
Impact: ~20% of payment requests failing
Start Time: [TIME]
Incident Commander: [NAME]
Current Actions:
- Investigating root cause
- Scaling up service
- Monitoring dashboards
Updates in #payments-incidents
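If the notification is posted from a script, the fixed parts of the template can be kept in one place and only the variable fields passed in. This is a rough sketch under that assumption: `initial_notification` is a hypothetical name, and the "Current Actions" list is left out because it changes from incident to incident.

```bash
#!/usr/bin/env bash
# Hypothetical helper: fill in the initial notification from a few variables.
# Arguments: title, severity, impact summary, start time, incident commander.
initial_notification() {
  cat <<EOF
🚨 INCIDENT: $1
Severity: $2
Status: Investigating
Impact: $3
Start Time: $4
Incident Commander: $5
Updates in #payments-incidents
EOF
}
```

A call such as `initial_notification "Payment Service Degradation" SEV2 "~20% of payment requests failing" "[TIME]" "[NAME]"` prints the message text, which can then be piped to whatever chat tooling the team uses.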
Status Update
📊 UPDATE: Payment Service Incident
Status: Mitigating
Impact: Reduced to ~5% failure rate
Duration: 25 minutes
Actions Taken:
- Rolled back deployment v2.3.4 → v2.3.3
- Scaled service from 5 → 10 replicas
Next Steps:
- Continuing to monitor
- Root cause analysis in progress
ETA to Resolution: ~15 minutes
Resolution Notification
✅ RESOLVED: Payment Service Incident
Duration: 45 minutes
Impact: ~5,000 affected transactions
Root Cause: Memory leak in v2.3.4
Resolution:
- Rolled back to v2.3.3
- Transactions auto-retried successfully
Follow-up:
- Postmortem scheduled for [DATE]
- Bug fix in progress
Template 2: Database Incident Runbook
Database Incident Runbook
Quick Reference
| Issue | Command |
|---|---|
| Check connections | |
| Kill query | |
| Check replication lag | |
| Check locks | |
Connection Pool Exhaustion
```sql
-- Check current connections
SELECT datname, usename, state, count(*)
FROM pg_stat_activity
GROUP BY datname, usename, state
ORDER BY count(*) DESC;

-- Identify long-running connections
SELECT pid, usename, datname, state, query_start, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start;

-- Terminate idle connections
-- state_change marks when the session became idle; query_start only
-- marks when its last query began.
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
AND state_change < now() - interval '10 minutes';
```
Replication Lag
```sql
-- Check lag on replica
SELECT
  CASE
    WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0
    ELSE extract(epoch from now() - pg_last_xact_replay_timestamp())
  END AS lag_seconds;

-- If lag > 60s, consider:
-- 1. Check network between primary/replica
-- 2. Check replica disk I/O
-- 3. Consider failover if unrecoverable
```
Disk Space Critical
```bash
# Check disk usage
df -h /var/lib/postgresql/data

# Find large tables
psql -c "SELECT relname, pg_size_pretty(pg_total_relation_size(relid))
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;"

# VACUUM to reclaim space
# Note: VACUUM FULL rewrites the table. It holds an ACCESS EXCLUSIVE lock
# and temporarily needs free disk space for the new copy.
psql -c "VACUUM FULL large_table;"

# If emergency, delete old data or expand disk
```
Best Practices
Do's
- Keep runbooks updated - Review after every incident
- Test runbooks regularly - Game days, chaos engineering
- Include rollback steps - Always have an escape hatch
- Document assumptions - What must be true for steps to work
- Link to dashboards - Quick access during stress
Don'ts
- Don't assume knowledge - Write for 3 AM brain
- Don't skip verification - Confirm each step worked
- Don't forget communication - Keep stakeholders informed
- Don't work alone - Escalate early
- Don't skip postmortems - Learn from every incident