Incident Runbook Templates

Production-ready templates for incident response runbooks covering detection, triage, mitigation, resolution, and communication.

When to Use This Skill

  • Creating incident response procedures
  • Building service-specific runbooks
  • Establishing escalation paths
  • Documenting recovery procedures
  • Responding to active incidents
  • Onboarding on-call engineers

Core Concepts

1. Incident Severity Levels

| Severity | Impact | Response Time | Example |
|----------|--------|---------------|---------|
| SEV1 | Complete outage, data loss | 15 min | Production down |
| SEV2 | Major degradation | 30 min | Critical feature broken |
| SEV3 | Minor impact | 2 hours | Non-critical bug |
| SEV4 | Minimal impact | Next business day | Cosmetic issue |
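
The severity table can also be encoded as a small lookup so that tooling and humans read the same targets. The sketch below is one possible shape; the function name `response_target` and the symbolic value for SEV4 are assumptions, not part of the original template.

```bash
# Map a severity label to its target response time in minutes.
# SEV4 returns a symbolic value because "next business day" is not a
# fixed duration. Unknown labels print "unknown" and return non-zero.
response_target() {
  case "$1" in
    SEV1) echo "15" ;;
    SEV2) echo "30" ;;
    SEV3) echo "120" ;;
    SEV4) echo "next-business-day" ;;
    *)    echo "unknown"; return 1 ;;
  esac
}
```

For example, `response_target SEV2` prints the 30-minute target from the table.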

2. Runbook Structure

1. Overview & Impact
2. Detection & Alerts
3. Initial Triage
4. Mitigation Steps
5. Root Cause Investigation
6. Resolution Procedures
7. Verification & Rollback
8. Communication Templates
9. Escalation Matrix

Runbook Templates

Template 1: Service Outage Runbook

[Service Name] Outage Runbook

Overview

Service: Payment Processing Service
Owner: Platform Team
Slack: #payments-incidents
PagerDuty: payments-oncall

Impact Assessment

  • Which customers are affected?
  • What percentage of traffic is impacted?
  • Are there financial implications?
  • What's the blast radius?

Detection

Alerts

  • payment_error_rate > 5% (PagerDuty)
  • payment_latency_p99 > 2s (Slack)
  • payment_success_rate < 95% (PagerDuty)
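
To sanity-check these thresholds against a set of readings, for instance during a game day, a helper along the following lines could be used. It is a sketch only: the function name `check_payment_alerts` and its argument order are assumptions, and real alerting would live in the monitoring system rather than in a shell function.

```bash
# Report which of the three alerts above would fire for given readings.
# Arguments: error rate (%), p99 latency (seconds), success rate (%).
# awk performs the comparisons because `[` cannot compare decimals.
check_payment_alerts() {
  awk -v err="$1" -v p99="$2" -v ok="$3" 'BEGIN {
    if (err > 5) print "payment_error_rate -> PagerDuty"
    if (p99 > 2) print "payment_latency_p99 -> Slack"
    if (ok < 95) print "payment_success_rate -> PagerDuty"
  }'
}
```

The comparisons are strict, matching the `>` and `<` in the list, so a reading of exactly 5% error rate does not fire.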

Dashboards

Initial Triage (First 5 Minutes)

1. Assess Scope

bash
# Check service health
kubectl get pods -n payments -l app=payment-service

# Check recent deployments
kubectl rollout history deployment/payment-service -n payments

# Check error rates

2. Quick Health Checks

  • Can you reach the service?
    curl -I https://api.company.com/payments/health
  • Database connectivity? Check connection pool metrics
  • External dependencies? Check Stripe, bank API status
  • Recent changes? Check deploy history

3. Initial Classification

| Symptom | Likely Cause | Go To Section |
|---------|--------------|---------------|
| All requests failing | Service down | Section 4.1 |
| High latency | Database/dependency | Section 4.2 |
| Partial failures | Code bug | Section 4.3 |
| Spike in errors | Traffic surge | Section 4.4 |

Mitigation Procedures

4.1 Service Completely Down

bash
# Step 1: Check pod status
kubectl get pods -n payments

# Step 2: If pods are crash-looping, check logs
kubectl logs -n payments -l app=payment-service --tail=100

# Step 3: Check recent deployments
kubectl rollout history deployment/payment-service -n payments

# Step 4: ROLLBACK if recent deploy is suspect
kubectl rollout undo deployment/payment-service -n payments

# Step 5: Scale up if resource constrained
kubectl scale deployment/payment-service -n payments --replicas=10

# Step 6: Verify recovery
kubectl rollout status deployment/payment-service -n payments
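
When many pods are listed, spotting the crash-looping ones by eye is error-prone. The filter below is a minimal sketch that reads the default `kubectl get pods` table on stdin and prints the affected pod names; the function name `crashlooping_pods` is hypothetical, and it assumes the default column order (NAME, READY, STATUS, ...).

```bash
# Print the names of pods whose STATUS column reads CrashLoopBackOff.
# Reads `kubectl get pods` text output on stdin, so it can be exercised
# with saved output as well as against a live cluster.
crashlooping_pods() {
  awk 'NR > 1 && $3 == "CrashLoopBackOff" { print $1 }'
}

# Usage against a live cluster (requires kubectl access):
#   kubectl get pods -n payments | crashlooping_pods
```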

4.2 High Latency

bash
# Step 1: Check database connections
kubectl exec -n payments deploy/payment-service -- \
  curl localhost:8080/metrics | grep db_pool

# Step 2: Check slow queries (if DB issue)
# The duration filter repeats the expression because PostgreSQL does not
# allow a SELECT alias in the WHERE clause.
psql -h $DB_HOST -U $DB_USER -c "
SELECT pid, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active' AND now() - query_start > interval '5 seconds'
ORDER BY duration DESC;"

# Step 3: Kill long-running queries if needed
# Set PID to a pid returned by Step 2 before running this.
psql -h $DB_HOST -U $DB_USER -c "SELECT pg_terminate_backend($PID);"

# Step 4: Check external dependency latency
curl -w "@curl-format.txt" -o /dev/null -s https://api.stripe.com/v1/health

# Step 5: Enable circuit breaker if dependency is slow
kubectl set env deployment/payment-service \
  STRIPE_CIRCUIT_BREAKER_ENABLED=true -n payments
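
Step 4 reads its output template from `curl-format.txt`, which this runbook does not define. One possible version is sketched below; the choice of fields is an assumption, and each `%{...}` name is a standard curl write-out variable.

```bash
# Create a write-out template for `curl -w "@curl-format.txt"`.
# The heredoc delimiter is quoted so the shell leaves %{...} and \n untouched;
# curl expands them when it prints the timing summary.
cat > curl-format.txt <<'EOF'
   time_namelookup:  %{time_namelookup}s\n
      time_connect:  %{time_connect}s\n
   time_appconnect:  %{time_appconnect}s\n
time_starttransfer:  %{time_starttransfer}s\n
        time_total:  %{time_total}s\n
EOF
```

A large gap between `time_connect` and `time_starttransfer` suggests the remote side is slow to respond rather than the network being slow to connect.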

4.3 Partial Failures (Specific Errors)

bash
# Step 1: Identify error pattern
kubectl logs -n payments -l app=payment-service --tail=500 |
  grep -i error | sort | uniq -c | sort -rn | head -20

# Step 2: Check error tracking

# Step 3: If specific endpoint, enable feature flag to disable
curl -X POST https://api.company.com/internal/feature-flags \
  -d '{"flag": "DISABLE_PROBLEMATIC_FEATURE", "enabled": true}'

# Step 4: If data issue, check recent data changes
psql -h $DB_HOST -c "
SELECT * FROM audit_log
WHERE table_name = 'payment_methods'
AND created_at > now() - interval '1 hour';"

4.4 Traffic Surge

bash
# Step 1: Check current load
# (kubectl top reports CPU and memory per pod, not request rate)
kubectl top pods -n payments

# Step 2: Scale horizontally
kubectl scale deployment/payment-service -n payments --replicas=20

# Step 3: Enable rate limiting
kubectl set env deployment/payment-service \
  RATE_LIMIT_ENABLED=true \
  RATE_LIMIT_RPS=1000 -n payments

# Step 4: If attack, block suspicious IPs
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-suspicious
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payment-service
  ingress:
  - from:
    - ipBlock:
        cidr: 0.0.0.0/0
        except:
        - 192.168.1.0/24  # Suspicious range
EOF

Verification Steps

bash
# Verify service is healthy

# Verify error rate is back to normal

# Verify latency is acceptable

# Smoke test critical flows
./scripts/smoke-test-payments.sh

Rollback Procedures

bash
# Rollback Kubernetes deployment
kubectl rollout undo deployment/payment-service -n payments

# Rollback database migration (if applicable)
./scripts/db-rollback.sh $MIGRATION_VERSION

# Rollback feature flag
curl -X POST https://api.company.com/internal/feature-flags \
  -d '{"flag": "NEW_PAYMENT_FLOW", "enabled": false}'

Escalation Matrix

| Condition | Escalate To | Contact |
|-----------|-------------|---------|
| > 15 min unresolved SEV1 | Engineering Manager | @manager (Slack) |
| Data breach suspected | Security Team | #security-incidents |
| Financial impact > $10k | Finance + Legal | @finance-oncall |
| Customer communication needed | Support Lead | @support-lead |
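
The two numeric rows of the matrix are easy to forget under pressure, so they can be turned into a quick check. This is a sketch under stated assumptions: the function name `escalation_targets` and its arguments are hypothetical, inputs are whole numbers, and the judgment-based rows (suspected breach, customer communication) are left to the incident commander.

```bash
# Print who to escalate to based on the numeric rows of the matrix.
# Arguments: severity, minutes unresolved, estimated financial impact (USD).
# Both numeric arguments must be whole numbers.
escalation_targets() {
  sev="$1"; minutes="$2"; impact_usd="$3"
  if [ "$sev" = "SEV1" ] && [ "$minutes" -gt 15 ]; then
    echo "Engineering Manager (@manager)"
  fi
  if [ "$impact_usd" -gt 10000 ]; then
    echo "Finance + Legal (@finance-oncall)"
  fi
}
```

Calling `escalation_targets SEV1 20 0` prints only the Engineering Manager line, since the financial threshold is not crossed.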

Communication Templates

Initial Notification (Internal)

🚨 INCIDENT: Payment Service Degradation

Severity: SEV2
Status: Investigating
Impact: ~20% of payment requests failing
Start Time: [TIME]
Incident Commander: [NAME]

Current Actions:
- Investigating root cause
- Scaling up service
- Monitoring dashboards

Updates in #payments-incidents
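
Filling in the initial notification by hand costs time in the first minutes of an incident. A small helper such as the following could render it from arguments; the function name `initial_notification` and the argument order are assumptions, and the "Current Actions" list is left out because it varies per incident.

```bash
# Render the initial internal notification from five arguments:
# title, severity, impact summary, start time, incident commander.
initial_notification() {
  cat <<EOF
🚨 INCIDENT: $1

Severity: $2
Status: Investigating
Impact: $3
Start Time: $4
Incident Commander: $5

Updates in #payments-incidents
EOF
}
```

The output can be pasted into Slack as is, or piped to whatever posting tool the team already uses.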

Status Update

📊 UPDATE: Payment Service Incident

Status: Mitigating
Impact: Reduced to ~5% failure rate
Duration: 25 minutes

Actions Taken:
- Rolled back deployment v2.3.4 → v2.3.3
- Scaled service from 5 → 10 replicas

Next Steps:
- Continuing to monitor
- Root cause analysis in progress

ETA to Resolution: ~15 minutes

Resolution Notification

✅ RESOLVED: Payment Service Incident

Duration: 45 minutes
Impact: ~5,000 affected transactions
Root Cause: Memory leak in v2.3.4

Resolution:
- Rolled back to v2.3.3
- Transactions auto-retried successfully

Follow-up:
- Postmortem scheduled for [DATE]
- Bug fix in progress

Template 2: Database Incident Runbook

Database Incident Runbook

Quick Reference

| Issue | Command |
|-------|---------|
| Check connections | `SELECT count(*) FROM pg_stat_activity;` |
| Kill query | `SELECT pg_terminate_backend(pid);` |
| Check replication lag | `SELECT extract(epoch from (now() - pg_last_xact_replay_timestamp()));` |
| Check locks | `SELECT * FROM pg_locks WHERE NOT granted;` |

Connection Pool Exhaustion

sql
-- Check current connections
SELECT datname, usename, state, count(*)
FROM pg_stat_activity
GROUP BY datname, usename, state
ORDER BY count(*) DESC;

-- Identify long-running connections
SELECT pid, usename, datname, state, query_start, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start;

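-- Terminate connections that have been idle for more than 10 minutes
-- (state_change records when the connection entered its current state)
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
AND state_change < now() - interval '10 minutes';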

Replication Lag

sql
-- Check lag on replica
SELECT
  CASE
    WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0
    ELSE extract(epoch from now() - pg_last_xact_replay_timestamp())
  END AS lag_seconds;

-- If lag > 60s, consider:
-- 1. Check network between primary/replica
-- 2. Check replica disk I/O
-- 3. Consider failover if unrecoverable

Disk Space Critical

bash
# Check disk usage
df -h /var/lib/postgresql/data

# Find large tables
psql -c "SELECT relname, pg_size_pretty(pg_total_relation_size(relid))
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;"

# VACUUM to reclaim space
# Note: VACUUM FULL rewrites the table, so it needs free disk space for the
# new copy and holds an exclusive lock on the table while it runs.
psql -c "VACUUM FULL large_table;"

# If emergency, delete old data or expand disk
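
To make "critical" concrete, the disk check can be paired with a threshold. The sketch below parses the portable `df -P` format from stdin; the function name `disk_usage_status` and the default threshold of 90% are assumptions to be tuned per environment.

```bash
# Read `df -P <path>` output on stdin and print CRITICAL or OK followed by
# the use percentage. The optional argument is the threshold in percent
# (defaults to 90, an assumed value).
disk_usage_status() {
  threshold="${1:-90}"
  awk -v limit="$threshold" 'NR == 2 {
    used = $5; sub(/%/, "", used)
    status = (used + 0 >= limit + 0) ? "CRITICAL" : "OK"
    print status, used "%"
  }'
}

# Usage: df -P /var/lib/postgresql/data | disk_usage_status 90
```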

Best Practices

Do's

  • Keep runbooks updated - Review after every incident
  • Test runbooks regularly - Game days, chaos engineering
  • Include rollback steps - Always have an escape hatch
  • Document assumptions - What must be true for steps to work
  • Link to dashboards - Quick access during stress

Don'ts

  • Don't assume knowledge - Write for 3 AM brain
  • Don't skip verification - Confirm each step worked
  • Don't forget communication - Keep stakeholders informed
  • Don't work alone - Escalate early
  • Don't skip postmortems - Learn from every incident

Resources
