write-incident-runbook


Write Incident Runbook

Create actionable runbooks that guide responders through incident diagnosis and resolution.

When to Use


  • Documenting response procedures for recurring alerts or incidents
  • Standardizing incident response across on-call rotation members
  • Reducing mean time to resolution (MTTR) with clear diagnostic steps
  • Creating training materials for new team members on incident handling
  • Establishing escalation paths and communication protocols
  • Migrating tribal knowledge to written documentation
  • Linking alerts to resolution procedures (alert annotations)

Inputs


  • Required: Incident or alert name/description
  • Required: Historical incident data and resolution patterns
  • Optional: Diagnostic queries (Prometheus, logs, traces)
  • Optional: Escalation contacts and communication channels
  • Optional: Previous incident post-mortems

Procedure


Step 1: Choose Runbook Template Structure


See Extended Examples for complete template files.
Select an appropriate template based on incident type and complexity.

Basic runbook template structure:

```markdown
# [Alert/Incident Name] Runbook

## Overview | Severity | Symptoms

## Diagnostic Steps | Resolution Steps

## Escalation | Communication | Prevention | Related
```


**Advanced SRE runbook template** (excerpt):

```markdown
# [Service Name] - [Incident Type] Runbook

## Metadata
- Service, Owner, Severity, On-Call, Last Updated

## Diagnostic Phase
Quick Health Check (< 5 min): Dashboard, error rate, deployments
Detailed Investigation (5-20 min): Metrics, logs, traces, failure patterns

... (see EXAMPLES.md for complete template)
```
Key template components:
- **Metadata**: Service ownership, severity, on-call rotation
- **Diagnostic Phase**: Quick checks → detailed investigation → failure patterns
- **Resolution Phase**: Immediate mitigation → root cause fix → verification
- **Escalation**: Criteria and contact paths
- **Communication**: Internal/external templates
- **Prevention**: Short/long-term actions

**Expected:** The selected template matches the incident's complexity, and its sections suit the service type.

**On failure:**
- Start with basic template, iterate based on incident patterns
- Review industry examples (Google SRE books, vendor runbooks)
- Adapt template based on team feedback after first use
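To make the basic template easy to adopt, a small scaffold script can stamp out a new runbook file. This is a minimal sketch: the output path and the `_Last updated_` footer line are illustrative additions, not part of the template above.

```shell
#!/bin/sh
# Sketch: scaffold a new runbook file from the basic template.
# The output path and footer line are illustrative; adapt to your
# wiki or repo layout.
NAME="${1:-high-error-rate}"
OUT="${2:-.}/$NAME.md"

cat > "$OUT" <<'EOF'
# [Alert/Incident Name] Runbook

## Overview | Severity | Symptoms

## Diagnostic Steps

## Resolution Steps

## Escalation | Communication | Prevention | Related

_Last updated: YYYY-MM-DD by [owner]_
EOF

echo "Created $OUT"
```

Running `sh scaffold.sh high-latency runbooks` would create `runbooks/high-latency.md` ready to fill in.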


Step 2: Document Diagnostic Procedures


See Extended Examples for complete diagnostic queries and decision trees.
Create step-by-step investigation procedures with specific queries.

Six-step diagnostic checklist:

  1. **Verify Service Health**: Health endpoint checks and uptime metrics
     ```bash
     curl -I https://api.example.com/health  # Expected: HTTP 200 OK
     ```
     ```promql
     up{job="api-service"}  # Expected: 1 for all instances
     ```
  2. **Check Error Rate**: Current error percentage and breakdown by endpoint
     ```promql
     sum(rate(http_requests_total{status=~"5.."}[5m]))
       / sum(rate(http_requests_total[5m])) * 100  # Expected: < 1%
     ```
  3. **Analyze Logs**: Recent errors and top error messages from Loki
     ```logql
     {job="api-service"} |= "error" | json | level="error"
     ```
  4. **Check Resource Utilization**: CPU, memory, and connection pool status
     ```promql
     avg(rate(container_cpu_usage_seconds_total{pod=~"api-service.*"}[5m])) * 100
     # Expected: < 70%
     ```
  5. **Review Recent Changes**: Deployments, git commits, infrastructure changes
  6. **Examine Dependencies**: Downstream service health, database/API latency

Failure pattern decision tree (excerpt):
  • Service down? → Check all pods/instances
  • Error rate elevated? → Check specific error types (5xx, gateway, database, timeouts)
  • When did it start? → After deployment (rollback), gradual (resource leak), sudden (traffic/dependency)

**Expected:** Diagnostic procedures are specific, include expected vs. actual values, and guide the responder through the investigation.

**On failure:**
  • Test queries in the actual monitoring system before documenting
  • Include screenshots of dashboards for visual reference
  • Add a "Common mistakes" section for frequently missed steps
  • Iterate based on feedback from incident responders
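The failure-pattern decision tree lends itself to a tiny triage helper that a responder can run before diving into queries. This is a sketch: the `triage` function name and the symptom keywords are assumptions, not standard tooling.

```shell
# Sketch: encode the failure-pattern decision tree as a triage helper.
# The symptom keywords ("down", "errors", etc.) are illustrative.
triage() {
  case "$1" in
    down)    echo "Service down -> check all pods/instances" ;;
    errors)  echo "Elevated error rate -> break down by type: 5xx, gateway, database, timeouts" ;;
    deploy)  echo "Started after deployment -> roll back" ;;
    gradual) echo "Gradual degradation -> suspect resource leak" ;;
    sudden)  echo "Sudden onset -> check traffic spike or dependency failure" ;;
    *)       echo "Unknown pattern -> escalate to detailed investigation" ;;
  esac
}

triage deploy   # → Started after deployment -> roll back
```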

Step 3: Define Resolution Procedures


See Extended Examples for all 5 resolution options with full commands and rollback procedures.
Document step-by-step remediation with rollback options.

Five resolution options (brief summary):

  1. **Rollback Deployment** (fastest): For post-deployment errors
     ```bash
     kubectl rollout undo deployment/api-service
     ```
     Verify → Monitor → Confirm resolution (error rate < 1%, latency normal, no alerts)
  2. **Scale Up Resources**: For high CPU/memory, connection pool exhaustion
     ```bash
     # Scale to 1.5x the current replica count
     current=$(kubectl get deployment/api-service -o jsonpath='{.spec.replicas}')
     kubectl scale deployment/api-service --replicas=$((current * 3 / 2))
     ```
  3. **Restart Service**: For memory leaks, stuck connections, cache corruption
     ```bash
     kubectl rollout restart deployment/api-service
     ```
  4. **Feature Flag / Circuit Breaker**: For specific feature errors or external dependency failures
     ```bash
     kubectl set env deployment/api-service FEATURE_NAME=false
     ```
  5. **Database Remediation**: For database connections, slow queries, pool exhaustion
     ```sql
     -- Kill long-running queries, restart connection pool, increase pool size
     ```

Universal verification checklist:
  • Error rate < 1%
  • Latency P99 < threshold
  • Throughput at baseline
  • Resource usage healthy (CPU < 70%, Memory < 80%)
  • Dependencies healthy
  • User-facing tests pass
  • No active alerts

Rollback procedure: If a resolution worsens the situation → pause/cancel → revert → reassess

**Expected:** Resolution steps are clear, include verification checks, and provide a rollback option for each action.

**On failure:**
  • Add more granular steps for complex procedures
  • Include screenshots or diagrams for multi-step processes
  • Document command outputs (expected vs. actual)
  • Create a separate runbook for complex resolution procedures
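The verification checklist can be partially automated so "confirm it's fixed" is not left to memory. A minimal sketch, assuming the error-rate, CPU, and memory percentages have already been fetched from Prometheus and are passed in as plain numbers; the `verify_recovery` name is illustrative.

```shell
# Sketch: check post-resolution metrics against the checklist thresholds.
# Callers are assumed to fetch the percentages from Prometheus first.
verify_recovery() {
  err_pct="$1"; cpu_pct="$2"; mem_pct="$3"
  fail=0
  [ "$(awk -v v="$err_pct" 'BEGIN{print ((v < 1) ? 1 : 0)}')" -eq 1 ] \
    || { echo "FAIL: error rate ${err_pct}% >= 1%"; fail=1; }
  [ "$(awk -v v="$cpu_pct" 'BEGIN{print ((v < 70) ? 1 : 0)}')" -eq 1 ] \
    || { echo "FAIL: CPU ${cpu_pct}% >= 70%"; fail=1; }
  [ "$(awk -v v="$mem_pct" 'BEGIN{print ((v < 80) ? 1 : 0)}')" -eq 1 ] \
    || { echo "FAIL: memory ${mem_pct}% >= 80%"; fail=1; }
  [ "$fail" -eq 0 ] && echo "PASS: resolution verified" || return 1
}

verify_recovery 0.4 55 62   # → PASS: resolution verified
```

A nonzero exit status means at least one threshold failed, so the helper composes with `&&` in larger scripts.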

Step 4: Establish Escalation Paths


See Extended Examples for full escalation levels and contact directory template.
Define when and how to escalate incidents.
When to escalate immediately:
  • Customer-facing outage > 15 minutes
  • SLO error budget > 10% depleted
  • Data loss/corruption or security breach suspected
  • Unable to identify root cause within 20 minutes
  • Mitigation attempts fail or worsen situation
Five escalation levels:
  1. **Primary On-Call** (5 min response): Deploy fixes, rollback, scale (up to 30 min solo)
  2. **Secondary On-Call** (auto after 15 min): Additional investigation support
  3. **Team Lead** (architectural decisions): Database changes, vendor escalation, incidents > 1 hour
  4. **Incident Commander** (cross-team coordination): Multiple teams, customer comms, incidents > 2 hours
  5. **Executive** (C-level): Major impact (> 50% of users), SLA breach, media/PR, outages > 4 hours
Escalation process:
  1. Notify target with: current status, impact, actions taken, help needed, dashboard link
  2. Handoff if needed: share timeline, actions, access, remain available
  3. Don't go silent: update every 15 min, ask questions, provide feedback
Contact directory: Maintain table with role, Slack, phone, PagerDuty for:
  • Platform/Database/Security/Network teams
  • Incident Commander
  • External vendors (AWS, database vendor, CDN provider)
**Expected:** Clear criteria for escalation, contact information readily accessible, and escalation paths aligned with the organizational structure.

**On failure:**
  • Validate that contact information is current (test quarterly)
  • Add a decision tree for when to escalate
  • Include examples of escalation messages
  • Document response-time expectations for each level
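As a rough aid, the duration thresholds above can be encoded in a helper that suggests the current escalation level. This is a deliberate simplification — real escalation also weighs impact, not just elapsed time — and the `escalation_level` name is illustrative.

```shell
# Sketch: suggest an escalation level from elapsed minutes only.
# Real escalation also weighs impact (user %, data loss, SLA risk).
escalation_level() {
  mins="$1"
  if   [ "$mins" -le 30 ];  then echo "L1: Primary on-call"
  elif [ "$mins" -le 60 ];  then echo "L2: Secondary on-call"
  elif [ "$mins" -le 120 ]; then echo "L3: Team lead"
  elif [ "$mins" -le 240 ]; then echo "L4: Incident commander"
  else                           echo "L5: Executive"
  fi
}

escalation_level 90   # → L3: Team lead
```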

Step 5: Create Communication Templates


See Extended Examples for all internal and external templates with full formatting.
Provide pre-written messages for incident updates.
Internal templates (Slack #incident-response):
  1. Initial Declaration:
    🚨 INCIDENT: [Title] | Severity: [Critical/High/Medium]
    Impact: [users/services] | Owner: @username | Dashboard: [link]
    Quick Summary: [1-2 sentences] | Next update: 15 min
  2. Progress Update (every 15-30 min):
    📊 UPDATE #N | Status: [Investigating/Mitigating/Monitoring]
    Actions: [what we tried and outcomes]
    Theory: [what we think is happening]
    Next: [planned actions]
  3. Mitigation Complete:
    ✅ MITIGATION | Metrics: Error [before→after], Latency [before→after]
    Root Cause: [brief or "investigating"] | Monitoring 30min before resolved
  4. Resolution:
    🎉 RESOLVED | Duration: [time] | Root Cause + Impact + Follow-up actions
  5. False Alarm: No impact, no follow-up needed
External templates (status page):
  • Initial: Investigating, started time, next update in 15 min
  • Progress: Identified cause (customer-friendly), implementing fix, estimated resolution
  • Resolution: Resolved time, root cause (simple), duration, prevention measures
Customer email template: Timeline, impact description, resolution, prevention, compensation (if applicable)
**Expected:** Templates save time during incidents, ensure consistent communication, and reduce cognitive load on responders.

**On failure:**
  • Customize templates to match the company's communication style
  • Pre-fill templates for common incident types
  • Create a Slack workflow/bot to populate templates automatically
  • Review templates during incident retrospectives
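To reduce typing under pressure, the initial-declaration template can be generated from variables. A sketch: the `incident_declaration` function name is an assumption, and the actual Slack posting step (webhook or bot) is intentionally left out.

```shell
# Sketch: fill the initial-declaration template from arguments.
# Posting to Slack (webhook/bot) is intentionally left out.
incident_declaration() {
  title="$1"; severity="$2"; impact="$3"; owner="$4"; dashboard="$5"
  cat <<EOF
🚨 INCIDENT: $title | Severity: $severity
Impact: $impact | Owner: @$owner | Dashboard: $dashboard
Quick Summary: [1-2 sentences] | Next update: 15 min
EOF
}

incident_declaration "Checkout API 5xx spike" High "checkout flow degraded" \
  alice "https://grafana.example.com/d/service-overview"
```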

Step 6: Link Runbook to Monitoring


See Extended Examples for complete Prometheus alert configuration and Grafana dashboard JSON.
Integrate the runbook with alerts and dashboards.

Add runbook links to Prometheus alert annotations:
```yaml
- alert: HighErrorRate
  annotations:
    runbook_url: "https://wiki.example.com/runbooks/high-error-rate"
    dashboard_url: "https://grafana.example.com/d/service-overview"
    incident_channel: "#incident-platform"
```
Embed quick diagnostic links in runbook:
  • Service Overview Dashboard
  • Error Rate Last 1h (Prometheus direct link)
  • Recent Error Logs (Loki/Grafana Explore)
  • Recent Deployments (GitHub/CI)
  • PagerDuty Incidents
Create Grafana dashboard panel with runbook links (markdown panel listing all incident runbooks with on-call and escalation info)
**Expected:** Responders can access runbooks directly from alerts or dashboards, diagnostic queries are pre-filled, and relevant tools are one click away.

**On failure:**
  • Verify runbook URLs are accessible without VPN/login
  • Use URL shorteners for complex Grafana/Prometheus links
  • Test links quarterly to ensure they don't break
  • Create browser bookmarks for frequently used runbooks
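The quarterly link test can be scripted by first extracting every `runbook_url` from the Prometheus rule files. A sketch using plain grep/sed; the function name and rules-directory argument are assumptions. Pipe the output into `curl` or a link checker.

```shell
# Sketch: list unique runbook_url values from Prometheus rule files,
# ready to feed into a link checker for the quarterly test.
extract_runbook_urls() {
  grep -rho 'runbook_url: *"[^"]*"' "$1" \
      --include='*.yml' --include='*.yaml' \
    | sed 's/runbook_url: *"\(.*\)"/\1/' \
    | sort -u
}

# Example: extract_runbook_urls rules/
```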

Validation


  • Runbook follows consistent template structure
  • Diagnostic procedures include specific queries and expected values
  • Resolution steps are actionable with clear commands
  • Escalation criteria and contacts are current
  • Communication templates provided for internal and external audiences
  • Runbook linked from monitoring alerts and dashboards
  • Runbook tested during incident simulation or actual incident
  • Feedback from responders incorporated into runbook
  • Revision history tracked with dates and authors
  • Runbook accessible without authentication (or cached offline)

Common Pitfalls


  • Too generic: Runbooks with vague steps like "check the logs" without specific queries are not actionable. Be specific.
  • Outdated information: Runbooks referencing old systems or commands become useless. Review quarterly.
  • No verification steps: Resolution without verification leads to false positives. Always include "how to confirm it's fixed."
  • Missing rollback procedures: Every action should have a rollback plan. Don't trap responders in worse state.
  • Assuming knowledge: Runbooks for experts only exclude junior engineers. Write for the least experienced person on rotation.
  • No ownership: Runbooks without owners become stale. Assign team/person responsible for updates.
  • Hidden behind auth: Runbooks inaccessible during VPN/SSO issues are useless during crisis. Cache copies or use public wiki.

Related Skills


  • configure-alerting-rules
    - Link runbooks to alert annotations for immediate access during incidents
  • build-grafana-dashboards
    - Embed runbook links in dashboards and diagnostic panels
  • setup-prometheus-monitoring
    - Include diagnostic queries from Prometheus in runbook procedures
  • define-slo-sli-sla
    - Reference SLO impact in incident severity classification