write-incident-runbook


Write Incident Runbook

Create actionable runbooks that guide responders through incident diagnosis and resolution.

When to Use


  • Documenting response procedures for recurring alerts or incidents
  • Standardizing incident response across on-call rotation members
  • Reducing mean time to resolution (MTTR) with clear diagnostic steps
  • Creating training materials for new team members on incident handling
  • Establishing escalation paths and communication protocols
  • Migrating tribal knowledge to written documentation
  • Linking alerts to resolution procedures (alert annotations)

Inputs


  • Required: Incident or alert name/description
  • Required: Historical incident data and resolution patterns
  • Optional: Diagnostic queries (Prometheus, logs, traces)
  • Optional: Escalation contacts and communication channels
  • Optional: Previous incident post-mortems

Procedure


Step 1: Choose Runbook Template Structure


See Extended Examples for complete template files.
Select an appropriate template based on incident type and complexity.

Basic runbook template structure:

```markdown
# [Alert/Incident Name] Runbook

## Overview | Severity | Symptoms

## Diagnostic Steps | Resolution Steps

## Escalation | Communication | Prevention | Related
```


**Advanced SRE runbook template** (excerpt):

```markdown
# [Service Name] - [Incident Type] Runbook

## Metadata
- Service, Owner, Severity, On-Call, Last Updated

## Diagnostic Phase
Quick Health Check (< 5 min): Dashboard, error rate, deployments
Detailed Investigation (5-20 min): Metrics, logs, traces, failure patterns

... (see EXAMPLES.md for complete template)
```
Key template components:
- **Metadata**: Service ownership, severity, on-call rotation
- **Diagnostic Phase**: Quick checks → detailed investigation → failure patterns
- **Resolution Phase**: Immediate mitigation → root cause fix → verification
- **Escalation**: Criteria and contact paths
- **Communication**: Internal/external templates
- **Prevention**: Short/long-term actions

**Expected:** The selected template matches the incident's complexity, and its sections suit the service type.

**On failure:**
- Start with basic template, iterate based on incident patterns
- Review industry examples (Google SRE books, vendor runbooks)
- Adapt template based on team feedback after first use
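To make the basic template easy to adopt, a small scaffold script can stamp out a new runbook file. This is a minimal sketch: the output path and the `_Last updated_` footer line are illustrative additions, not part of the template above.

```shell
#!/bin/sh
# Sketch: scaffold a new runbook file from the basic template.
# The output path and footer line are illustrative; adapt to your
# wiki or repo layout.
NAME="${1:-high-error-rate}"
OUT="${2:-.}/$NAME.md"

cat > "$OUT" <<'EOF'
# [Alert/Incident Name] Runbook

## Overview | Severity | Symptoms

## Diagnostic Steps

## Resolution Steps

## Escalation | Communication | Prevention | Related

_Last updated: YYYY-MM-DD by [owner]_
EOF

echo "Created $OUT"
```

Running `sh scaffold.sh high-latency runbooks` would create `runbooks/high-latency.md` ready to fill in.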


Step 2: Document Diagnostic Procedures


See Extended Examples for complete diagnostic queries and decision trees.
Create step-by-step investigation procedures with specific queries.

Six-step diagnostic checklist:

  1. **Verify Service Health**: Health endpoint checks and uptime metrics
     ```bash
     curl -I https://api.example.com/health  # Expected: HTTP 200 OK
     ```
     ```promql
     up{job="api-service"}  # Expected: 1 for all instances
     ```
  2. **Check Error Rate**: Current error percentage and breakdown by endpoint
     ```promql
     sum(rate(http_requests_total{status=~"5.."}[5m]))
       / sum(rate(http_requests_total[5m])) * 100  # Expected: < 1%
     ```
  3. **Analyze Logs**: Recent errors and top error messages from Loki
     ```logql
     {job="api-service"} |= "error" | json | level="error"
     ```
  4. **Check Resource Utilization**: CPU, memory, and connection pool status
     ```promql
     avg(rate(container_cpu_usage_seconds_total{pod=~"api-service.*"}[5m])) * 100
     # Expected: < 70%
     ```
  5. **Review Recent Changes**: Deployments, git commits, infrastructure changes
  6. **Examine Dependencies**: Downstream service health, database/API latency

Failure pattern decision tree (excerpt):
  • Service down? → Check all pods/instances
  • Error rate elevated? → Check specific error types (5xx, gateway, database, timeouts)
  • When did it start? → After deployment (rollback), gradual (resource leak), sudden (traffic/dependency)

**Expected:** Diagnostic procedures are specific, include expected vs. actual values, and guide the responder through the investigation.

**On failure:**
  • Test queries in the actual monitoring system before documenting
  • Include screenshots of dashboards for visual reference
  • Add a "Common mistakes" section for frequently missed steps
  • Iterate based on feedback from incident responders
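The failure-pattern decision tree lends itself to a tiny triage helper that a responder can run before diving into queries. This is a sketch: the `triage` function name and the symptom keywords are assumptions, not standard tooling.

```shell
# Sketch: encode the failure-pattern decision tree as a triage helper.
# The symptom keywords ("down", "errors", etc.) are illustrative.
triage() {
  case "$1" in
    down)    echo "Service down -> check all pods/instances" ;;
    errors)  echo "Elevated error rate -> break down by type: 5xx, gateway, database, timeouts" ;;
    deploy)  echo "Started after deployment -> roll back" ;;
    gradual) echo "Gradual degradation -> suspect resource leak" ;;
    sudden)  echo "Sudden onset -> check traffic spike or dependency failure" ;;
    *)       echo "Unknown pattern -> escalate to detailed investigation" ;;
  esac
}

triage deploy   # → Started after deployment -> roll back
```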

Step 3: Define Resolution Procedures


See Extended Examples for all 5 resolution options with full commands and rollback procedures.
Document step-by-step remediation with rollback options.

Five resolution options (brief summary):

  1. **Rollback Deployment** (fastest): For post-deployment errors
     ```bash
     kubectl rollout undo deployment/api-service
     ```
     Verify → Monitor → Confirm resolution (error rate < 1%, latency normal, no alerts)
  2. **Scale Up Resources**: For high CPU/memory, connection pool exhaustion
     ```bash
     # Scale to 1.5x the current replica count
     current=$(kubectl get deployment/api-service -o jsonpath='{.spec.replicas}')
     kubectl scale deployment/api-service --replicas=$((current * 3 / 2))
     ```
  3. **Restart Service**: For memory leaks, stuck connections, cache corruption
     ```bash
     kubectl rollout restart deployment/api-service
     ```
  4. **Feature Flag / Circuit Breaker**: For specific feature errors or external dependency failures
     ```bash
     kubectl set env deployment/api-service FEATURE_NAME=false
     ```
  5. **Database Remediation**: For database connections, slow queries, pool exhaustion
     ```sql
     -- Kill long-running queries, restart connection pool, increase pool size
     ```

Universal verification checklist:
  • Error rate < 1%
  • Latency P99 < threshold
  • Throughput at baseline
  • Resource usage healthy (CPU < 70%, Memory < 80%)
  • Dependencies healthy
  • User-facing tests pass
  • No active alerts

Rollback procedure: If a resolution worsens the situation → pause/cancel → revert → reassess

**Expected:** Resolution steps are clear, include verification checks, and provide a rollback option for each action.

**On failure:**
  • Add more granular steps for complex procedures
  • Include screenshots or diagrams for multi-step processes
  • Document command outputs (expected vs. actual)
  • Create a separate runbook for complex resolution procedures
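The verification checklist can be partially automated so "confirm it's fixed" is not left to memory. A minimal sketch, assuming the error-rate, CPU, and memory percentages have already been fetched from Prometheus and are passed in as plain numbers; the `verify_recovery` name is illustrative.

```shell
# Sketch: check post-resolution metrics against the checklist thresholds.
# Callers are assumed to fetch the percentages from Prometheus first.
verify_recovery() {
  err_pct="$1"; cpu_pct="$2"; mem_pct="$3"
  fail=0
  [ "$(awk -v v="$err_pct" 'BEGIN{print ((v < 1) ? 1 : 0)}')" -eq 1 ] \
    || { echo "FAIL: error rate ${err_pct}% >= 1%"; fail=1; }
  [ "$(awk -v v="$cpu_pct" 'BEGIN{print ((v < 70) ? 1 : 0)}')" -eq 1 ] \
    || { echo "FAIL: CPU ${cpu_pct}% >= 70%"; fail=1; }
  [ "$(awk -v v="$mem_pct" 'BEGIN{print ((v < 80) ? 1 : 0)}')" -eq 1 ] \
    || { echo "FAIL: memory ${mem_pct}% >= 80%"; fail=1; }
  [ "$fail" -eq 0 ] && echo "PASS: resolution verified" || return 1
}

verify_recovery 0.4 55 62   # → PASS: resolution verified
```

A nonzero exit status means at least one threshold failed, so the helper composes with `&&` in larger scripts.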

Step 4: Establish Escalation Paths


See Extended Examples for full escalation levels and contact directory template.
Define when and how to escalate incidents.
When to escalate immediately:
  • Customer-facing outage > 15 minutes
  • SLO error budget > 10% depleted
  • Data loss/corruption or security breach suspected
  • Unable to identify root cause within 20 minutes
  • Mitigation attempts fail or worsen situation
Five escalation levels:
  1. **Primary On-Call** (5 min response): Deploy fixes, rollback, scale (up to 30 min solo)
  2. **Secondary On-Call** (auto after 15 min): Additional investigation support
  3. **Team Lead** (architectural decisions): Database changes, vendor escalation, incidents > 1 hour
  4. **Incident Commander** (cross-team coordination): Multiple teams, customer comms, incidents > 2 hours
  5. **Executive** (C-level): Major impact (> 50% of users), SLA breach, media/PR, outages > 4 hours
Escalation process:
  1. Notify target with: current status, impact, actions taken, help needed, dashboard link
  2. Handoff if needed: share timeline, actions, access, remain available
  3. Don't go silent: update every 15 min, ask questions, provide feedback
Contact directory: Maintain table with role, Slack, phone, PagerDuty for:
  • Platform/Database/Security/Network teams
  • Incident Commander
  • External vendors (AWS, database vendor, CDN provider)
**Expected:** Clear criteria for escalation, contact information readily accessible, and escalation paths aligned with the organizational structure.

**On failure:**
  • Validate that contact information is current (test quarterly)
  • Add a decision tree for when to escalate
  • Include examples of escalation messages
  • Document response-time expectations for each level
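As a rough aid, the duration thresholds above can be encoded in a helper that suggests the current escalation level. This is a deliberate simplification — real escalation also weighs impact, not just elapsed time — and the `escalation_level` name is illustrative.

```shell
# Sketch: suggest an escalation level from elapsed minutes only.
# Real escalation also weighs impact (user %, data loss, SLA risk).
escalation_level() {
  mins="$1"
  if   [ "$mins" -le 30 ];  then echo "L1: Primary on-call"
  elif [ "$mins" -le 60 ];  then echo "L2: Secondary on-call"
  elif [ "$mins" -le 120 ]; then echo "L3: Team lead"
  elif [ "$mins" -le 240 ]; then echo "L4: Incident commander"
  else                           echo "L5: Executive"
  fi
}

escalation_level 90   # → L3: Team lead
```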

Step 5: Create Communication Templates


See Extended Examples for all internal and external templates with full formatting.
Provide pre-written messages for incident updates.
Internal templates (Slack #incident-response):
  1. Initial Declaration:
    🚨 INCIDENT: [Title] | Severity: [Critical/High/Medium]
    Impact: [users/services] | Owner: @username | Dashboard: [link]
    Quick Summary: [1-2 sentences] | Next update: 15 min
  2. Progress Update (every 15-30 min):
    📊 UPDATE #N | Status: [Investigating/Mitigating/Monitoring]
    Actions: [what we tried and outcomes]
    Theory: [what we think is happening]
    Next: [planned actions]
  3. Mitigation Complete:
    ✅ MITIGATION | Metrics: Error [before→after], Latency [before→after]
    Root Cause: [brief or "investigating"] | Monitoring 30min before resolved
  4. Resolution:
    🎉 RESOLVED | Duration: [time] | Root Cause + Impact + Follow-up actions
  5. False Alarm: No impact, no follow-up needed
External templates (status page):
  • Initial: Investigating, started time, next update in 15 min
  • Progress: Identified cause (customer-friendly), implementing fix, estimated resolution
  • Resolution: Resolved time, root cause (simple), duration, prevention measures
Customer email template: Timeline, impact description, resolution, prevention, compensation (if applicable)
**Expected:** Templates save time during incidents, ensure consistent communication, and reduce cognitive load on responders.

**On failure:**
  • Customize templates to match the company's communication style
  • Pre-fill templates for common incident types
  • Create a Slack workflow/bot to populate templates automatically
  • Review templates during incident retrospectives
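To reduce typing under pressure, the initial-declaration template can be generated from variables. A sketch: the `incident_declaration` function name is an assumption, and the actual Slack posting step (webhook or bot) is intentionally left out.

```shell
# Sketch: fill the initial-declaration template from arguments.
# Posting to Slack (webhook/bot) is intentionally left out.
incident_declaration() {
  title="$1"; severity="$2"; impact="$3"; owner="$4"; dashboard="$5"
  cat <<EOF
🚨 INCIDENT: $title | Severity: $severity
Impact: $impact | Owner: @$owner | Dashboard: $dashboard
Quick Summary: [1-2 sentences] | Next update: 15 min
EOF
}

incident_declaration "Checkout API 5xx spike" High "checkout flow degraded" \
  alice "https://grafana.example.com/d/service-overview"
```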

Step 6: Link Runbook to Monitoring


See Extended Examples for complete Prometheus alert configuration and Grafana dashboard JSON.
Integrate the runbook with alerts and dashboards.

Add runbook links to Prometheus alert annotations:
```yaml
- alert: HighErrorRate
  annotations:
    runbook_url: "https://wiki.example.com/runbooks/high-error-rate"
    dashboard_url: "https://grafana.example.com/d/service-overview"
    incident_channel: "#incident-platform"
```
Embed quick diagnostic links in runbook:
  • Service Overview Dashboard
  • Error Rate Last 1h (Prometheus direct link)
  • Recent Error Logs (Loki/Grafana Explore)
  • Recent Deployments (GitHub/CI)
  • PagerDuty Incidents
Create Grafana dashboard panel with runbook links (markdown panel listing all incident runbooks with on-call and escalation info)
**Expected:** Responders can access runbooks directly from alerts or dashboards, diagnostic queries are pre-filled, and relevant tools are one click away.

**On failure:**
  • Verify runbook URLs are accessible without VPN/login
  • Use URL shorteners for complex Grafana/Prometheus links
  • Test links quarterly to ensure they don't break
  • Create browser bookmarks for frequently used runbooks
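The quarterly link test can be scripted by first extracting every `runbook_url` from the Prometheus rule files. A sketch using plain grep/sed; the function name and rules-directory argument are assumptions. Pipe the output into `curl` or a link checker.

```shell
# Sketch: list unique runbook_url values from Prometheus rule files,
# ready to feed into a link checker for the quarterly test.
extract_runbook_urls() {
  grep -rho 'runbook_url: *"[^"]*"' "$1" \
      --include='*.yml' --include='*.yaml' \
    | sed 's/runbook_url: *"\(.*\)"/\1/' \
    | sort -u
}

# Example: extract_runbook_urls rules/
```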

Validation


  • Runbook follows consistent template structure
  • Diagnostic procedures include specific queries and expected values
  • Resolution steps are actionable with clear commands
  • Escalation criteria and contacts are current
  • Communication templates provided for internal and external audiences
  • Runbook linked from monitoring alerts and dashboards
  • Runbook tested during incident simulation or actual incident
  • Feedback from responders incorporated into runbook
  • Revision history tracked with dates and authors
  • Runbook accessible without authentication (or cached offline)

Common Pitfalls


  • Too generic: Runbooks with vague steps like "check the logs" without specific queries are not actionable. Be specific.
  • Outdated information: Runbooks referencing old systems or commands become useless. Review quarterly.
  • No verification steps: Resolution without verification leads to false positives. Always include "how to confirm it's fixed."
  • Missing rollback procedures: Every action should have a rollback plan. Don't trap responders in worse state.
  • Assuming knowledge: Runbooks for experts only exclude junior engineers. Write for the least experienced person on rotation.
  • No ownership: Runbooks without owners become stale. Assign team/person responsible for updates.
  • Hidden behind auth: Runbooks inaccessible during VPN/SSO issues are useless during crisis. Cache copies or use public wiki.

Related Skills


  • configure-alerting-rules
    - Link runbooks to alert annotations for immediate access during incidents
  • build-grafana-dashboards
    - Embed runbook links in dashboards and diagnostic panels
  • setup-prometheus-monitoring
    - Include diagnostic queries from Prometheus in runbook procedures
  • define-slo-sli-sla
    - Reference SLO impact in incident severity classification