Write Incident Runbook
Create actionable runbooks that guide responders through incident diagnosis and resolution.
When to Use
- Documenting response procedures for recurring alerts or incidents
- Standardizing incident response across on-call rotation members
- Reducing mean time to resolution (MTTR) with clear diagnostic steps
- Creating training materials for new team members on incident handling
- Establishing escalation paths and communication protocols
- Migrating tribal knowledge to written documentation
- Linking alerts to resolution procedures (alert annotations)
Inputs
- Required: Incident or alert name/description
- Required: Historical incident data and resolution patterns
- Optional: Diagnostic queries (Prometheus, logs, traces)
- Optional: Escalation contacts and communication channels
- Optional: Previous incident post-mortems
Procedure
Step 1: Choose Runbook Template Structure
See Extended Examples for complete template files.
Select an appropriate template based on incident type and complexity.
Basic runbook template structure:
```markdown
# [Alert/Incident Name] Runbook

## Overview | Severity | Symptoms
## Diagnostic Steps | Resolution Steps
## Escalation | Communication | Prevention | Related
```
**Advanced SRE runbook template** (excerpt):

```markdown
# [Service Name] - [Incident Type] Runbook

## Metadata
- Service, Owner, Severity, On-Call, Last Updated

## Diagnostic Phase
Quick Health Check (< 5 min): Dashboard, error rate, deployments
Detailed Investigation (5-20 min): Metrics, logs, traces, failure patterns

... (see EXAMPLES.md for complete template)
```
Key template components:
- **Metadata**: Service ownership, severity, on-call rotation
- **Diagnostic Phase**: Quick checks → detailed investigation → failure patterns
- **Resolution Phase**: Immediate mitigation → root cause fix → verification
- **Escalation**: Criteria and contact paths
- **Communication**: Internal/external templates
- **Prevention**: Short/long-term actions
**Expected:** Template selected matches incident complexity, sections appropriate for service type.
**On failure:**
- Start with basic template, iterate based on incident patterns
- Review industry examples (Google SRE books, vendor runbooks)
- Adapt template based on team feedback after first use
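To make the chosen template repeatable, the skeleton can be scaffolded with a small script. A minimal sketch, assuming a `runbooks/` output directory and the basic section layout from this step (the function name is illustrative):

```shell
#!/usr/bin/env bash
# Hypothetical scaffold: writes a new runbook skeleton using the basic
# template from this step. The runbooks/ directory is an assumption.
set -euo pipefail

new_runbook() {
  local name="$1"
  local out="runbooks/${name}.md"
  mkdir -p runbooks
  cat > "$out" <<EOF
# ${name} Runbook

## Overview | Severity | Symptoms
## Diagnostic Steps
## Resolution Steps
## Escalation | Communication
## Prevention | Related
EOF
  echo "$out"
}

new_runbook "high-error-rate"   # prints runbooks/high-error-rate.md
```

Running it once per recurring alert keeps every runbook starting from the same section layout, which makes the later steps easier to review.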
Step 2: Document Diagnostic Procedures
See Extended Examples for complete diagnostic queries and decision trees.
Create step-by-step investigation procedures with specific queries.
Six-step diagnostic checklist:
1. **Verify Service Health**: Health endpoint checks and uptime metrics
   ```bash
   curl -I https://api.example.com/health  # Expected: HTTP 200 OK
   ```
   ```promql
   up{job="api-service"}  # Expected: 1 for all instances
   ```
2. **Check Error Rate**: Current error percentage and breakdown by endpoint
   ```promql
   sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100  # Expected: < 1%
   ```
3. **Analyze Logs**: Recent errors and top error messages from Loki
   ```logql
   {job="api-service"} |= "error" | json | level="error"
   ```
4. **Check Resource Utilization**: CPU, memory, and connection pool status
   ```promql
   avg(rate(container_cpu_usage_seconds_total{pod=~"api-service.*"}[5m])) * 100  # Expected: < 70%
   ```
5. **Review Recent Changes**: Deployments, git commits, infrastructure changes
6. **Examine Dependencies**: Downstream service health, database/API latency
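The error-rate check is just the 5xx rate divided by the total rate; the same arithmetic, sketched offline for sanity-checking dashboard numbers (the helper name is illustrative):

```shell
# Mirrors the PromQL error-rate expression: 5xx request rate / total
# request rate, as a percentage. Purely an illustrative helper.
error_rate_pct() {
  local errors="$1" total="$2"
  awk -v e="$errors" -v t="$total" 'BEGIN { printf "%.2f\n", (t > 0) ? e / t * 100 : 0 }'
}

error_rate_pct 5 1000   # prints 0.50 (under the 1% threshold)
```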
Failure pattern decision tree (excerpt):
- Service down? → Check all pods/instances
- Error rate elevated? → Check specific error types (5xx, gateway, database, timeouts)
- When did it start? → After deployment (rollback), gradual (resource leak), sudden (traffic/dependency)
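The decision tree above can be encoded so its branching is explicit and reviewable; a minimal sketch, with hypothetical inputs (error rate as an integer percent) and an illustrative function name:

```shell
# Hypothetical triage helper encoding the decision tree above.
# Args: service_up (yes/no), error_rate_pct (integer), started_after_deploy (yes/no).
triage() {
  local service_up="$1" error_rate_pct="$2" after_deploy="$3"
  if [ "$service_up" = "no" ]; then
    echo "check all pods/instances"
  elif [ "$error_rate_pct" -gt 1 ]; then
    if [ "$after_deploy" = "yes" ]; then
      echo "likely bad deployment: prepare rollback"
    else
      echo "inspect error types (5xx, gateway, database, timeouts)"
    fi
  else
    echo "no clear pattern: continue detailed investigation"
  fi
}

triage no 0 no    # prints: check all pods/instances
triage yes 5 yes  # prints: likely bad deployment: prepare rollback
```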
**Expected:** Diagnostic procedures are specific, include expected vs actual values, guide responder through investigation.
**On failure:**
- Test queries in actual monitoring system before documenting
- Include screenshots of dashboards for visual reference
- Add "Common mistakes" section for frequently missed steps
- Iterate based on feedback from incident responders
Step 3: Define Resolution Procedures
See Extended Examples for all 5 resolution options with full commands and rollback procedures.
Document step-by-step remediation with rollback options.
Five resolution options (brief summary):
1. **Rollback Deployment** (fastest): For post-deployment errors
   ```bash
   kubectl rollout undo deployment/api-service
   ```
   Verify → Monitor → Confirm resolution (error rate < 1%, latency normal, no alerts)
2. **Scale Up Resources**: For high CPU/memory, connection pool exhaustion
   ```bash
   kubectl scale deployment/api-service --replicas=$((current * 3/2))
   ```
3. **Restart Service**: For memory leaks, stuck connections, cache corruption
   ```bash
   kubectl rollout restart deployment/api-service
   ```
4. **Feature Flag / Circuit Breaker**: For specific feature errors or external dependency failures
   ```bash
   kubectl set env deployment/api-service FEATURE_NAME=false
   ```
5. **Database Remediation**: For database connections, slow queries, pool exhaustion
   ```sql
   -- Kill long-running queries, restart connection pool, increase pool size
   ```
Universal verification checklist:
- Error rate < 1%
- Latency P99 < threshold
- Throughput at baseline
- Resource usage healthy (CPU < 70%, Memory < 80%)
- Dependencies healthy
- User-facing tests pass
- No active alerts
Rollback procedure: If resolution worsens situation → pause/cancel → revert → reassess
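The checklist above can double as an automated gate before an incident is marked resolved. A sketch, assuming the metric values have already been fetched and are passed in as numbers; the thresholds mirror the checklist and the function name is hypothetical:

```shell
# Hypothetical gate implementing part of the verification checklist.
# Args: error %, CPU %, memory %, count of active alerts.
verify_resolution() {
  local error_pct="$1" cpu_pct="$2" mem_pct="$3" alerts="$4" ok=1
  awk -v e="$error_pct" 'BEGIN { exit !(e < 1) }'  || { echo "FAIL: error rate ${error_pct}% >= 1%"; ok=0; }
  awk -v c="$cpu_pct"   'BEGIN { exit !(c < 70) }' || { echo "FAIL: CPU ${cpu_pct}% >= 70%"; ok=0; }
  awk -v m="$mem_pct"   'BEGIN { exit !(m < 80) }' || { echo "FAIL: memory ${mem_pct}% >= 80%"; ok=0; }
  [ "$alerts" -eq 0 ] || { echo "FAIL: ${alerts} active alerts"; ok=0; }
  if [ "$ok" -eq 1 ]; then echo "PASS: safe to resolve"; else return 1; fi
}

verify_resolution 0.4 55 60 0   # prints: PASS: safe to resolve
```

Latency, throughput, dependency health, and user-facing tests still need their own checks; the point is that each checklist item becomes a pass/fail line rather than a judgment call.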
**Expected:** Resolution steps are clear, include verification checks, provide rollback options for each action.
**On failure:**
- Add more granular steps for complex procedures
- Include screenshots or diagrams for multi-step processes
- Document command outputs (expected vs actual)
- Create separate runbook for complex resolution procedures
Step 4: Establish Escalation Paths
See Extended Examples for full escalation levels and contact directory template.
Define when and how to escalate incidents.
When to escalate immediately:
- Customer-facing outage > 15 minutes
- SLO error budget > 10% depleted
- Data loss/corruption or security breach suspected
- Unable to identify root cause within 20 minutes
- Mitigation attempts fail or worsen situation
Five escalation levels:
- Primary On-Call (5 min response): Deploy fixes, rollback, scale (up to 30 min solo)
- Secondary On-Call (auto after 15 min): Additional investigation support
- Team Lead (architectural decisions): Database changes, vendor escalation, incidents > 1 hour
- Incident Commander (cross-team coord): Multiple teams, customer comms, incidents > 2 hours
- Executive (C-level): Major impact (>50% users), SLA breach, media/PR, outages > 4 hours
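The duration-based triggers above reduce to a simple lookup; a sketch covering only the time thresholds (impact-based triggers such as data loss or >50% of users still require human judgment):

```shell
# Hypothetical mapping from incident duration (minutes) to the
# escalation level described above. Time-based triggers only.
escalation_level() {
  local minutes="$1"
  if   [ "$minutes" -ge 240 ]; then echo "Executive"
  elif [ "$minutes" -ge 120 ]; then echo "Incident Commander"
  elif [ "$minutes" -ge 60  ]; then echo "Team Lead"
  elif [ "$minutes" -ge 15  ]; then echo "Secondary On-Call"
  else                              echo "Primary On-Call"
  fi
}

escalation_level 10    # prints: Primary On-Call
escalation_level 150   # prints: Incident Commander
```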
Escalation process:
- Notify target with: current status, impact, actions taken, help needed, dashboard link
- Handoff if needed: share timeline, actions, access, remain available
- Don't go silent: update every 15 min, ask questions, provide feedback
Contact directory: Maintain table with role, Slack, phone, PagerDuty for:
- Platform/Database/Security/Network teams
- Incident Commander
- External vendors (AWS, database vendor, CDN provider)
**Expected:** Clear criteria for escalation, contact information readily accessible, escalation paths aligned with organizational structure.
**On failure:**
- Validate contact information is current (test quarterly)
- Add decision tree for when to escalate
- Include examples of escalation messages
- Document response time expectations for each level
Step 5: Create Communication Templates
See Extended Examples for all internal and external templates with full formatting.
Provide pre-written messages for incident updates.
Internal templates (Slack #incident-response):
- **Initial Declaration**:
  ```
  🚨 INCIDENT: [Title] | Severity: [Critical/High/Medium]
  Impact: [users/services] | Owner: @username | Dashboard: [link]
  Quick Summary: [1-2 sentences] | Next update: 15 min
  ```
- **Progress Update** (every 15-30 min):
  ```
  📊 UPDATE #N | Status: [Investigating/Mitigating/Monitoring]
  Actions: [what we tried and outcomes]
  Theory: [what we think is happening]
  Next: [planned actions]
  ```
- **Mitigation Complete**:
  ```
  ✅ MITIGATION | Metrics: Error [before→after], Latency [before→after]
  Root Cause: [brief or "investigating"] | Monitoring 30 min before resolved
  ```
- **Resolution**:
  ```
  🎉 RESOLVED | Duration: [time] | Root Cause + Impact + Follow-up actions
  ```
- **False Alarm**: No impact, no follow-up needed
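Templates can also be filled programmatically to cut typing mid-incident; a sketch for the Initial Declaration message, where the function name and argument list are illustrative:

```shell
# Hypothetical generator for the Initial Declaration template above.
# Args: title, severity, impact, owner (Slack handle), dashboard URL.
declare_incident() {
  local title="$1" severity="$2" impact="$3" owner="$4" dashboard="$5"
  printf '🚨 INCIDENT: %s | Severity: %s\n' "$title" "$severity"
  printf 'Impact: %s | Owner: @%s | Dashboard: %s\n' "$impact" "$owner" "$dashboard"
  printf 'Next update: 15 min\n'
}

declare_incident "API 5xx spike" High "checkout flow" alice "https://grafana.example.com/d/service-overview"
```

A Slack workflow or bot (as suggested under "On failure" below) can wrap the same substitution logic.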
External templates (status page):
- Initial: Investigating, started time, next update in 15 min
- Progress: Identified cause (customer-friendly), implementing fix, estimated resolution
- Resolution: Resolved time, root cause (simple), duration, prevention measures
Customer email template: Timeline, impact description, resolution, prevention, compensation (if applicable)
**Expected:** Templates save time during incidents, ensure consistent communication, reduce cognitive load on responders.
**On failure:**
- Customize templates to match company communication style
- Pre-fill templates with common incident types
- Create Slack workflow/bot to populate templates automatically
- Review templates during incident retrospectives
Step 6: Link Runbook to Monitoring
See Extended Examples for complete Prometheus alert configuration and Grafana dashboard JSON.
Integrate runbook with alerts and dashboards.
Add runbook links to Prometheus alerts:
```yaml
- alert: HighErrorRate
  annotations:
    runbook_url: "https://wiki.example.com/runbooks/high-error-rate"
    dashboard_url: "https://grafana.example.com/d/service-overview"
    incident_channel: "#incident-platform"
```

Embed quick diagnostic links in runbook:
- Service Overview Dashboard
- Error Rate Last 1h (Prometheus direct link)
- Recent Error Logs (Loki/Grafana Explore)
- Recent Deployments (GitHub/CI)
- PagerDuty Incidents
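To catch alerts that ship without a runbook link, the annotation shown above can be linted in CI. A naive sketch that text-scans rule files with awk; a real YAML parser would be more robust, and the function name is illustrative:

```shell
# Hypothetical CI lint: print any Prometheus alert in a rules file
# that lacks a runbook_url annotation. Naive text scan, not a YAML parser.
missing_runbooks() {
  awk '
    /- alert:/     { if (alert != "" && !seen) print alert; alert = $3; seen = 0 }
    /runbook_url:/ { seen = 1 }
    END            { if (alert != "" && !seen) print alert }
  ' "$1"
}
```

Run it against each rules file; any output names an alert that will fire without a linked runbook.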
Create Grafana dashboard panel with runbook links (markdown panel listing all incident runbooks with on-call and escalation info)
**Expected:** Responders can access runbooks directly from alerts or dashboards, diagnostic queries pre-filled, one-click access to relevant tools.
**On failure:**
- Verify runbook URLs are accessible without VPN/login
- Use URL shorteners for complex Grafana/Prometheus links
- Test links quarterly to ensure they don't break
- Create browser bookmarks for frequently used runbooks
Validation
- Runbook follows consistent template structure
- Diagnostic procedures include specific queries and expected values
- Resolution steps are actionable with clear commands
- Escalation criteria and contacts are current
- Communication templates provided for internal and external audiences
- Runbook linked from monitoring alerts and dashboards
- Runbook tested during incident simulation or actual incident
- Feedback from responders incorporated into runbook
- Revision history tracked with dates and authors
- Runbook accessible without authentication (or cached offline)
Common Pitfalls
- Too generic: Runbooks with vague steps like "check the logs" without specific queries are not actionable. Be specific.
- Outdated information: Runbooks referencing old systems or commands become useless. Review quarterly.
- No verification steps: Resolution without verification leads to false positives. Always include "how to confirm it's fixed."
- Missing rollback procedures: Every action should have a rollback plan. Don't trap responders in worse state.
- Assuming knowledge: Runbooks for experts only exclude junior engineers. Write for the least experienced person on rotation.
- No ownership: Runbooks without owners become stale. Assign team/person responsible for updates.
- Hidden behind auth: Runbooks inaccessible during VPN/SSO issues are useless during crisis. Cache copies or use public wiki.
Related Skills
- configure-alerting-rules - Link runbooks to alert annotations for immediate access during incidents
- build-grafana-dashboards - Embed runbook links in dashboards and diagnostic panels
- setup-prometheus-monitoring - Include diagnostic queries from Prometheus in runbook procedures
- define-slo-sli-sla - Reference SLO impact in incident severity classification