incident-response

Incident Response

Severity Levels

Assign a severity level immediately upon incident detection. Severity determines response urgency, communication cadence, and escalation path.

| Severity | Name | Description | Examples | Response Time | Update Cadence | Responders |
|---|---|---|---|---|---|---|
| P0 | Total Outage | Complete service unavailability. All users affected. Revenue-impacting. | Site completely down, data corruption, security breach with active exploitation | < 15 minutes | Every 15 minutes | All hands on deck, executive notification |
| P1 | Major Degradation | Core functionality severely impaired. Large portion of users affected. | Payment processing broken, authentication failing for most users, major data pipeline stalled | < 30 minutes | Every 30 minutes | On-call engineer + team lead, stakeholder notification |
| P2 | Partial Impact | Non-core functionality broken or core functionality degraded for a subset of users. | Search feature down, slow responses in one region, intermittent errors for some users | < 2 hours | Every 2 hours | On-call engineer |
| P3 | Minor Issue | Cosmetic issues, minor bugs, or issues with workarounds available. | UI glitch, non-critical background job delayed, minor data inconsistency | Next business day | Daily (if ongoing) | Assigned engineer |

Escalation Rules

  • If a P2 is not resolved within 4 hours, escalate to P1.
  • If a P1 is not resolved within 2 hours, escalate to P0.
  • Any incident involving data breach or security compromise is automatically P0.
  • When in doubt, over-classify. It is better to downgrade than to under-respond.

Triage Procedure

When an incident is detected (alert, user report, or monitoring), follow this triage sequence:

Step 1: Assess Impact

Answer these questions within the first 5 minutes (a query sketch follows the list):
  • Who is affected? (All users, subset, internal only)
  • What is broken? (Specific feature, entire service, data integrity)
  • When did it start? (Check monitoring for onset time)
  • Is it getting worse? (Error rate trending up, stable, or recovering)
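
To answer "when did it start?" and "is it getting worse?" without hunting through dashboards, a range query against your metrics backend works well. A minimal sketch, assuming a Prometheus-compatible API at the `monitoring.example.com` placeholder used in the runbook example later in this guide, and a hypothetical `http_errors_total` counter:

```bash
# Sketch: error-rate trend over the last hour at 1-minute resolution (GNU date).
# Endpoint and metric name are placeholders -- adapt to your stack.
curl -sG 'https://monitoring.example.com/api/v1/query_range' \
  --data-urlencode 'query=sum(rate(http_errors_total[5m]))' \
  --data-urlencode "start=$(date -u -d '1 hour ago' +%s)" \
  --data-urlencode "end=$(date -u +%s)" \
  --data-urlencode 'step=60' \
  | jq -r '.data.result[0].values[] | @tsv'
```

A rising last column means the incident is still getting worse; the first timestamp where the rate departs from baseline is your onset time.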

Step 2: Assign Severity

Based on the impact assessment, assign a severity level using the table above. Document the severity and reasoning.

Step 3: Assemble the Response Team

| Severity | Who to Notify |
|---|---|
| P0 | On-call engineer, engineering manager, VP Engineering, customer support lead, communications |
| P1 | On-call engineer, engineering manager, customer support lead |
| P2 | On-call engineer, relevant team lead |
| P3 | Assigned engineer (via ticket) |

Step 4: Designate Roles

For P0 and P1 incidents, explicitly assign these roles:

| Role | Responsibility |
|---|---|
| Incident Commander (IC) | Coordinates response, makes decisions, delegates tasks. Does NOT debug. |
| Technical Lead | Leads investigation and mitigation. Communicates findings to IC. |
| Communications Lead | Drafts and sends status updates to stakeholders and customers. |
| Scribe | Documents timeline, actions taken, and decisions in real time. |

Step 5: Open Incident Channel

Create a dedicated communication channel (Slack channel, incident bridge call); a scripted sketch follows this list:
  • Name it clearly:
    #incident-2024-03-15-payments-down
  • Pin the initial assessment and severity level.
  • All discussion and decisions happen in this channel.
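
Channel setup can be scripted so nobody improvises naming under pressure. A minimal sketch using the Slack Web API (`conversations.create`, `chat.postMessage`, `pins.add`), assuming a bot token in `SLACK_TOKEN` with the appropriate scopes; the topic suffix and the pinned text are placeholders:

```bash
# Sketch: create the incident channel, post the initial assessment, pin it.
NAME="incident-$(date -u +%Y-%m-%d)-payments-down"   # topic suffix is a placeholder
CHANNEL=$(curl -s -X POST 'https://slack.com/api/conversations.create' \
  -H "Authorization: Bearer ${SLACK_TOKEN}" \
  -d "name=${NAME}" | jq -r '.channel.id')
TS=$(curl -s -X POST 'https://slack.com/api/chat.postMessage' \
  -H "Authorization: Bearer ${SLACK_TOKEN}" \
  -d "channel=${CHANNEL}" \
  --data-urlencode 'text=P1 declared: payments failing. Initial assessment pinned here.' \
  | jq -r '.ts')
curl -s -X POST 'https://slack.com/api/pins.add' \
  -H "Authorization: Bearer ${SLACK_TOKEN}" \
  -d "channel=${CHANNEL}" -d "timestamp=${TS}" > /dev/null
```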

Communication Templates

Initial Alert

INCIDENT DECLARED: [P0/P1/P2] - [Brief description]

Impact: [Who is affected and how]
Start time: [When it began, UTC]
Current status: [Investigating / Identified / Mitigating]
Incident Commander: [Name]
Channel: #incident-[date]-[topic]

Next update in [15/30] minutes.

Status Update

INCIDENT UPDATE: [P0/P1/P2] - [Brief description]

Status: [Investigating / Identified / Mitigating / Resolved]
Duration: [Time since start]
Current state: [What is happening now]
Actions taken: [What has been tried]
Next steps: [What will be done next]

Next update in [15/30] minutes.

Resolution Notification

INCIDENT RESOLVED: [P0/P1/P2] - [Brief description]

Duration: [Total time from detection to resolution]
Root cause: [One-sentence summary]
Resolution: [What fixed it]
Customer impact: [Summary of user-facing impact]
Follow-up: Post-incident review scheduled for [date/time].

Monitoring for recurrence.

Customer-Facing Communication

[Service Name] Status Update

We are aware of an issue affecting [description of user impact].
Our team is actively working to resolve this.

Started: [Time, timezone]
Current status: [Brief, non-technical description]

We will provide updates every [30 minutes / 1 hour].
We apologize for the inconvenience.

Investigation Methodology

Follow this structured approach instead of checking components at random. Work from symptoms toward the root cause.

Step 1: Check Dashboards

Start with the service overview dashboard (a query sketch follows this list). Look for:
  • Error rate spikes (when exactly did they start?)
  • Latency increases (gradual degradation or sudden jump?)
  • Traffic anomalies (unexpected spike or drop?)
  • Resource utilization (CPU, memory, disk, connections)
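
If the dashboard is unavailable, or you want raw numbers in the incident channel, the same signals can be pulled directly from the metrics API. A sketch assuming Prometheus-style metrics; the metric names below are common conventions, not guaranteed to match yours:

```bash
# Error rate, p99 latency, traffic, and memory saturation in one pass.
for q in \
  'sum(rate(http_requests_total{status=~"5.."}[5m]))' \
  'histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))' \
  'sum(rate(http_requests_total[5m]))' \
  'avg(node_memory_Active_bytes / node_memory_MemTotal_bytes)'
do
  echo "== ${q}"
  curl -sG 'https://monitoring.example.com/api/v1/query' \
    --data-urlencode "query=${q}" | jq -c '.data.result'
done
```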

Step 2: Check Recent Deploys

Most incidents are caused by recent changes. Check:

```bash
# What was deployed recently?
git log --oneline --since="2 hours ago" origin/main

# Any recent infrastructure changes?
# Check deployment pipeline history, Terraform runs, config changes
```

Questions to answer:

- Was anything deployed in the last 2 hours?
- Was there a configuration change (feature flags, environment variables)?
- Was there an infrastructure change (scaling, migration, certificate renewal)?

Step 3: Examine Logs

Search logs filtered to the incident timeframe:

```
# Example queries (adapt to your log aggregation platform)

ELK/Kibana:  level:ERROR AND service:order-service AND @timestamp >= "2024-03-15T14:00:00"
Datadog:     service:order-service status:error
CloudWatch:  filter @message like /ERROR/ | sort @timestamp desc
```


Look for the following (a CLI sketch follows the list):

- Error messages with stack traces.
- Repeated error patterns (same error thousands of times).
- New error types that were not present before the incident.
- Correlation between errors and the timeline.

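
When logs live in CloudWatch, the repeated-pattern check can be done from the CLI. A sketch assuming a configured AWS CLI; the log group is a placeholder and the start/end times are epoch milliseconds for the incident window:

```bash
# Count distinct ERROR messages in the window, most frequent first.
aws logs filter-log-events \
  --log-group-name /ecs/order-service \
  --filter-pattern ERROR \
  --start-time 1710511200000 \
  --end-time 1710514800000 \
  --query 'events[].message' --output json \
  | jq -r '.[]' | sort | uniq -c | sort -rn | head -20
```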

Step 4: Hypothesize and Test

Based on the data gathered, form a hypothesis and test it (a few ready-made probes are sketched after the table):

| Hypothesis | How to Test |
|---|---|
| Bad deploy caused it | Compare error timeline with deploy timestamp. Roll back and observe. |
| Database is overloaded | Check connection pool, slow query log, lock contention. |
| External dependency is down | Check dependency status page, test connectivity, check timeout rates. |
| Traffic spike overwhelmed the service | Check request rate, compare to normal baseline, check auto-scaling. |
| DNS or certificate issue | Test DNS resolution, check certificate expiry, verify SSL handshake. |
| Memory leak | Check memory usage trend, look for OOM kills in system logs. |
| Data corruption | Query for inconsistent data, check recent migration or backfill jobs. |
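
Several of these tests reduce to one-liners worth keeping handy. A sketch with placeholder hosts; adapt each to your environment:

```bash
# DNS or certificate issue: does the name resolve, and when does the cert expire?
dig +short api.example.com
echo | openssl s_client -connect api.example.com:443 -servername api.example.com 2>/dev/null \
  | openssl x509 -noout -enddate

# External dependency down: does its health endpoint answer within 2 seconds?
curl -s -o /dev/null -w '%{http_code}\n' -m 2 https://payments.example.com/healthz

# Memory leak: any OOM kills in the kernel log during the incident window?
journalctl -k --since "2 hours ago" | grep -iE 'killed process|out of memory'
```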

Step 5: Verify the Fix

After applying a fix, confirm the following before standing down (a monitoring loop is sketched after this list):
  • Error rate returning to baseline.
  • Latency returning to normal.
  • No new error patterns appearing.
  • Affected functionality manually verified.
  • Monitor for at least 15 minutes (P0/P1) or 30 minutes (P2) before declaring resolved.
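
The monitoring window can be made mechanical rather than "watch the graph for a while". A minimal sketch, polling the same assumed Prometheus-style endpoint once a minute for the 15-minute P0/P1 window:

```bash
# Poll the error rate once a minute for 15 minutes; confirm a return to baseline.
for i in $(seq 15); do
  rate=$(curl -sG 'https://monitoring.example.com/api/v1/query' \
    --data-urlencode 'query=sum(rate(http_errors_total[5m]))' \
    | jq -r '.data.result[0].value[1] // "n/a"')
  echo "$(date -u +%H:%M:%SZ) error rate: ${rate}"
  sleep 60
done
```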

Common Mitigation Actions

When the root cause is identified (or even before, to reduce impact), apply the appropriate mitigation (example commands for several of these follow the table):

| Action | When to Use | How | Risk |
|---|---|---|---|
| Rollback | Bad deploy identified | Revert to previous known-good version via deployment pipeline | May lose new features; verify database compatibility |
| Feature flag toggle | New feature causing issues | Disable the flag in your feature management system | Requires feature flags to be in place |
| Horizontal scaling | Service overwhelmed by traffic | Increase instance count via auto-scaler or manual scaling | Increased cost; may not help if bottleneck is downstream |
| Cache clear | Stale or corrupted cached data | Flush application cache (Redis `FLUSHDB`, CDN purge) | Temporary increase in origin load after flush |
| Circuit breaker | Failing dependency cascading | Activate circuit breaker to fail fast instead of waiting | Gracefully degraded experience for users |
| Traffic shedding | Total overload | Rate limit or redirect traffic, enable maintenance page | Users see errors or degraded service |
| Database failover | Primary database unresponsive | Promote replica to primary (if configured) | Brief downtime during promotion; verify replication lag |
| DNS redirect | Entire region or provider down | Update DNS to point to backup region or provider | Propagation delay (use low TTL proactively) |
| Restart | Process stuck, memory leak | Rolling restart of application instances | Brief capacity reduction during restart |
| Hotfix | Small targeted code fix needed | Fast-track a minimal change through deployment pipeline | Bypasses normal review; must be reviewed post-incident |
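
For the mitigations that map to single commands, examples under the same Kubernetes and Redis assumptions used elsewhere in this guide (resource names are placeholders):

```bash
# Horizontal scaling: bump the replica count.
kubectl scale deployment/order-service --replicas=10

# Restart: rolling restart of application instances.
kubectl rollout restart deployment/order-service

# Cache clear: flush the current Redis DB -- coordinate first; origin load will spike.
redis-cli -h cache.example.com FLUSHDB
```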

Rollback Procedure

```bash
# Verify the last known-good version
git log --oneline -10 origin/main

# Tag the rollback point
git tag -a incident-rollback-2024-03-15 -m "Rolling back due to P1 incident"

# Trigger deployment of previous version
# (Adapt to your deployment pipeline)
# Example: Kubernetes
kubectl rollout undo deployment/order-service

# Verify rollback is deployed
kubectl rollout status deployment/order-service

# Monitor error rate and confirm reduction
```

Post-Incident Review

Conduct a post-incident review (PIR) within 48 hours of resolution for P0/P1 incidents and within one week for P2 incidents.
Post-Incident Review Template

```markdown
# Post-Incident Review: [Incident Title]

Date: [date] | Severity: [P0/P1/P2] | Duration: [time] | Author: [name]

## Summary

[2-3 sentences: what happened, the impact, and the resolution.]

## Timeline (UTC)

| Time | Event |
|---|---|
| 14:00 | Alert fires: order error rate > 5% |
| 14:05 | P1 declared, incident channel created |
| 14:15 | Recent deploy at 13:45 identified as suspect |
| 14:25 | Rollback deployed |
| 14:45 | Resolved, monitoring for recurrence |

## Impact

Users affected, duration, revenue/SLA impact, data impact.

## Root Cause

[Detailed technical explanation of what went wrong and why.]

## Five Whys

1. Why did orders fail? -> Payment validation threw an exception.
2. Why did validation throw? -> Null value for a non-nullable field.
3. Why was the field null? -> Migration added column but did not backfill.
4. Why was backfill missed? -> No checklist step for backfill verification.
5. Why no checklist step? -> Migration procedures were undocumented.

## Action Items

| Action | Owner | Priority | Due Date | Status |
|---|---|---|---|---|
| Add integration test for null field validation | @alice | High | YYYY-MM-DD | TODO |
| Lower alert threshold from 5% to 2% | @bob | High | YYYY-MM-DD | TODO |
| Add feature flag to payment flow | @carol | Medium | YYYY-MM-DD | TODO |

## Lessons Learned

- What went well: [e.g., quick detection, rapid team assembly]
- What could improve: [e.g., rollback automation, test coverage]
```

Blameless Culture Principles

Post-incident reviews are learning opportunities, not blame sessions. Adhere to these principles:

| Principle | Practice |
|---|---|
| Assume good intent | People made the best decisions they could with the information they had at the time. |
| Focus on systems, not individuals | Ask "what allowed this to happen?" not "who caused this?" |
| Separate the what from the who | Describe actions taken without naming individuals in the root cause. Use role titles if context is needed. |
| Reward transparency | Publicly thank people who report incidents, share mistakes, or identify risks. |
| Follow through on action items | PIR action items are tracked and completed. Unfixed systemic issues lead to repeat incidents. |
| Share learnings broadly | Publish PIR summaries (redacted if needed) so other teams learn too. |

Runbook Authoring Guide

A runbook is a step-by-step guide for responding to a specific alert or operational scenario.

Runbook Structure

Every runbook follows this structure:
````markdown
# Runbook: [Alert or Scenario Name]

## When to Use

[Describe the alert, symptom, or scenario that triggers this runbook.]

## Prerequisites

- Access to [systems, dashboards, tools]
- Permissions: [required roles or access levels]

## Steps

### 1. Verify the Problem

```bash
curl -s https://monitoring.example.com/api/v1/query \
  --data-urlencode 'query=rate(http_errors_total{service="order-service"}[5m])'
```

Expected: Error rate below 0.01. If above, continue. If normal, check thresholds and close.

### 2. Apply Mitigation

Option A: Restart

```bash
kubectl rollout restart deployment/order-service
```

Option B: Rollback

```bash
kubectl rollout undo deployment/order-service
```

### 3. Verify Resolution

Expected: Error rate drops below 0.01 within 5 minutes.

### 4. Escalation

If unresolved: escalate to [team], contact [channel/phone], provide [context].
````

Runbook Best Practices

| Practice | Reason |
|---|---|
| Use exact commands, not descriptions | Under stress, responders should copy-paste, not interpret |
| Include expected output | So responders know if the command worked |
| Provide verification after each step | Catch issues early, do not proceed blindly |
| Include a rollback for each step | If a mitigation step makes things worse |
| Test runbooks regularly | Outdated runbooks cause confusion during real incidents |
| Date-stamp and version runbooks | Know when it was last verified |
| Link from alert to runbook | Reduce time-to-runbook to one click |

Incident Response Checklist

Quick reference during an active incident:
  • Impact assessed (who, what, when, trending).
  • Severity assigned and documented.
  • Incident channel or bridge opened.
  • Roles assigned (IC, Technical Lead, Comms, Scribe).
  • Initial stakeholder notification sent.
  • Timeline being recorded in real time.
  • Investigation following structured methodology (dashboards, deploys, logs, hypothesize, test).
  • Mitigation applied and impact reducing.
  • Resolution verified with monitoring data.
  • All-clear communication sent.
  • Post-incident review scheduled (within 48 hours for P0/P1).
  • Action items created with owners and due dates.