incident-response

Incident Response

Severity Levels

Assign a severity level immediately upon incident detection. Severity determines response urgency, communication cadence, and escalation path.

| Severity | Name | Description | Examples | Response Time | Update Cadence | Responders |
|---|---|---|---|---|---|---|
| P0 | Total Outage | Complete service unavailability. All users affected. Revenue-impacting. | Site completely down, data corruption, security breach with active exploitation | < 15 minutes | Every 15 minutes | All hands on deck, executive notification |
| P1 | Major Degradation | Core functionality severely impaired. Large portion of users affected. | Payment processing broken, authentication failing for most users, major data pipeline stalled | < 30 minutes | Every 30 minutes | On-call engineer + team lead, stakeholder notification |
| P2 | Partial Impact | Non-core functionality broken or core functionality degraded for a subset of users. | Search feature down, slow responses in one region, intermittent errors for some users | < 2 hours | Every 2 hours | On-call engineer |
| P3 | Minor Issue | Cosmetic issues, minor bugs, or issues with workarounds available. | UI glitch, non-critical background job delayed, minor data inconsistency | Next business day | Daily (if ongoing) | Assigned engineer |

Escalation Rules

  • If a P2 is not resolved within 4 hours, escalate to P1.
  • If a P1 is not resolved within 2 hours, escalate to P0.
  • Any incident involving data breach or security compromise is automatically P0.
  • When in doubt, over-classify. It is better to downgrade than to under-respond.

Triage Procedure

When an incident is detected (alert, user report, or monitoring), follow this triage sequence:

Step 1: Assess Impact

Answer these questions within the first 5 minutes (a query sketch follows the list):
  • Who is affected? (All users, subset, internal only)
  • What is broken? (Specific feature, entire service, data integrity)
  • When did it start? (Check monitoring for onset time)
  • Is it getting worse? (Error rate trending up, stable, or recovering)
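
To answer "when did it start?" and "is it getting worse?" without hunting through dashboards, a range query against your metrics backend works well. A minimal sketch, assuming a Prometheus-compatible API at the `monitoring.example.com` placeholder used in the runbook example later in this guide, and a hypothetical `http_errors_total` counter:

```bash
# Sketch: error-rate trend over the last hour at 1-minute resolution (GNU date).
# Endpoint and metric name are placeholders -- adapt to your stack.
curl -sG 'https://monitoring.example.com/api/v1/query_range' \
  --data-urlencode 'query=sum(rate(http_errors_total[5m]))' \
  --data-urlencode "start=$(date -u -d '1 hour ago' +%s)" \
  --data-urlencode "end=$(date -u +%s)" \
  --data-urlencode 'step=60' \
  | jq -r '.data.result[0].values[] | @tsv'
```

A rising last column means the incident is still getting worse; the first timestamp where the rate departs from baseline is your onset time.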

Step 2: Assign Severity

Based on the impact assessment, assign a severity level using the table above. Document the severity and reasoning.

Step 3: Assemble the Response Team

| Severity | Who to Notify |
|---|---|
| P0 | On-call engineer, engineering manager, VP Engineering, customer support lead, communications |
| P1 | On-call engineer, engineering manager, customer support lead |
| P2 | On-call engineer, relevant team lead |
| P3 | Assigned engineer (via ticket) |

Step 4: Designate Roles

For P0 and P1 incidents, explicitly assign these roles:

| Role | Responsibility |
|---|---|
| Incident Commander (IC) | Coordinates response, makes decisions, delegates tasks. Does NOT debug. |
| Technical Lead | Leads investigation and mitigation. Communicates findings to IC. |
| Communications Lead | Drafts and sends status updates to stakeholders and customers. |
| Scribe | Documents timeline, actions taken, and decisions in real time. |

Step 5: Open Incident Channel

Create a dedicated communication channel (Slack channel, incident bridge call); a scripted sketch follows this list:
  • Name it clearly:
    #incident-2024-03-15-payments-down
  • Pin the initial assessment and severity level.
  • All discussion and decisions happen in this channel.
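
Channel setup can be scripted so nobody improvises naming under pressure. A minimal sketch using the Slack Web API (`conversations.create`, `chat.postMessage`, `pins.add`), assuming a bot token in `SLACK_TOKEN` with the appropriate scopes; the topic suffix and the pinned text are placeholders:

```bash
# Sketch: create the incident channel, post the initial assessment, pin it.
NAME="incident-$(date -u +%Y-%m-%d)-payments-down"   # topic suffix is a placeholder
CHANNEL=$(curl -s -X POST 'https://slack.com/api/conversations.create' \
  -H "Authorization: Bearer ${SLACK_TOKEN}" \
  -d "name=${NAME}" | jq -r '.channel.id')
TS=$(curl -s -X POST 'https://slack.com/api/chat.postMessage' \
  -H "Authorization: Bearer ${SLACK_TOKEN}" \
  -d "channel=${CHANNEL}" \
  --data-urlencode 'text=P1 declared: payments failing. Initial assessment pinned here.' \
  | jq -r '.ts')
curl -s -X POST 'https://slack.com/api/pins.add' \
  -H "Authorization: Bearer ${SLACK_TOKEN}" \
  -d "channel=${CHANNEL}" -d "timestamp=${TS}" > /dev/null
```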

Communication Templates

Initial Alert

INCIDENT DECLARED: [P0/P1/P2] - [Brief description]

Impact: [Who is affected and how]
Start time: [When it began, UTC]
Current status: [Investigating / Identified / Mitigating]
Incident Commander: [Name]
Channel: #incident-[date]-[topic]

Next update in [15/30] minutes.

Status Update

INCIDENT UPDATE: [P0/P1/P2] - [Brief description]

Status: [Investigating / Identified / Mitigating / Resolved]
Duration: [Time since start]
Current state: [What is happening now]
Actions taken: [What has been tried]
Next steps: [What will be done next]

Next update in [15/30] minutes.

Resolution Notification

INCIDENT RESOLVED: [P0/P1/P2] - [Brief description]

Duration: [Total time from detection to resolution]
Root cause: [One-sentence summary]
Resolution: [What fixed it]
Customer impact: [Summary of user-facing impact]
Follow-up: Post-incident review scheduled for [date/time].

Monitoring for recurrence.

Customer-Facing Communication

[Service Name] Status Update

We are aware of an issue affecting [description of user impact].
Our team is actively working to resolve this.

Started: [Time, timezone]
Current status: [Brief, non-technical description]

We will provide updates every [30 minutes / 1 hour].
We apologize for the inconvenience.

Investigation Methodology

Follow this structured approach instead of checking components at random. Work from symptoms toward the root cause.

Step 1: Check Dashboards

Start with the service overview dashboard (a query sketch follows this list). Look for:
  • Error rate spikes (when exactly did they start?)
  • Latency increases (gradual degradation or sudden jump?)
  • Traffic anomalies (unexpected spike or drop?)
  • Resource utilization (CPU, memory, disk, connections)
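
If the dashboard is unavailable, or you want raw numbers in the incident channel, the same signals can be pulled directly from the metrics API. A sketch assuming Prometheus-style metrics; the metric names below are common conventions, not guaranteed to match yours:

```bash
# Error rate, p99 latency, traffic, and memory saturation in one pass.
for q in \
  'sum(rate(http_requests_total{status=~"5.."}[5m]))' \
  'histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))' \
  'sum(rate(http_requests_total[5m]))' \
  'avg(node_memory_Active_bytes / node_memory_MemTotal_bytes)'
do
  echo "== ${q}"
  curl -sG 'https://monitoring.example.com/api/v1/query' \
    --data-urlencode "query=${q}" | jq -c '.data.result'
done
```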

Step 2: Check Recent Deploys

Most incidents are caused by recent changes. Check:

```bash
# What was deployed recently?
git log --oneline --since="2 hours ago" origin/main

# Any recent infrastructure changes?
# Check deployment pipeline history, Terraform runs, config changes
```

Questions to answer:

- Was anything deployed in the last 2 hours?
- Was there a configuration change (feature flags, environment variables)?
- Was there an infrastructure change (scaling, migration, certificate renewal)?

Step 3: Examine Logs

Search logs filtered to the incident timeframe:

```
# Example queries (adapt to your log aggregation platform)

ELK/Kibana:  level:ERROR AND service:order-service AND @timestamp >= "2024-03-15T14:00:00"
Datadog:     service:order-service status:error
CloudWatch:  filter @message like /ERROR/ | sort @timestamp desc
```


Look for the following (a CLI sketch follows the list):

- Error messages with stack traces.
- Repeated error patterns (same error thousands of times).
- New error types that were not present before the incident.
- Correlation between errors and the timeline.

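
When logs live in CloudWatch, the repeated-pattern check can be done from the CLI. A sketch assuming a configured AWS CLI; the log group is a placeholder and the start/end times are epoch milliseconds for the incident window:

```bash
# Count distinct ERROR messages in the window, most frequent first.
aws logs filter-log-events \
  --log-group-name /ecs/order-service \
  --filter-pattern ERROR \
  --start-time 1710511200000 \
  --end-time 1710514800000 \
  --query 'events[].message' --output json \
  | jq -r '.[]' | sort | uniq -c | sort -rn | head -20
```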

Step 4: Hypothesize and Test

Based on the data gathered, form a hypothesis and test it (a few ready-made probes are sketched after the table):

| Hypothesis | How to Test |
|---|---|
| Bad deploy caused it | Compare error timeline with deploy timestamp. Roll back and observe. |
| Database is overloaded | Check connection pool, slow query log, lock contention. |
| External dependency is down | Check dependency status page, test connectivity, check timeout rates. |
| Traffic spike overwhelmed the service | Check request rate, compare to normal baseline, check auto-scaling. |
| DNS or certificate issue | Test DNS resolution, check certificate expiry, verify SSL handshake. |
| Memory leak | Check memory usage trend, look for OOM kills in system logs. |
| Data corruption | Query for inconsistent data, check recent migration or backfill jobs. |
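
Several of these tests reduce to one-liners worth keeping handy. A sketch with placeholder hosts; adapt each to your environment:

```bash
# DNS or certificate issue: does the name resolve, and when does the cert expire?
dig +short api.example.com
echo | openssl s_client -connect api.example.com:443 -servername api.example.com 2>/dev/null \
  | openssl x509 -noout -enddate

# External dependency down: does its health endpoint answer within 2 seconds?
curl -s -o /dev/null -w '%{http_code}\n' -m 2 https://payments.example.com/healthz

# Memory leak: any OOM kills in the kernel log during the incident window?
journalctl -k --since "2 hours ago" | grep -iE 'killed process|out of memory'
```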

Step 5: Verify the Fix

After applying a fix, confirm the following before standing down (a monitoring loop is sketched after this list):
  • Error rate returning to baseline.
  • Latency returning to normal.
  • No new error patterns appearing.
  • Affected functionality manually verified.
  • Monitor for at least 15 minutes (P0/P1) or 30 minutes (P2) before declaring resolved.
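
The monitoring window can be made mechanical rather than "watch the graph for a while". A minimal sketch, polling the same assumed Prometheus-style endpoint once a minute for the 15-minute P0/P1 window:

```bash
# Poll the error rate once a minute for 15 minutes; confirm a return to baseline.
for i in $(seq 15); do
  rate=$(curl -sG 'https://monitoring.example.com/api/v1/query' \
    --data-urlencode 'query=sum(rate(http_errors_total[5m]))' \
    | jq -r '.data.result[0].value[1] // "n/a"')
  echo "$(date -u +%H:%M:%SZ) error rate: ${rate}"
  sleep 60
done
```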

Common Mitigation Actions

When the root cause is identified (or even before, to reduce impact), apply the appropriate mitigation (example commands for several of these follow the table):

| Action | When to Use | How | Risk |
|---|---|---|---|
| Rollback | Bad deploy identified | Revert to previous known-good version via deployment pipeline | May lose new features; verify database compatibility |
| Feature flag toggle | New feature causing issues | Disable the flag in your feature management system | Requires feature flags to be in place |
| Horizontal scaling | Service overwhelmed by traffic | Increase instance count via auto-scaler or manual scaling | Increased cost; may not help if bottleneck is downstream |
| Cache clear | Stale or corrupted cached data | Flush application cache (Redis `FLUSHDB`, CDN purge) | Temporary increase in origin load after flush |
| Circuit breaker | Failing dependency cascading | Activate circuit breaker to fail fast instead of waiting | Gracefully degraded experience for users |
| Traffic shedding | Total overload | Rate limit or redirect traffic, enable maintenance page | Users see errors or degraded service |
| Database failover | Primary database unresponsive | Promote replica to primary (if configured) | Brief downtime during promotion; verify replication lag |
| DNS redirect | Entire region or provider down | Update DNS to point to backup region or provider | Propagation delay (use low TTL proactively) |
| Restart | Process stuck, memory leak | Rolling restart of application instances | Brief capacity reduction during restart |
| Hotfix | Small targeted code fix needed | Fast-track a minimal change through deployment pipeline | Bypasses normal review; must be reviewed post-incident |
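
For the mitigations that map to single commands, examples under the same Kubernetes and Redis assumptions used elsewhere in this guide (resource names are placeholders):

```bash
# Horizontal scaling: bump the replica count.
kubectl scale deployment/order-service --replicas=10

# Restart: rolling restart of application instances.
kubectl rollout restart deployment/order-service

# Cache clear: flush the current Redis DB -- coordinate first; origin load will spike.
redis-cli -h cache.example.com FLUSHDB
```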

Rollback Procedure

```bash
# Verify the last known-good version
git log --oneline -10 origin/main

# Tag the rollback point
git tag -a incident-rollback-2024-03-15 -m "Rolling back due to P1 incident"

# Trigger deployment of previous version
# (Adapt to your deployment pipeline)
# Example: Kubernetes
kubectl rollout undo deployment/order-service

# Verify rollback is deployed
kubectl rollout status deployment/order-service

# Monitor error rate and confirm reduction
```

Post-Incident Review

Conduct a post-incident review (PIR) within 48 hours of resolution for P0/P1 incidents and within one week for P2 incidents.
Post-Incident Review Template

```markdown
# Post-Incident Review: [Incident Title]

Date: [date] | Severity: [P0/P1/P2] | Duration: [time] | Author: [name]

## Summary

[2-3 sentences: what happened, the impact, and the resolution.]

## Timeline (UTC)

| Time | Event |
|---|---|
| 14:00 | Alert fires: order error rate > 5% |
| 14:05 | P1 declared, incident channel created |
| 14:15 | Recent deploy at 13:45 identified as suspect |
| 14:25 | Rollback deployed |
| 14:45 | Resolved, monitoring for recurrence |

## Impact

Users affected, duration, revenue/SLA impact, data impact.

## Root Cause

[Detailed technical explanation of what went wrong and why.]

## Five Whys

1. Why did orders fail? -> Payment validation threw an exception.
2. Why did validation throw? -> Null value for a non-nullable field.
3. Why was the field null? -> Migration added column but did not backfill.
4. Why was backfill missed? -> No checklist step for backfill verification.
5. Why no checklist step? -> Migration procedures were undocumented.

## Action Items

| Action | Owner | Priority | Due Date | Status |
|---|---|---|---|---|
| Add integration test for null field validation | @alice | High | YYYY-MM-DD | TODO |
| Lower alert threshold from 5% to 2% | @bob | High | YYYY-MM-DD | TODO |
| Add feature flag to payment flow | @carol | Medium | YYYY-MM-DD | TODO |

## Lessons Learned

- What went well: [e.g., quick detection, rapid team assembly]
- What could improve: [e.g., rollback automation, test coverage]
```

Blameless Culture Principles

Post-incident reviews are learning opportunities, not blame sessions. Adhere to these principles:

| Principle | Practice |
|---|---|
| Assume good intent | People made the best decisions they could with the information they had at the time. |
| Focus on systems, not individuals | Ask "what allowed this to happen?" not "who caused this?" |
| Separate the what from the who | Describe actions taken without naming individuals in the root cause. Use role titles if context is needed. |
| Reward transparency | Publicly thank people who report incidents, share mistakes, or identify risks. |
| Follow through on action items | PIR action items are tracked and completed. Unfixed systemic issues lead to repeat incidents. |
| Share learnings broadly | Publish PIR summaries (redacted if needed) so other teams learn too. |

Runbook Authoring Guide

A runbook is a step-by-step guide for responding to a specific alert or operational scenario.

Runbook Structure

Every runbook follows this structure:
````markdown
# Runbook: [Alert or Scenario Name]

## When to Use

[Describe the alert, symptom, or scenario that triggers this runbook.]

## Prerequisites

- Access to [systems, dashboards, tools]
- Permissions: [required roles or access levels]

## Steps

### 1. Verify the Problem

```bash
curl -s https://monitoring.example.com/api/v1/query \
  --data-urlencode 'query=rate(http_errors_total{service="order-service"}[5m])'
```

Expected: Error rate below 0.01. If above, continue. If normal, check thresholds and close.

### 2. Apply Mitigation

Option A: Restart

```bash
kubectl rollout restart deployment/order-service
```

Option B: Rollback

```bash
kubectl rollout undo deployment/order-service
```

### 3. Verify Resolution

Expected: Error rate drops below 0.01 within 5 minutes.

### 4. Escalation

If unresolved: escalate to [team], contact [channel/phone], provide [context].
````

Runbook Best Practices

| Practice | Reason |
|---|---|
| Use exact commands, not descriptions | Under stress, responders should copy-paste, not interpret |
| Include expected output | So responders know if the command worked |
| Provide verification after each step | Catch issues early, do not proceed blindly |
| Include a rollback for each step | If a mitigation step makes things worse |
| Test runbooks regularly | Outdated runbooks cause confusion during real incidents |
| Date-stamp and version runbooks | Know when it was last verified |
| Link from alert to runbook | Reduce time-to-runbook to one click |

Incident Response Checklist

Quick reference during an active incident:
  • Impact assessed (who, what, when, trending).
  • Severity assigned and documented.
  • Incident channel or bridge opened.
  • Roles assigned (IC, Technical Lead, Comms, Scribe).
  • Initial stakeholder notification sent.
  • Timeline being recorded in real time.
  • Investigation following structured methodology (dashboards, deploys, logs, hypothesize, test).
  • Mitigation applied and impact reducing.
  • Resolution verified with monitoring data.
  • All-clear communication sent.
  • Post-incident review scheduled (within 48 hours for P0/P1).
  • Action items created with owners and due dates.