Incident Response
Severity Levels
Assign a severity level immediately upon incident detection. Severity determines response urgency, communication cadence, and escalation path.
| Severity | Name | Description | Examples | Response Time | Update Cadence | Responders |
|---|---|---|---|---|---|---|
| P0 | Total Outage | Complete service unavailability. All users affected. Revenue-impacting. | Site completely down, data corruption, security breach with active exploitation | < 15 minutes | Every 15 minutes | All hands on deck, executive notification |
| P1 | Major Degradation | Core functionality severely impaired. Large portion of users affected. | Payment processing broken, authentication failing for most users, major data pipeline stalled | < 30 minutes | Every 30 minutes | On-call engineer + team lead, stakeholder notification |
| P2 | Partial Impact | Non-core functionality broken or core functionality degraded for a subset of users. | Search feature down, slow responses in one region, intermittent errors for some users | < 2 hours | Every 2 hours | On-call engineer |
| P3 | Minor Issue | Cosmetic issues, minor bugs, or issues with workarounds available. | UI glitch, non-critical background job delayed, minor data inconsistency | Next business day | Daily (if ongoing) | Assigned engineer |
Escalation Rules
- If a P2 is not resolved within 4 hours, escalate to P1.
- If a P1 is not resolved within 2 hours, escalate to P0.
- Any incident involving data breach or security compromise is automatically P0.
- When in doubt, over-classify. It is better to downgrade than to under-respond.
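The first two rules are mechanical enough to automate. Below is a minimal sketch of a timer check that could run from cron, assuming your incident tracker can supply a start time and severity; both inputs are placeholders, not part of any tooling described here.
```bash
#!/usr/bin/env bash
# Hypothetical escalation timer: flags a P2 older than 4h or a P1 older than 2h.
# INCIDENT_START (epoch seconds) and SEVERITY would come from your incident tracker.
INCIDENT_START="$1"   # e.g. $(date -d '2024-03-15T14:00:00Z' +%s)
SEVERITY="$2"         # P1 or P2

now=$(date +%s)
age_minutes=$(( (now - INCIDENT_START) / 60 ))

case "$SEVERITY" in
  P2) limit=240 ;;  # escalate to P1 after 4 hours
  P1) limit=120 ;;  # escalate to P0 after 2 hours
  *)  exit 0 ;;
esac

if (( age_minutes >= limit )); then
  echo "ESCALATE: $SEVERITY incident is ${age_minutes}m old (limit ${limit}m)"
fi
```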
Triage Procedure
When an incident is detected (alert, user report, or monitoring), follow this triage sequence:
Step 1: Assess Impact
Answer these questions within the first 5 minutes:
- Who is affected? (All users, subset, internal only)
- What is broken? (Specific feature, entire service, data integrity)
- When did it start? (Check monitoring for onset time)
- Is it getting worse? (Error rate trending up, stable, or recovering)
Step 2: Assign Severity
Based on the impact assessment, assign a severity level using the table above. Document the severity and reasoning.
Step 3: Assemble the Response Team
| Severity | Who to Notify |
|---|---|
| P0 | On-call engineer, engineering manager, VP Engineering, customer support lead, communications |
| P1 | On-call engineer, engineering manager, customer support lead |
| P2 | On-call engineer, relevant team lead |
| P3 | Assigned engineer (via ticket) |
Step 4: Designate Roles
For P0 and P1 incidents, explicitly assign these roles:
| Role | Responsibility |
|---|---|
| Incident Commander (IC) | Coordinates response, makes decisions, delegates tasks. Does NOT debug. |
| Technical Lead | Leads investigation and mitigation. Communicates findings to IC. |
| Communications Lead | Drafts and sends status updates to stakeholders and customers. |
| Scribe | Documents timeline, actions taken, and decisions in real time. |
Step 5: Open Incident Channel
Create a dedicated communication channel (Slack channel, incident bridge call):
- Name it clearly: #incident-2024-03-15-payments-down
- Pin the initial assessment and severity level.
- All discussion and decisions happen in this channel.
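If you use Slack, channel creation can be scripted. The following is a minimal sketch against Slack's Web API (`conversations.create`, `chat.postMessage`, `pins.add`), assuming a bot token with the relevant scopes and `jq` installed; the token value, channel name, and message text are placeholders.
```bash
# Sketch: create the incident channel and pin the initial assessment.
# Assumes a Slack bot token with channels:manage, chat:write, and pins:write scopes.
SLACK_TOKEN="xoxb-your-token"        # placeholder
CHANNEL_NAME="incident-2024-03-15-payments-down"

# Create the channel (Slack Web API: conversations.create)
CHANNEL_ID=$(curl -s -X POST https://slack.com/api/conversations.create \
  -H "Authorization: Bearer $SLACK_TOKEN" \
  -d "name=$CHANNEL_NAME" | jq -r '.channel.id')

# Post the initial assessment, then pin it (chat.postMessage + pins.add)
TS=$(curl -s -X POST https://slack.com/api/chat.postMessage \
  -H "Authorization: Bearer $SLACK_TOKEN" \
  -d "channel=$CHANNEL_ID" \
  --data-urlencode "text=P1 declared: payments down. IC: [name]." | jq -r '.ts')

curl -s -X POST https://slack.com/api/pins.add \
  -H "Authorization: Bearer $SLACK_TOKEN" \
  -d "channel=$CHANNEL_ID" -d "timestamp=$TS" >/dev/null
```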
Communication Templates
Initial Alert
INCIDENT DECLARED: [P0/P1/P2] - [Brief description]
Impact: [Who is affected and how]
Start time: [When it began, UTC]
Current status: [Investigating / Identified / Mitigating]
Incident Commander: [Name]
Channel: #incident-[date]-[topic]
Next update in [15/30] minutes.
Status Update
INCIDENT UPDATE: [P0/P1/P2] - [Brief description]
Status: [Investigating / Identified / Mitigating / Resolved]
Duration: [Time since start]
Current state: [What is happening now]
Actions taken: [What has been tried]
Next steps: [What will be done next]
Next update in [15/30] minutes.
Resolution Notification
INCIDENT RESOLVED: [P0/P1/P2] - [Brief description]
Duration: [Total time from detection to resolution]
Root cause: [One-sentence summary]
Resolution: [What fixed it]
Customer impact: [Summary of user-facing impact]
Follow-up: Post-incident review scheduled for [date/time].
Monitoring for recurrence.
Customer-Facing Communication
[Service Name] Status Update
We are aware of an issue affecting [description of user impact].
Our team is actively working to resolve this.
Started: [Time, timezone]
Current status: [Brief, non-technical description]
We will provide updates every [30 minutes / 1 hour].
We apologize for the inconvenience.
Investigation Methodology
Follow this structured approach rather than randomly checking things. Work from symptoms toward root cause.
Step 1: Check Dashboards
Start with the service overview dashboard. Look for:
- Error rate spikes (when exactly did they start?)
- Latency increases (gradual degradation or sudden jump?)
- Traffic anomalies (unexpected spike or drop?)
- Resource utilization (CPU, memory, disk, connections)
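If your metrics live in Prometheus, these four checks map to instant queries. A sketch follows; the endpoint and metric names (`http_requests_total` and friends) are placeholders for your own schema.
```bash
# Hypothetical PromQL spot-checks against a Prometheus HTTP API.
# Metric and label names are placeholders; adapt to your setup.
PROM="https://prometheus.example.com/api/v1/query"

# Error rate: 5xx responses per second over the last 5 minutes
curl -sG "$PROM" --data-urlencode \
  'query=sum(rate(http_requests_total{status=~"5.."}[5m]))'

# Latency: p99 request duration
curl -sG "$PROM" --data-urlencode \
  'query=histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))'

# Traffic: total request rate
curl -sG "$PROM" --data-urlencode \
  'query=sum(rate(http_requests_total[5m]))'

# Resources: container memory usage for the suspect service
curl -sG "$PROM" --data-urlencode \
  'query=max(container_memory_working_set_bytes{pod=~"order-service.*"})'
```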
Step 2: Check Recent Deploys
Most incidents are caused by recent changes. Check:
```bash
# What was deployed recently?
git log --oneline --since="2 hours ago" origin/main

# Any recent infrastructure changes?
# Check deployment pipeline history, Terraform runs, config changes
```
Questions to answer:
- Was anything deployed in the last 2 hours?
- Was there a configuration change (feature flags, environment variables)?
- Was there an infrastructure change (scaling, migration, certificate renewal)?
Step 3: Examine Logs
Search logs filtered to the incident timeframe:
```
# Example queries (adapt to your log aggregation platform)
ELK/Kibana: level:ERROR AND service:order-service AND @timestamp >= "2024-03-15T14:00:00"
Datadog:    service:order-service status:error
CloudWatch: filter @message like /ERROR/ | sort @timestamp desc
```
Look for:
- Error messages with stack traces.
- Repeated error patterns (same error thousands of times).
- New error types that were not present before the incident.
- Correlation between errors and the timeline.
Step 4: Hypothesize and Test
Based on data gathered, form a hypothesis and test it:
| Hypothesis | How to Test |
|---|---|
| Bad deploy caused it | Compare error timeline with deploy timestamp. Roll back and observe. |
| Database is overloaded | Check connection pool, slow query log, lock contention. |
| External dependency is down | Check dependency status page, test connectivity, check timeout rates. |
| Traffic spike overwhelmed the service | Check request rate, compare to normal baseline, check auto-scaling. |
| DNS or certificate issue | Test DNS resolution, check certificate expiry, verify SSL handshake. |
| Memory leak | Check memory usage trend, look for OOM kills in system logs. |
| Data corruption | Query for inconsistent data, check recent migration or backfill jobs. |
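Several of these tests reduce to one-liners. A few hedged examples, with placeholder hostnames:
```bash
# External dependency: test connectivity and response time
curl -s -o /dev/null -w '%{http_code} %{time_total}s\n' https://api.dependency.example.com/health

# DNS: confirm the record resolves as expected
dig +short api.example.com

# Certificate: check the expiry date
echo | openssl s_client -connect api.example.com:443 -servername api.example.com 2>/dev/null \
  | openssl x509 -noout -enddate

# Memory leak / OOM kills: search the kernel log (Linux)
dmesg -T | grep -i 'oom\|killed process' | tail -5
```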
Step 5: Verify the Fix
After applying a fix, confirm all of the following:
- Error rate returning to baseline.
- Latency returning to normal.
- No new error patterns appearing.
- Affected functionality manually verified.
- Monitor for at least 15 minutes (P0/P1) or 30 minutes (P2) before declaring resolved.
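A simple watch loop keeps the monitoring window honest. Below is a sketch assuming a Prometheus endpoint and `jq`; the endpoint and query are placeholders.
```bash
# Hypothetical post-fix watch: sample the error rate every 60s for 15 minutes.
PROM="https://prometheus.example.com/api/v1/query"
QUERY='sum(rate(http_requests_total{status=~"5.."}[5m]))'

for i in $(seq 1 15); do
  rate=$(curl -sG "$PROM" --data-urlencode "query=$QUERY" \
    | jq -r '.data.result[0].value[1] // "0"')
  echo "$(date -u +%H:%M:%SZ) error rate: $rate"
  sleep 60
done
```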
Common Mitigation Actions
When the root cause is identified (or even before, to reduce impact), apply the appropriate mitigation:
| Action | When to Use | How | Risk |
|---|---|---|---|
| Rollback | Bad deploy identified | Revert to previous known-good version via deployment pipeline | May lose new features; verify database compatibility |
| Feature flag toggle | New feature causing issues | Disable the flag in your feature management system | Requires feature flags to be in place |
| Horizontal scaling | Service overwhelmed by traffic | Increase instance count via auto-scaler or manual scaling | Increased cost; may not help if bottleneck is downstream |
| Cache clear | Stale or corrupted cached data | Flush application cache (e.g., Redis) | Temporary increase in origin load after flush |
| Circuit breaker | Failing dependency cascading | Activate circuit breaker to fail fast instead of waiting | Gracefully degraded experience for users |
| Traffic shedding | Total overload | Rate limit or redirect traffic, enable maintenance page | Users see errors or degraded service |
| Database failover | Primary database unresponsive | Promote replica to primary (if configured) | Brief downtime during promotion; verify replication lag |
| DNS redirect | Entire region or provider down | Update DNS to point to backup region or provider | Propagation delay (use low TTL proactively) |
| Restart | Process stuck, memory leak | Rolling restart of application instances | Brief capacity reduction during restart |
| Hotfix | Small targeted code fix needed | Fast-track a minimal change through deployment pipeline | Bypasses normal review; must be reviewed post-incident |
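For a few of these actions, concrete commands look like the following, assuming Kubernetes, Redis, and PostgreSQL; these are sketches, not prescriptions for your environment.
```bash
# Horizontal scaling: bump the replica count (Kubernetes)
kubectl scale deployment/order-service --replicas=10

# Cache clear: flush a single Redis database (prefer targeted DEL where possible)
redis-cli -h cache.example.com -n 0 FLUSHDB

# Restart: rolling restart of application instances
kubectl rollout restart deployment/order-service

# Database failover: promote a PostgreSQL replica (run on the replica host,
# only if replication lag has been verified)
# pg_ctl promote -D /var/lib/postgresql/data
```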
Rollback Procedure
```bash
# Verify the last known-good version
git log --oneline -10 origin/main

# Tag the rollback point
git tag -a incident-rollback-2024-03-15 -m "Rolling back due to P1 incident"

# Trigger deployment of the previous version
# (Adapt to your deployment pipeline. Example: Kubernetes)
kubectl rollout undo deployment/order-service

# Verify the rollback is deployed
kubectl rollout status deployment/order-service

# Monitor error rate and confirm reduction
```
Post-Incident Review
Conduct a post-incident review (PIR) within 48 hours of resolution for P0/P1 incidents and within one week for P2 incidents.
Post-Incident Review Template
```markdown
# Post-Incident Review: [Incident Title]
Date: [date] | Severity: [P0/P1/P2] | Duration: [time] | Author: [name]

## Summary
[2-3 sentences: what happened, the impact, and the resolution.]

## Timeline (UTC)
| Time | Event |
|---|---|
| 14:00 | Alert fires: order error rate > 5% |
| 14:05 | P1 declared, incident channel created |
| 14:15 | Recent deploy at 13:45 identified as suspect |
| 14:25 | Rollback deployed |
| 14:45 | Resolved, monitoring for recurrence |

## Impact
Users affected, duration, revenue/SLA impact, data impact.

## Root Cause
[Detailed technical explanation of what went wrong and why.]

## Five Whys
- Why did orders fail? -> Payment validation threw an exception.
- Why did validation throw? -> Null value for a non-nullable field.
- Why was the field null? -> Migration added column but did not backfill.
- Why was backfill missed? -> No checklist step for backfill verification.
- Why no checklist step? -> Migration procedures were undocumented.

## Action Items
| Action | Owner | Priority | Due Date | Status |
|---|---|---|---|---|
| Add integration test for null field validation | @alice | High | YYYY-MM-DD | TODO |
| Lower alert threshold from 5% to 2% | @bob | High | YYYY-MM-DD | TODO |
| Add feature flag to payment flow | @carol | Medium | YYYY-MM-DD | TODO |

## Lessons Learned
- What went well: [e.g., quick detection, rapid team assembly]
- What could improve: [e.g., rollback automation, test coverage]
```
Blameless Culture Principles
Post-incident reviews are learning opportunities, not blame sessions. Adhere to these principles:
| Principle | Practice |
|---|---|
| Assume good intent | People made the best decisions they could with the information they had at the time. |
| Focus on systems, not individuals | Ask "what allowed this to happen?" not "who caused this?" |
| Separate the what from the who | Describe actions taken without naming individuals in the root cause. Use role titles if context is needed. |
| Reward transparency | Publicly thank people who report incidents, share mistakes, or identify risks. |
| Follow through on action items | PIR action items are tracked and completed. Unfixed systemic issues lead to repeat incidents. |
| Share learnings broadly | Publish PIR summaries (redacted if needed) so other teams learn too. |
Runbook Authoring Guide
A runbook is a step-by-step guide for responding to a specific alert or operational scenario.
Runbook Structure
Every runbook follows this structure:
````markdown
# Runbook: [Alert or Scenario Name]

## When to Use
[Describe the alert, symptom, or scenario that triggers this runbook.]

## Prerequisites
- Access to [systems, dashboards, tools]
- Permissions: [required roles or access levels]

## Steps

### 1. Verify the Problem
```bash
curl -s https://monitoring.example.com/api/v1/query \
  --data-urlencode 'query=rate(http_errors_total{service="order-service"}[5m])'
```
Expected: Error rate below 0.01. If above, continue. If normal, check thresholds and close.

### 2. Apply Mitigation
```bash
# Option A: Restart
kubectl rollout restart deployment/order-service

# Option B: Rollback
kubectl rollout undo deployment/order-service
```

### 3. Verify Resolution
Expected: Error rate drops below 0.01 within 5 minutes.

### 4. Escalation
If unresolved: escalate to [team], contact [channel/phone], provide [context].
````
Runbook Best Practices
4. 升级流程
| Practice | Reason |
|---|---|
| Use exact commands, not descriptions | Under stress, responders should copy-paste, not interpret |
| Include expected output | So responders know if the command worked |
| Provide verification after each step | Catch issues early, do not proceed blindly |
| Include a rollback for each step | If a mitigation step makes things worse |
| Test runbooks regularly | Outdated runbooks cause confusion during real incidents |
| Date-stamp and version runbooks | Know when it was last verified |
| Link from alert to runbook | Reduce time-to-runbook to one click |
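The last row is often implemented as an annotation on the alert itself. Below is a sketch of a Prometheus alert rule carrying a `runbook_url` annotation (a widespread convention), written via a shell heredoc; the file path, expression, and URL are placeholders.
```bash
# Hypothetical Prometheus alert rule linking straight to its runbook.
cat >> alerts/order-service.rules.yml <<'EOF'
groups:
  - name: order-service
    rules:
      - alert: OrderErrorRateHigh
        expr: sum(rate(http_errors_total{service="order-service"}[5m])) > 0.01
        for: 5m
        labels:
          severity: P1
        annotations:
          summary: "Order error rate above 1%"
          runbook_url: https://wiki.example.com/runbooks/order-error-rate
EOF
```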
Incident Response Checklist
Quick reference during an active incident:
- Impact assessed (who, what, when, trending).
- Severity assigned and documented.
- Incident channel or bridge opened.
- Roles assigned (IC, Technical Lead, Comms, Scribe).
- Initial stakeholder notification sent.
- Timeline being recorded in real time.
- Investigation following structured methodology (dashboards, deploys, logs, hypothesize, test).
- Mitigation applied and impact reducing.
- Resolution verified with monitoring data.
- All-clear communication sent.
- Post-incident review scheduled (within 48 hours for P0/P1).
- Action items created with owners and due dates.