incident-commander

Incident Commander Skill

Incident Commander技能

Category: Engineering Team
Tier: POWERFUL
Author: Claude Skills Team
Version: 1.0.0
Last Updated: February 2026

分类: 工程团队
等级: 高级
作者: Claude Skills团队
版本: 1.0.0
最后更新: 2026年2月

Overview

概述

The Incident Commander skill provides an incident response framework for managing technology incidents from detection through resolution and post-incident review. It draws on practices used by SRE and DevOps teams operating at scale and provides structured tools for severity classification, timeline reconstruction, and post-incident analysis.

Incident Commander技能提供了一套全面的事件响应框架，用于管理从检测、解决到事后复盘的全流程技术事件。该技能采用了大规模SRE和DevOps团队久经考验的实践方法，为事件严重程度分类、时间线重建和全面的事后分析提供结构化工具。

Key Features

核心功能

Automated Severity Classification - Intelligent incident triage based on impact and urgency metrics
Timeline Reconstruction - Transform scattered logs and events into coherent incident narratives
Post-Incident Review Generation - Structured PIRs with multiple RCA frameworks
Communication Templates - Pre-built templates for stakeholder updates and escalations
Runbook Integration - Generate actionable runbooks from incident patterns

自动严重程度分类 - 基于影响范围和紧急度指标的智能事件分流
时间线重建 - 将分散的日志和事件转化为连贯的事件过程叙事
事后复盘报告生成 - 包含多种RCA框架的结构化PIR文档
沟通模板 - 预构建的利益相关方更新与升级沟通模板
运行手册集成 - 根据事件模式生成可执行的运行手册

Skills Included

包含的工具

Core Tools

核心工具

Incident Classifier (`incident_classifier.py`)
- Analyzes incident descriptions and outputs severity levels
- Recommends response teams and initial actions
- Generates communication templates based on severity
Timeline Reconstructor (`timeline_reconstructor.py`)
- Processes timestamped events from multiple sources
- Reconstructs chronological incident timeline
- Identifies gaps and provides duration analysis
PIR Generator (`pir_generator.py`)
- Creates comprehensive Post-Incident Review documents
- Applies multiple RCA frameworks (5 Whys, Fishbone, Timeline)
- Generates actionable follow-up items

事件分类器 (`incident_classifier.py`)
- 分析事件描述并输出严重等级
- 推荐响应团队和初始行动方案
- 根据严重程度生成对应的沟通模板
时间线重建工具 (`timeline_reconstructor.py`)
- 处理多来源的带时间戳事件
- 重建按时间顺序排列的事件时间线
- 识别时间线缺口并进行时长分析
PIR生成器 (`pir_generator.py`)
- 创建全面的事后复盘（Post-Incident Review）文档
- 应用多种RCA框架（5 Whys、鱼骨图、时间线法）
- 生成可执行的后续改进项
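
The timeline reconstruction steps listed above (merge timestamped events from several sources, order them, flag gaps, measure duration) can be illustrated with a minimal sketch. This is not the code of `timeline_reconstructor.py`; the `Event` class, the function names, and the sample events are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    """Hypothetical event record; the real tool's schema may differ."""
    timestamp: datetime
    source: str
    message: str


def reconstruct_timeline(*sources: list[Event]) -> list[Event]:
    """Merge events from several sources into one chronological list."""
    merged = [event for source in sources for event in source]
    return sorted(merged, key=lambda e: e.timestamp)


def find_gaps(timeline: list[Event], threshold: timedelta) -> list[tuple[Event, Event]]:
    """Return consecutive event pairs separated by more than `threshold`."""
    return [
        (earlier, later)
        for earlier, later in zip(timeline, timeline[1:])
        if later.timestamp - earlier.timestamp > threshold
    ]


# Sample data (invented for the example).
alerts = [Event(datetime(2026, 2, 1, 10, 0), "pagerduty", "High error rate alert")]
chat = [
    Event(datetime(2026, 2, 1, 10, 45), "slack", "Rollback started"),
    Event(datetime(2026, 2, 1, 10, 5), "slack", "Incident Commander assigned"),
]

timeline = reconstruct_timeline(alerts, chat)
gaps = find_gaps(timeline, threshold=timedelta(minutes=30))
total_duration = timeline[-1].timestamp - timeline[0].timestamp
```

A gap in this sense is only a stretch of time with no recorded events; whether it reflects missing logs or genuine inactivity still needs a human check.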

Incident Response Framework

事件响应框架

Severity Classification System

严重程度分类体系

SEV1 - Critical Outage

SEV1 - 严重中断

Definition: Complete service failure affecting all users or critical business functions

Characteristics:

Customer-facing services completely unavailable
Data loss or corruption affecting users
Security breaches with customer data exposure
Revenue-generating systems down
SLA violations with financial penalties

Response Requirements:

Immediate escalation to on-call engineer
Incident Commander assigned within 5 minutes
Executive notification within 15 minutes
Public status page update within 15 minutes
War room established
All hands on deck if needed

Communication Frequency: Every 15 minutes until resolution

定义: 影响所有用户或关键业务功能的完全服务故障

特征:

面向客户的服务完全不可用
影响用户的数据丢失或损坏
涉及客户数据泄露的安全事件
营收相关系统宕机
违反服务等级协议（SLA）并面临财务处罚

响应要求:

立即升级至值班工程师
5分钟内指定事件指挥官
15分钟内通知管理层
15分钟内更新公共状态页面
建立应急指挥室
必要时全员响应

沟通频率: 解决前每15分钟更新一次

SEV2 - Major Impact

SEV2 - 重大影响

Definition: Significant degradation affecting a subset of users or non-critical functions

Characteristics:

Partial service degradation (>25% of users affected)
Performance issues causing user frustration
Non-critical features unavailable
Internal tool failures impacting productivity
Data inconsistencies not affecting user experience

Response Requirements:

On-call engineer response within 15 minutes
Incident Commander assigned within 30 minutes
Status page update within 30 minutes
Stakeholder notification within 1 hour
Regular team updates

Communication Frequency: Every 30 minutes during active response

定义: 影响部分用户或非关键功能的严重服务降级

特征:

部分服务降级（影响超过25%的用户）
导致用户不满的性能问题
非关键功能不可用
影响生产力的内部工具故障
不影响用户体验的数据不一致问题

响应要求:

值班工程师15分钟内响应
30分钟内指定事件指挥官
30分钟内更新状态页面
1小时内通知利益相关方
定期向团队更新进度

沟通频率: 事件响应期间每30分钟更新一次

SEV3 - Minor Impact

SEV3 - 轻微影响

Definition: Limited impact with workarounds available

Characteristics:

Single feature or component affected
<25% of users impacted
Workarounds available
Performance degradation not significantly impacting UX
Non-urgent monitoring alerts

Response Requirements:

Response within 2 hours during business hours
Next-business-day response acceptable outside business hours
Internal team notification
Optional status page update

Communication Frequency: At key milestones only

定义: 影响范围有限，存在可用的替代方案

特征:

单个功能或组件受影响
影响用户占比<25%
存在可用的替代方案
性能降级未显著影响用户体验
非紧急的监控告警

响应要求:

工作时间内2小时内响应
非工作时间可次日响应
通知内部团队
可选更新状态页面

沟通频率: 仅在关键节点更新

SEV4 - Low Impact

SEV4 - 低影响

Definition: Minimal impact, cosmetic issues, or planned maintenance

Characteristics:

Cosmetic bugs
Documentation issues
Logging or monitoring gaps
Performance issues with no user impact
Development/test environment issues

Response Requirements:

Response within 1-2 business days
Standard ticket/issue tracking
No special escalation required

Communication Frequency: Standard development cycle updates

定义: 影响极小，仅为表面问题或计划内维护

特征:

界面显示类bug
文档问题
日志或监控缺口
无用户影响的性能问题
开发/测试环境问题

响应要求:

1-2个工作日内响应
使用标准工单/问题追踪流程
无需特殊升级流程

沟通频率: 按照标准开发周期更新
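
The severity definitions above can be read as an ordered set of rules, sketched below. This is an illustrative reading, not the logic of `incident_classifier.py`: the `IncidentSignal` fields and the `classify` function are hypothetical. The text defines SEV2 as more than 25% of users and SEV3 as fewer than 25%, so treating exactly 25% as SEV3 is an assumption made here.

```python
from dataclasses import dataclass


@dataclass
class IncidentSignal:
    """Hypothetical triage inputs derived from the characteristics above."""
    affected_user_pct: float             # 0-100, share of users affected
    service_unavailable: bool = False    # customer-facing service completely down
    data_loss: bool = False              # data loss or corruption affecting users
    customer_data_exposed: bool = False  # security breach exposing customer data
    revenue_system_down: bool = False    # revenue-generating system down


def classify(signal: IncidentSignal) -> str:
    """Map a signal to SEV1-SEV4, checking the most severe rules first."""
    if (signal.service_unavailable or signal.data_loss
            or signal.customer_data_exposed or signal.revenue_system_down):
        return "SEV1"
    if signal.affected_user_pct > 25:
        return "SEV2"
    if signal.affected_user_pct > 0:
        # Assumption: exactly 25% lands here because the text leaves it undefined.
        return "SEV3"
    return "SEV4"


outage = classify(IncidentSignal(affected_user_pct=100, service_unavailable=True))
degraded = classify(IncidentSignal(affected_user_pct=40))
minor = classify(IncidentSignal(affected_user_pct=5))
cosmetic = classify(IncidentSignal(affected_user_pct=0))
```

Checking SEV1 conditions first means a data-loss incident is never downgraded just because few users are affected.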

Incident Commander Role

事件指挥官角色

Primary Responsibilities

主要职责

Command and Control
- Own the incident response process
- Make critical decisions about resource allocation
- Coordinate between technical teams and stakeholders
- Maintain situational awareness across all response streams
Communication Hub
- Provide regular updates to stakeholders
- Manage external communications (status pages, customer notifications)
- Facilitate effective communication between response teams
- Shield responders from external distractions
Process Management
- Ensure proper incident tracking and documentation
- Drive toward resolution while maintaining quality
- Coordinate handoffs between team members
- Plan and execute rollback strategies if needed
Post-Incident Leadership
- Ensure thorough post-incident reviews are conducted
- Drive implementation of preventive measures
- Share learnings with broader organization

指挥与控制
- 主导事件响应流程
- 做出资源分配的关键决策
- 协调技术团队与利益相关方
- 掌握所有响应环节的实时状态
沟通枢纽
- 定期向利益相关方更新进度
- 管理外部沟通（状态页面、客户通知）
- 促进响应团队间的有效沟通
- 为响应人员屏蔽外部干扰
流程管理
- 确保事件的跟踪与文档记录完整
- 在保证质量的前提下推动问题解决
- 协调团队成员间的工作交接
- 必要时规划并执行回滚策略
事后复盘领导
- 确保开展全面的事后复盘
- 推动预防措施的落地
- 在整个组织内分享经验教训

Decision-Making Framework

决策框架

Emergency Decisions (SEV1/2):

Incident Commander has full authority
Bias toward action over analysis
Document decisions for later review
Consult subject matter experts but don't get blocked

Resource Allocation:

Can pull in any necessary team members
Authority to escalate to senior leadership
Can approve emergency spend for external resources
Make the call on communication channels and timing

Technical Decisions:

Lean on technical leads for implementation details
Make final calls on trade-offs between speed and risk
Approve rollback vs. fix-forward strategies
Coordinate testing and validation approaches

紧急决策（SEV1/2）:

事件指挥官拥有完全决策权
优先行动而非过度分析
记录决策以便后续复盘
咨询专家但不被意见阻碍

资源分配:

可调动任何必要的团队成员
有权升级至高层领导
可批准应急外部资源支出
决定沟通渠道与时间

技术决策:

依赖技术负责人处理实施细节
最终决定速度与风险的权衡
批准回滚与正向修复的策略选择
协调测试与验证方案

Communication Templates

沟通模板

Initial Incident Notification (SEV1/2)

初始事件通知（SEV1/2）

Subject: [SEV{severity}] {Service Name} - {Brief Description}

Incident Details:
- Start Time: {timestamp}
- Severity: SEV{level}
- Impact: {user impact description}
- Current Status: {investigating/mitigating/resolved}

Technical Details:
- Affected Services: {service list}
- Symptoms: {what users are experiencing}
- Initial Assessment: {suspected root cause if known}

Response Team:
- Incident Commander: {name}
- Technical Lead: {name}
- SMEs Engaged: {list}

Next Update: {timestamp}
Status Page: {link}
War Room: {bridge/chat link}

---
{Incident Commander Name}
{Contact Information}

主题: [SEV{severity}] {服务名称} - {简要描述}

事件详情:
- 开始时间: {timestamp}
- 严重程度: SEV{level}
- 影响范围: {用户影响描述}
- 当前状态: {调查中/缓解中/已解决}

技术细节:
- 受影响服务: {服务列表}
- 症状: {用户遇到的问题}
- 初步评估: {已知的疑似根因}

响应团队:
- 事件指挥官: {姓名}
- 技术负责人: {姓名}
- 参与的专家: {列表}

下次更新时间: {timestamp}
状态页面: {链接}
应急指挥室: {会议链接/聊天群链接}

---
{事件指挥官姓名}
{联系方式}
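
Filling the initial notification by hand is error-prone under pressure, so a small renderer can help. The sketch below covers only a subset of the template fields; the snake_case placeholder names, the `render_notification` function, and the sample values are hypothetical and not part of the skill.

```python
from datetime import datetime, timedelta

# Shortened version of the template above, with hypothetical placeholder names.
SUBJECT_TEMPLATE = "[SEV{level}] {service} - {summary}"
BODY_TEMPLATE = """\
Incident Details:
- Start Time: {start_time}
- Severity: SEV{level}
- Impact: {impact}
- Current Status: {status}

Next Update: {next_update}"""


def render_notification(level: int, service: str, summary: str, impact: str,
                        status: str, start_time: datetime, sent_at: datetime,
                        update_interval: timedelta) -> tuple[str, str]:
    """Return (subject, body) with the next update time computed from sent_at."""
    fields = {
        "level": level,
        "service": service,
        "summary": summary,
        "impact": impact,
        "status": status,
        "start_time": start_time.isoformat(timespec="minutes"),
        "next_update": (sent_at + update_interval).isoformat(timespec="minutes"),
    }
    return SUBJECT_TEMPLATE.format(**fields), BODY_TEMPLATE.format(**fields)


subject, body = render_notification(
    level=1,
    service="checkout-api",
    summary="Elevated 500 errors",
    impact="Most checkout attempts fail",
    status="investigating",
    start_time=datetime(2026, 2, 1, 9, 52),
    sent_at=datetime(2026, 2, 1, 10, 5),
    update_interval=timedelta(minutes=15),
)
```

Computing the next update time from the send time keeps the promised cadence (15 minutes for SEV1) explicit in every message.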

Executive Summary (SEV1)

管理层摘要（SEV1）

Subject: URGENT - Customer-Impacting Outage - {Service Name}

Executive Summary:
{2-3 sentence description of customer impact and business implications}

Key Metrics:
- Time to Detection: {X minutes}
- Time to Engagement: {X minutes} 
- Estimated Customer Impact: {number/percentage}
- Current Status: {status}
- ETA to Resolution: {time or "investigating"}

Leadership Actions Required:
- [ ] Customer communication approval
- [ ] PR/Communications coordination  
- [ ] Resource allocation decisions
- [ ] External vendor engagement

Incident Commander: {name} ({contact})
Next Update: {time}

---
This is an automated alert from our incident response system.

主题: 紧急通知 - 影响客户的服务中断 - {服务名称}

管理层摘要:
{2-3句话描述客户影响与业务影响}

关键指标:
- 检测时间: {X分钟}
- 响应时间: {X分钟} 
- 预估客户影响: {数量/百分比}
- 当前状态: {状态}
- 预计恢复时间: {时间或"调查中"}

需要管理层采取的行动:
- [ ] 客户沟通内容审批
- [ ] 公关/传播协调
- [ ] 资源分配决策
- [ ] 外部供应商对接

事件指挥官: {姓名} ({联系方式})
下次更新时间: {时间}

---
此为事件响应系统自动发送的告警通知。
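
The key metrics in the executive summary can be derived from three timestamps. The template does not define the metrics, so this sketch assumes Time to Detection runs from the start of impact to the first alert, and Time to Engagement from the first alert to the first responder acknowledgment. The variable names and sample times are hypothetical.

```python
from datetime import datetime


def minutes_between(start: datetime, end: datetime) -> int:
    """Whole minutes elapsed from `start` to `end`."""
    return int((end - start).total_seconds() // 60)


started_at = datetime(2026, 2, 1, 9, 52)   # first user-visible impact
detected_at = datetime(2026, 2, 1, 10, 0)  # first alert or report
engaged_at = datetime(2026, 2, 1, 10, 4)   # first responder acknowledged

# Assumed definitions; adjust to match your organization's own.
time_to_detection = minutes_between(started_at, detected_at)
time_to_engagement = minutes_between(detected_at, engaged_at)
```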

Customer Communication Template

客户沟通模板

We are currently experiencing {brief description of issue} affecting {scope of impact}. 

Our engineering team was alerted at {time} and is actively working to resolve the issue. We will provide updates every {frequency} until resolved.

What we know:
- {factual statement of impact}
- {factual statement of scope}
- {brief status of response}

What we're doing:
- {primary response action}
- {secondary response action}

Workaround (if available):
{workaround steps or "No workaround currently available"}

We apologize for the inconvenience and will share more information as it becomes available.

Next update: {time}
Status page: {link}

我们目前正在处理{问题简要描述}，该问题影响{影响范围}。

我们的工程团队已于{时间}收到告警并正在积极解决问题。我们会每隔{频率}更新一次进度，直至问题解决。

已知信息:
- {影响情况的事实陈述}
- {影响范围的事实陈述}
- {响应进度的简要说明}

我们正在采取的行动:
- {主要响应措施}
- {次要响应措施}

替代方案（如有）:
{替代步骤或"暂无可用替代方案"}

对于给您带来的不便我们深表歉意，我们会在获取更多信息后及时告知。

下次更新时间: {时间}
状态页面: {链接}

Stakeholder Management

利益相关方管理

Stakeholder Classification

利益相关方分类

Internal Stakeholders:

Engineering Leadership - Technical decisions and resource allocation
Product Management - Customer impact assessment and feature implications
Customer Support - User communication and support ticket management
Sales/Account Management - Customer relationship management for enterprise clients
Executive Team - Business impact decisions and external communication approval
Legal/Compliance - Regulatory reporting and liability assessment

External Stakeholders:

Customers - Service availability and impact communication
Partners - API availability and integration impacts
Vendors - Third-party service dependencies and support escalation
Regulators - Compliance reporting for regulated industries
Public/Media - Transparency for public-facing outages

内部利益相关方:

工程管理层 - 技术决策与资源分配
产品管理 - 客户影响评估与功能影响分析
客户支持 - 用户沟通与支持工单管理
销售/客户管理 - 企业客户关系维护
执行团队 - 业务影响决策与外部沟通审批
法务/合规 - 监管报告与责任评估

外部利益相关方:

客户 - 服务可用性与影响沟通
合作伙伴 - API可用性与集成影响
供应商 - 第三方服务依赖与支持升级
监管机构 - 受监管行业的合规报告
公众/媒体 - 面向公众的中断事件透明沟通

Communication Cadence by Stakeholder

按利益相关方划分的沟通频率

| Stakeholder | SEV1 | SEV2 | SEV3 | SEV4 |
|---|---|---|---|---|
| Engineering Leadership | Real-time | 30 min | 4 hrs | Daily |
| Executive Team | 15 min | 1 hr | End of day | Weekly |
| Customer Support | Real-time | 30 min | 2 hrs | As needed |
| Customers | 15 min | 1 hr | Optional | None |
| Partners | 30 min | 2 hrs | Optional | None |

| 利益相关方 | SEV1 | SEV2 | SEV3 | SEV4 |
|---|---|---|---|---|
| 工程管理层 | 实时 | 30分钟 | 4小时 | 每日 |
| 执行团队 | 15分钟 | 1小时 | 工作日结束 | 每周 |
| 客户支持 | 实时 | 30分钟 | 2小时 | 按需 |
| 客户 | 15分钟 | 1小时 | 可选 | 无 |
| 合作伙伴 | 30分钟 | 2小时 | 可选 | 无 |
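
The cadence table lends itself to a lookup that tells the Incident Commander when the next update is due. The sketch below transcribes the table; the `CADENCE` mapping and `next_update_due` function are hypothetical names. Reading "Daily" as 24 hours and "Weekly" as 7 days is an assumption, and entries without a fixed interval (real-time, end of day, optional, as needed, none) are kept as plain labels.

```python
from datetime import datetime, timedelta
from typing import Optional, Union

Cadence = Union[timedelta, str]

# Transcribed from the cadence table; labels without a fixed interval stay strings.
CADENCE: dict[str, dict[str, Cadence]] = {
    "Engineering Leadership": {"SEV1": "real-time", "SEV2": timedelta(minutes=30),
                               "SEV3": timedelta(hours=4), "SEV4": timedelta(days=1)},
    "Executive Team": {"SEV1": timedelta(minutes=15), "SEV2": timedelta(hours=1),
                       "SEV3": "end of day", "SEV4": timedelta(weeks=1)},
    "Customer Support": {"SEV1": "real-time", "SEV2": timedelta(minutes=30),
                         "SEV3": timedelta(hours=2), "SEV4": "as needed"},
    "Customers": {"SEV1": timedelta(minutes=15), "SEV2": timedelta(hours=1),
                  "SEV3": "optional", "SEV4": "none"},
    "Partners": {"SEV1": timedelta(minutes=30), "SEV2": timedelta(hours=2),
                 "SEV3": "optional", "SEV4": "none"},
}


def next_update_due(stakeholder: str, severity: str,
                    last_update: datetime) -> Optional[datetime]:
    """Return the next update deadline, or None when no fixed interval applies."""
    cadence = CADENCE[stakeholder][severity]
    if isinstance(cadence, timedelta):
        return last_update + cadence
    return None


last = datetime(2026, 2, 1, 10, 0)
customers_sev1 = next_update_due("Customers", "SEV1", last)
customers_sev4 = next_update_due("Customers", "SEV4", last)
```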

Runbook Generation Framework

运行手册生成框架

Dynamic Runbook Components

动态运行手册组件

Detection Playbooks
- Monitoring alert definitions
- Triage decision trees
- Escalation trigger points
- Initial response actions
Response Playbooks
- Step-by-step mitigation procedures
- Rollback instructions
- Validation checkpoints
- Communication checkpoints
Recovery Playbooks
- Service restoration procedures
- Data consistency checks
- Performance validation
- User notification processes

检测手册
- 监控告警定义
- 分流决策树
- 升级触发点
- 初始响应动作
响应手册
- 分步缓解流程
- 回滚说明
- 验证检查点
- 沟通检查点
恢复手册
- 服务恢复流程
- 数据一致性检查
- 性能验证
- 用户通知流程

Runbook Template Structure

运行手册模板结构

{Service/Component} Incident Response Runbook

{服务/组件}事件响应运行手册

Quick Reference

快速参考

Severity Indicators: {list of conditions for each severity level}
Key Contacts: {on-call rotations and escalation paths}
Critical Commands: {list of emergency commands with descriptions}

严重程度指标: {各严重等级的判定条件列表}
关键联系人: {值班轮换与升级路径}
紧急命令: {带描述的紧急命令列表}

Detection

检测

Monitoring Alerts

监控告警

{Alert name}: {description and thresholds}
{Alert name}: {description and thresholds}

{告警名称}: {描述与阈值}
{告警名称}: {描述与阈值}

Manual Detection Signs

人工检测迹象

{Symptom}: {what to look for and where}
{Symptom}: {what to look for and where}

{症状}: {检查内容与位置}
{症状}: {检查内容与位置}

Initial Response (0-15 minutes)

初始响应（0-15分钟）

Mitigation Strategies

缓解策略

Strategy 1: {Name}

策略1: {名称}

Use when: {conditions}

Steps:

{detailed step with commands}
{detailed step with expected outcomes}
{validation step}

Rollback Plan:

{rollback step}
{verification step}

适用场景: {条件}

步骤:

{带命令的详细步骤}
{带预期结果的详细步骤}
{验证步骤}

回滚方案:

{回滚步骤}
{验证步骤}

Strategy 2: {Name}

策略2: {名称}

{similar structure}

{类似结构}

Recovery and Validation

恢复与验证

Common Pitfalls

常见陷阱

{Pitfall}: {description and how to avoid}
{Pitfall}: {description and how to avoid}

{陷阱}: {描述与避免方法}
{陷阱}: {描述与避免方法}

Reference Information

参考信息

→ See references/reference-information.md for details

→ 详情请见references/reference-information.md

Usage Examples

使用示例

Example 1: Database Connection Pool Exhaustion

示例1: 数据库连接池耗尽

Classify the incident

分类事件

echo '{"description": "Users reporting 500 errors, database connections timing out", "affected_users": "80%", "business_impact": "high"}' | python scripts/incident_classifier.py

Reconstruct timeline from logs

从日志重建时间线

python scripts/timeline_reconstructor.py --input assets/db_incident_events.json --output timeline.md

Generate PIR after resolution

问题解决后生成PIR

python scripts/pir_generator.py --incident assets/db_incident_data.json --timeline timeline.md --output pir.md

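
Hand-writing JSON inside shell quotes breaks easily when the description itself contains quotes. As a hedged alternative, the payload for the classifier can be built in Python and then piped to the script. The field names below are taken from the example command above; the `build_classifier_payload` helper is hypothetical, and which other fields or values the script accepts is not confirmed here.

```python
import json


def build_classifier_payload(description: str, affected_users_pct: int,
                             business_impact: str) -> str:
    """Serialize classifier input as JSON, escaping quotes in free text."""
    return json.dumps({
        "description": description,
        "affected_users": f"{affected_users_pct}%",
        "business_impact": business_impact,
    })


payload = build_classifier_payload(
    "Users reporting 500 errors, database connections timing out", 80, "high"
)
quoted = build_classifier_payload('Checkout shows "try again later"', 30, "high")
```

Printing `payload` and piping the output to `python scripts/incident_classifier.py` should match the `echo` form of the command shown in this example.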

Example 2: API Rate Limiting Incident

示例2: API限流事件

Quick classification from stdin

从标准输入快速分类

echo "API rate limits causing customer API calls to fail" | python scripts/incident_classifier.py --format text

Build timeline from multiple sources

从多来源构建时间线

python scripts/timeline_reconstructor.py --input assets/api_incident_logs.json --detect-phases --gap-analysis

Generate comprehensive PIR

生成全面的PIR

python scripts/pir_generator.py --incident assets/api_incident_summary.json --rca-method fishbone --action-items

Best Practices

最佳实践

During Incident Response

事件响应期间

Maintain Calm Leadership
- Stay composed under pressure
- Make decisive calls with incomplete information
- Communicate confidence while acknowledging uncertainty
Document Everything
- All actions taken and their outcomes
- Decision rationale, especially for controversial calls
- Timeline of events as they happen
Effective Communication
- Use clear, jargon-free language
- Provide regular updates even when there's no new information
- Manage stakeholder expectations proactively
Technical Excellence
- Prefer rollbacks to risky fixes under pressure
- Validate fixes before declaring resolution
- Plan for secondary failures and cascading effects

保持冷静的领导力
- 压力下保持镇定
- 在信息不全时做出果断决策
- 沟通时保持自信，同时承认不确定性
记录所有内容
- 所有执行的动作及其结果
- 决策依据，尤其是有争议的决策
- 事件发生的时间线
有效沟通
- 使用清晰、无专业术语的语言
- 即使没有新信息也要定期更新
- 主动管理利益相关方的预期
技术卓越
- 压力下优先选择回滚而非风险修复
- 声明解决前先验证修复效果
- 规划应对二次故障与连锁反应

Post-Incident

事后阶段

Blameless Culture
- Focus on system failures, not individual mistakes
- Encourage honest reporting of what went wrong
- Celebrate learning and improvement opportunities
Action Item Discipline
- Assign specific owners and due dates
- Track progress publicly
- Prioritize based on risk and effort
Knowledge Sharing
- Share PIRs broadly within the organization
- Update runbooks based on lessons learned
- Conduct training sessions for common failure modes
Continuous Improvement
- Look for patterns across multiple incidents
- Invest in tooling and automation
- Regularly review and update processes

无责文化
- 关注系统故障而非个人错误
- 鼓励如实报告问题
- 重视学习与改进机会
改进项纪律
- 指定具体负责人与截止日期
- 公开跟踪进度
- 根据风险与投入优先级排序
知识共享
- 在组织内广泛分享PIR文档
- 根据经验教训更新运行手册
- 针对常见故障模式开展培训
持续改进
- 寻找多起事件中的共性模式
- 投资工具与自动化
- 定期回顾与更新流程
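
The action item discipline above (specific owners, due dates, prioritization by risk and effort) can be sketched as a simple ranking. The text does not prescribe a formula, so scoring each item by risk divided by effort is an assumption, as are the 1-5 scales, the `ActionItem` class, and the sample items.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    """Hypothetical follow-up item from a post-incident review."""
    title: str
    owner: str
    due: date
    risk: int    # 1 (low) to 5 (high): risk reduced by completing the item
    effort: int  # 1 (small) to 5 (large): estimated work required


def prioritize(items: list[ActionItem]) -> list[ActionItem]:
    """Order by risk-to-effort ratio (highest first), then by earliest due date."""
    return sorted(items, key=lambda item: (-item.risk / item.effort, item.due))


items = [
    ActionItem("Rewrite connection layer", "dana", date(2026, 4, 1), risk=4, effort=4),
    ActionItem("Add pool exhaustion alert", "lee", date(2026, 2, 20), risk=5, effort=1),
    ActionItem("Document rollback steps", "sam", date(2026, 2, 25), risk=3, effort=1),
]

ranked = prioritize(items)
```

A ratio like this favors cheap, high-impact fixes; teams that weigh risk more heavily than effort would want a different score.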

Integration with Existing Tools

与现有工具的集成

Monitoring and Alerting

监控与告警

PagerDuty/Opsgenie integration for escalation
Datadog/Grafana for metrics and dashboards
ELK/Splunk for log analysis and correlation

PagerDuty/Opsgenie集成用于升级通知
Datadog/Grafana用于指标与仪表盘
ELK/Splunk用于日志分析与关联

Communication Platforms

沟通平台

Slack/Teams for war room coordination
Zoom/Meet for video bridges
Status page providers (Statuspage.io, etc.)

Slack/Teams用于应急指挥室协作
Zoom/Meet用于视频会议
状态页面提供商（Statuspage.io等）

Documentation Systems

文档系统

Confluence/Notion for PIR storage
GitHub/GitLab for runbook version control
JIRA/Linear for action item tracking

Confluence/Notion用于存储PIR文档
GitHub/GitLab用于运行手册版本控制
JIRA/Linear用于跟踪改进项

Change Management

变更管理

CI/CD pipeline integration
Deployment tracking systems
Feature flag platforms for quick rollbacks

CI/CD流水线集成
部署跟踪系统
功能开关平台用于快速回滚

Conclusion

总结

The Incident Commander skill provides a comprehensive framework for managing incidents from detection through post-incident review. By implementing structured processes, clear communication templates, and thorough analysis tools, teams can improve their incident response capabilities and build more resilient systems.

The key to successful incident management is preparation, practice, and continuous learning. Use this framework as a starting point, but adapt it to your organization's specific needs, culture, and technical environment.

Remember: The goal isn't to prevent all incidents (which is impossible), but to detect them quickly, respond effectively, communicate clearly, and learn continuously.

Incident Commander技能提供了从事件检测到事后复盘的全流程管理框架。通过实施结构化流程、清晰的沟通模板和全面的分析工具，团队可以提升事件响应能力，构建更具韧性的系统。

成功事件管理的关键在于准备、实践与持续学习。将此框架作为起点，根据组织的特定需求、文化和技术环境进行调整。

请记住：目标不是预防所有事件（这不可能实现），而是快速检测、有效响应、清晰沟通并持续学习。