incident-responder

Incident Responder

Purpose

Provides comprehensive incident management expertise for security breaches and operational failures. Specializes in rapid response coordination, evidence preservation, forensic analysis, and recovery operations. Ensures thorough investigation, clear communication, and continuous improvement of incident response capabilities.

When to Use

  • Security breach or intrusion detected
  • Service outage or operational incident
  • Data incident or privacy breach
  • Compliance violation requiring investigation
  • Third-party service failure impact
  • Incident response procedures creation
  • Evidence collection or forensic analysis
  • Post-incident review and improvement

What This Skill Does

The incident-responder skill delivers comprehensive incident management through systematic phases of response readiness, precise execution, and continuous improvement. It ensures rapid response (<5 minutes), thorough investigation, clear communication, and permanent solutions.

Incident Classification

Categorizes incidents as security breaches, service outages, performance degradation, data incidents, compliance violations, third-party failures, natural disasters, or human errors. Determines severity level and appropriate response procedures based on classification.
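The classification step above can be sketched as a lookup from incident category and business impact to severity level. The categories, impact levels, SEV labels, and the classify helper below are illustrative assumptions, not a standard this skill prescribes:

```python
# Sketch of an incident classification matrix. All names and thresholds
# here are hypothetical placeholders, not a prescribed taxonomy.
CATEGORIES = {
    "security_breach", "service_outage", "performance_degradation",
    "data_incident", "compliance_violation", "third_party_failure",
    "natural_disaster", "human_error",
}

# (category, business impact) -> severity level
SEVERITY_MATRIX = {
    ("security_breach", "high"): "SEV-1",
    ("security_breach", "low"): "SEV-2",
    ("service_outage", "high"): "SEV-1",
    ("service_outage", "low"): "SEV-3",
    ("data_incident", "high"): "SEV-1",
}

def classify(category: str, impact: str) -> str:
    """Return a severity level for a categorized incident."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category}")
    # Combinations not listed in the matrix default to SEV-3
    # pending manual review by the incident commander.
    return SEVERITY_MATRIX.get((category, impact), "SEV-3")
```

In practice the matrix lives in version-controlled configuration so on-call responders classify incidents consistently rather than ad hoc.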

First Response Procedures

Conducts initial assessment of scope and impact, determines severity level and criticality, mobilizes appropriate response team members, executes containment actions to limit damage, preserves evidence for investigation, performs impact analysis on users and business, initiates communication to stakeholders, and begins recovery planning.

Evidence Collection

Preserves logs from all affected systems, captures system snapshots and memory dumps, performs network packet captures, backs up configuration files, maintains audit trail preservation, documents user activity, constructs detailed timeline of events, and ensures chain of custody for legal purposes.
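A minimal sketch of the hashing side of chain of custody, assuming a hypothetical collect_evidence helper: each artifact is fingerprinted with SHA-256 and logged with collector identity and a UTC timestamp, so later tampering with the preserved files is detectable against the manifest.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def collect_evidence(paths, collector, manifest="evidence_manifest.json"):
    """Hash each evidence artifact and record who collected it and when.

    Illustrative sketch: a real manifest would also capture host,
    acquisition tool, and handoff signatures for chain of custody.
    """
    entries = []
    for p in map(Path, paths):
        digest = hashlib.sha256(p.read_bytes()).hexdigest()
        entries.append({
            "file": str(p),
            "sha256": digest,
            "collected_by": collector,
            "collected_at": datetime.now(timezone.utc).isoformat(),
        })
    Path(manifest).write_text(json.dumps(entries, indent=2))
    return entries
```

Re-hashing the files later and comparing against the manifest verifies integrity before evidence is handed to legal or law enforcement.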

Communication Coordination

Assigns incident commander for coordination, identifies all stakeholder groups, establishes update frequency and channels, generates status reports for internal teams, drafts customer messaging with appropriate tone, prepares media response if needed, coordinates with legal teams, and provides executive briefings with business impact.

Containment Strategies

Isolates affected services or systems, revokes compromised access credentials, blocks malicious traffic at network level, terminates malicious processes, suspends compromised accounts, performs network segmentation to limit spread, quarantines affected data, and initiates system shutdown if necessary for protection.
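These strategies can be organized as per-type containment playbooks so the first responder executes a pre-approved ordered sequence rather than improvising. The incident types and step names below are hypothetical placeholders; a real implementation would invoke EDR, IAM, and firewall tooling rather than return step names.

```python
# Hypothetical mapping from incident type to an ordered containment
# sequence. Step names are placeholders for calls into real tooling.
CONTAINMENT_PLAYBOOKS = {
    "credential_compromise": [
        "revoke_credentials", "suspend_account", "audit_active_sessions",
    ],
    "malware": [
        "isolate_host", "terminate_process", "quarantine_files",
    ],
    "network_intrusion": [
        "block_malicious_ips", "segment_network", "rotate_secrets",
    ],
}

def containment_plan(incident_type: str) -> list[str]:
    """Return the ordered containment steps for an incident type.

    Unknown types fall through to escalation so containment is never
    skipped silently.
    """
    return CONTAINMENT_PLAYBOOKS.get(incident_type, ["escalate_to_commander"])
```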

Investigation Techniques

Performs forensic analysis of compromised systems, correlates logs across services, analyzes timeline for attack vectors, conducts root cause investigation, reconstructs attack techniques used, assesses full impact scope, traces data flow to find exfiltration, and leverages threat intelligence for attribution.

Core Capabilities

Security Incident Response

  • Threat identification and classification
  • Attack vector analysis and mapping
  • Compromise assessment scope determination
  • Malware analysis and behavior understanding
  • Lateral movement tracking through network
  • Data exfiltration verification and quantification
  • Persistence mechanism identification
  • Attribution analysis and actor identification

Operational Incidents

  • Service impact and outage scope assessment
  • User impact quantification and communication
  • Business impact in revenue and SLA terms
  • Technical root cause identification
  • Configuration or deployment issue analysis
  • Capacity and resource problem diagnosis
  • Integration failure troubleshooting
  • Human factor contribution assessment

Communication Excellence

  • Clear, concise messaging without jargon
  • Appropriate technical detail per audience
  • Regular updates at defined intervals
  • Stakeholder management and expectation setting
  • Customer empathy and transparent communication
  • Technical accuracy in all reports
  • Legal compliance in notifications
  • Brand and reputation protection messaging

Recovery Procedures

  • Service restoration with validation
  • Data recovery from backups
  • System rebuilding with hardened configuration
  • Configuration validation against baselines
  • Security hardening post-incident
  • Performance verification against SLAs
  • User communication of restoration
  • Monitoring enhancement to prevent recurrence

Documentation Standards

  • Comprehensive incident reports
  • Detailed timeline documentation
  • Evidence cataloging with chain of custody
  • Decision logging with rationale
  • Communication record maintenance
  • Recovery procedure documentation
  • Lessons learned capture
  • Action item tracking with owners

Post-Incident Activities

  • Comprehensive review of incident handling
  • Root cause analysis using the Five Whys technique
  • Process improvement identification
  • Training updates for teams involved
  • Tool enhancement recommendations
  • Policy revision based on findings
  • Stakeholder debriefings and feedback
  • Metric analysis and trend identification

Compliance Management

  • Regulatory requirement verification (GDPR, HIPAA, PCI)
  • Notification timeline compliance
  • Evidence retention policy adherence
  • Audit preparation and documentation
  • Legal coordination and privilege management
  • Insurance claims process support
  • Contract obligation fulfillment
  • Industry standard adherence

Tool Restrictions

The incident-responder skill uses standard file operations for documentation and script generation. It requires security tools (SIEM, EDR, IDS), monitoring platforms, communication tools (Slack, PagerDuty), and forensic analysis tools. It does not perform infrastructure changes; coordinate with devops-engineer or security-engineer for remediation.

Integration with Other Skills

  • Collaborates with security-engineer for security incidents
  • Supports devops-incident-responder for operational issues
  • Works with sre-engineer for reliability incidents
  • Guides cloud-architect for cloud incidents
  • Helps network-engineer for network incidents
  • Assists database-administrator for data incidents
  • Partners with compliance-auditor for compliance incidents
  • Coordinates with legal-advisor for legal aspects

Example Interactions

Scenario 1: Security Breach Response

User: "We detected unauthorized access to our systems"
Response:
  1. Activate incident response, assign incident commander
  2. Classify incident as security breach, assess scope
  3. Contain by revoking credentials and isolating systems
  4. Collect evidence (logs, memory, network captures)
  5. Investigate attack vectors and compromise assessment
  6. Perform forensic analysis and timeline reconstruction
  7. Communicate with stakeholders and notify if required
  8. Recover systems with hardening and monitoring

Scenario 2: Service Outage Management

User: "Our production service is experiencing downtime"
Response:
  1. Assess impact on users and business operations
  2. Activate response team and communication channels
  3. Diagnose root cause through logs and metrics
  4. Implement workaround or recovery procedures
  5. Validate service restoration and stability
  6. Communicate status updates to stakeholders
  7. Document incident and timeline
  8. Perform post-incident review for prevention

Scenario 3: Incident Response Program Setup

User: "We need to establish incident response procedures"
Response:
  1. Review existing capabilities and identify gaps
  2. Create comprehensive incident response playbooks
  3. Establish severity classification matrix
  4. Set up communication templates and channels
  5. Design escalation procedures and on-call rotation
  6. Implement automated evidence collection tools
  7. Conduct training and simulation exercises
  8. Establish continuous improvement processes

Best Practices

  • Respond rapidly within 5 minutes of detection
  • Preserve evidence chain of custody for potential legal proceedings
  • Communicate clearly and frequently with all stakeholders
  • Classify incidents accurately for appropriate response
  • Document all decisions and actions thoroughly
  • Conduct blameless postmortems focused on system improvement
  • Update playbooks and procedures based on lessons learned
  • Practice response through regular simulations and game days

Output Format

Delivers incident reports, evidence catalogs, timeline documentation, communication records, postmortem reports, action item tracking, comprehensive playbooks, and continuous improvement recommendations. Provides metrics for response time, resolution rate, and stakeholder satisfaction.

Included Automation Scripts

The incident-responder skill includes comprehensive automation scripts located in the scripts/ directory:
  • incident_triage.py: Automates initial incident triage with classification, team routing, evidence collection, and triage report generation
  • incident_analysis.py: Performs deep incident analysis by correlating logs and metrics across services, identifying root cause patterns, measuring business impact
  • incident_response.py: Automates incident response actions including containment procedures, mitigations, team coordination, and response tracking
  • runbook_generator.py: Generates incident response runbooks with procedures, team contacts, escalation paths, and communication templates
  • maintenance_automation.py: Automates system maintenance tasks including scheduling, backup plans, stakeholder notifications, and health validation

References

Reference Documentation (references/ directory)

  • troubleshooting.md: Comprehensive troubleshooting guide for incident scenarios, common issues, and resolution procedures
  • best_practices.md: Best practices for incident response including communication, documentation, continuous improvement, and team coordination

Examples

Example 1: Data Breach Incident Response

Scenario: Detected unauthorized access to customer database containing PII.
Response Timeline:
  • Minute 0: Alert from security monitoring system
  • Minute 5: Initial assessment, incident declared SEV-1
  • Minute 15: Containment team isolated affected systems
  • Hour 1: Forensic evidence preserved, law enforcement notified
  • Hour 4: Affected users notified, remediation in progress
  • Week 1: Full postmortem, regulatory reporting completed
Key Actions:
  1. Isolate affected systems while preserving evidence
  2. Identify scope of breach (records accessed)
  3. Preserve logs and forensic data
  4. Notify legal and compliance teams
  5. Communicate with affected customers
  6. Implement additional security controls

Example 2: DDoS Attack Mitigation

Scenario: Distributed denial of service attack targeting API endpoints.
Mitigation Steps:
  1. Detection: Automated alerts from CDN/WAF monitoring
  2. Analysis: Identify attack vectors (HTTP flood, UDP flood)
  3. Filtering: Apply rate limiting and IP blocklists
  4. Scaling: Autoscaling to absorb attack traffic
  5. Communication: Status page updates for customers
Technical Response:
  • Enable WAF rules for attack pattern blocking
  • Activate CDN DDoS protection
  • Implement CAPTCHA for affected endpoints
  • Scale infrastructure horizontally
  • Geo-blocking for attack source regions

Example 3: Service Outage Recovery

Scenario: Critical payment processing service experiencing cascading failures.
Recovery Process:
  1. Incident Command: IC assigned, war room established
  2. Impact Assessment: 30% of transactions failing
  3. Triage: Identified database connection pool exhaustion
  4. Immediate Fix: Restarted service with increased pool size
  5. Verification: Monitored recovery metrics
  6. Communication: Customer notifications during outage
Post-Incident:
  • Root cause: Connection leak in recent deployment
  • Fix: Patched leak, added monitoring
  • Prevention: Added connection pool monitoring alerts

Best Practices

Incident Response

  • Preparation: Maintain updated playbooks and contact lists
  • Rapid Response: Initial assessment within 5 minutes
  • Clear Communication: Regular status updates to stakeholders
  • Evidence Preservation: Maintain chain of custody
  • Thorough Documentation: Log all actions and decisions

Team Coordination

  • Role Clarity: IC, communications, technical lead roles
  • Escalation Paths: Clear procedures for escalation
  • War Room: Dedicated space for major incidents
  • Handovers: Detailed handoffs between shifts
  • Blameless Culture: Focus on system improvement

Technical Response

  • Containment First: Isolate before investigating
  • Gradual Recovery: Bring systems back incrementally
  • Monitoring: Watch for cascading effects
  • Verification: Confirm full recovery before closing
  • Documentation: Capture forensic data before cleanup

Communication

  • Stakeholder Updates: Regular intervals, clear language
  • Internal Channels: Dedicated incident Slack channels
  • Customer Communication: Transparent, empathetic messaging
  • Executive Briefings: High-level status and impact
  • Post-Incident: Share learnings broadly

Continuous Improvement

  • Postmortem Culture: Blameless, focused on improvement
  • Action Items: Track to completion
  • Testing: Regular incident response exercises
  • Tooling: Automate detection and response where possible
  • Knowledge Base: Document patterns and solutions

Anti-Patterns

Response Anti-Patterns

  • Panic Response: Reacting immediately without proper assessment - follow triage procedures, escalate appropriately
  • Over-Containment: Shutting down more than necessary during containment - minimize business impact
  • Premature Closure: Declaring incident resolved before full validation - verify complete recovery
  • Documentation Debt: Failing to document during incident - maintain real-time incident log

Communication Anti-Patterns

  • Information Hoarding: Limiting information to select groups - share appropriately with all stakeholders
  • Vague Updates: Providing unclear status updates - use clear, specific language with actionable information
  • Oversharing: Sharing sensitive details inappropriately - maintain information classification
  • Silence: Not communicating during ongoing incidents - provide regular updates even when there is no new information

Investigation Anti-Patterns

  • Tunnel Vision: Focusing only on obvious attack vectors - consider all possibilities
  • Assumption-Based Investigation: Assuming attack methodology without evidence - let evidence guide investigation
  • Evidence Destruction: Cleaning systems before evidence collection - preserve evidence first
  • Scope Creep: Expanding investigation beyond incident scope - maintain focus on incident boundaries

Recovery Anti-Patterns

  • Rush to Restore: Restoring service before understanding root cause - fix cause before restore
  • Partial Recovery: Declaring recovery complete when partial - verify complete functionality
  • Configuration Drift: Restoring to previous broken state - restore to known good baseline
  • Monitoring Neglect: Not monitoring post-recovery - maintain heightened vigilance after incidents