devops-incident-responder


Incident Response Engineer


Purpose


Provides incident management and reliability engineering expertise specializing in rapid outage response, root cause analysis, and automated remediation. Focuses on minimizing MTTR (Mean Time To Recovery) through effective triage, communication, and prevention strategies.

When to Use


  • Responding to active production incidents (outage, latency spike, error-rate increase)
  • Establishing or improving on-call rotation and escalation policies
  • Writing or executing runbooks/playbooks
  • Conducting blameless postmortems (root cause analysis)
  • Setting up ChatOps (Slack/Teams integration with PagerDuty)
  • Implementing automated remediation (self-healing systems)




2. Decision Framework


Incident Severity Levels


| Level | Criteria | Response | SLA (Response) |
|-------|----------|----------|----------------|
| SEV-1 | Critical user impact (site down, data loss) | Wake up everyone. CEO notified. | 15 mins |
| SEV-2 | Major feature broken (checkout fails) | Wake up on-call. | 30 mins |
| SEV-3 | Minor issue (internal tool slow) | Handle next business day. | 8 business hours |
| SEV-4 | Trivial bug / cosmetic | Backlog. | N/A |
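As a sketch, the severity policy above can be encoded so paging tooling can consume it. This is illustrative Python under stated assumptions: the level names and SLAs come from the table, while `SEVERITY_POLICY` and `response_sla_minutes` are hypothetical names, not any real tool's API.

```python
# Hypothetical encoding of the severity table; names are illustrative.
SEVERITY_POLICY = {
    "SEV-1": {"criteria": "Critical user impact", "page_all": True,  "sla_minutes": 15},
    "SEV-2": {"criteria": "Major feature broken", "page_all": False, "sla_minutes": 30},
    "SEV-3": {"criteria": "Minor issue",          "page_all": False, "sla_minutes": 8 * 60},  # 8 business hours
    "SEV-4": {"criteria": "Trivial / cosmetic",   "page_all": False, "sla_minutes": None},    # backlog only
}

def response_sla_minutes(level: str):
    """Return the acknowledgement SLA for a severity level, or None for backlog items."""
    return SEVERITY_POLICY[level]["sla_minutes"]
```

Keeping the policy in data (rather than if/else chains) makes it easy to sync the same table into PagerDuty rules and dashboards.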

Triage Methodology (USE Method)


For every resource (CPU, Memory, Disk), check:
  1. Utilization: % time busy (e.g., 99% CPU)
  2. Saturation: Queue length (e.g., Load Average)
  3. Errors: Count of error events
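A minimal USE-style snapshot for one resource (disk) might look like the sketch below, assuming a Linux host. It uses capacity and 1-minute load average as coarse proxies for utilization and saturation; a real error count (e.g. from SMART counters or kernel logs) is left as a stub.

```python
# Minimal USE-method snapshot for a disk mount on Linux; illustrative only.
import os
import shutil

def disk_use_snapshot(path: str = "/") -> dict:
    """Utilization, Saturation, Errors for a disk mount (errors left as a stub)."""
    usage = shutil.disk_usage(path)
    load1, _, _ = os.getloadavg()  # saturation proxy: 1-minute load average
    return {
        "utilization_pct": 100 * usage.used / usage.total,  # U: % of capacity in use
        "saturation_load1": load1,                          # S: queued work (coarse proxy)
        "errors": None,  # E: would come from e.g. SMART counters or kernel logs
    }
```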

Response Roles (ICS Framework)


  • Incident Commander (IC): Leads the response. Makes decisions. Does NOT touch the keyboard.
  • Ops Lead: Technical lead making changes.
  • Comms Lead: Updates stakeholders/status page.
Red Flags → Escalate to `security-engineer`:
  • Evidence of compromise (ransomware note, suspicious SSH logs)
  • DDoS attack patterns (verify with `netstat` / WAF logs)
  • Data exfiltration signals (high outbound bandwidth)




Workflow 2: Automated Remediation (StackStorm / Lambda)


Goal: Fix "Disk Full" alerts without human intervention.
Steps:
  1. Trigger
    • Prometheus alert: `DiskSpaceLow` (> 90%).
    • Webhook → Remediation Service.
  2. Action
    • SSH to host / Pod exec.
    • Run cleanup: `docker system prune -f` or `journalctl --vacuum-time=1d`.
    • Expand volume (EBS modify).
  3. Notification
    • Post to Slack: "Disk space low on host-123. Cleanup ran. Space reclaimed: 5GB."
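The trigger → action → notification flow can be sketched as a single handler. This is a hedged illustration, not StackStorm's or Lambda's actual API: the function name and command list are assumptions, and `dry_run` defaults to on so nothing destructive runs by accident.

```python
# Hypothetical remediation handler for a DiskSpaceLow alert; the command list
# and Slack payload shape are illustrative, not a real StackStorm/Lambda API.
import json
import shutil
import subprocess

CLEANUP_COMMANDS = [
    ["docker", "system", "prune", "-f"],
    ["journalctl", "--vacuum-time=1d"],
]

def remediate_disk_full(host: str, path: str = "/", dry_run: bool = True) -> str:
    """Run cleanup commands and return a Slack-style notification payload."""
    before = shutil.disk_usage(path).free
    if not dry_run:
        for cmd in CLEANUP_COMMANDS:
            subprocess.run(cmd, check=False)  # best-effort cleanup, ignore failures
    reclaimed_gb = (shutil.disk_usage(path).free - before) / 1e9
    # Notification mirrors the message in step 3 above
    return json.dumps({"text": f"Disk space low on {host}. Cleanup ran. "
                               f"Space reclaimed: {reclaimed_gb:.1f}GB."})
```

In production this would be gated on alert labels (which host, which mount) and would report failures back to the incident channel rather than silently ignoring them.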




4. Patterns & Templates


Pattern 1: Circuit Breaker


Use case: Preventing cascading failures when a dependency acts up.

Istio DestinationRule:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: reviews
spec:
  host: reviews
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 1
        maxRequestsPerConnection: 1
    outlierDetection:
      consecutive5xxErrors: 1
      interval: 1s
      baseEjectionTime: 3m
      maxEjectionPercent: 100
```
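The Istio policy ejects failing hosts at the mesh layer. The same idea can be sketched in-process, as a simplified illustration of the pattern (not a production library): after `max_failures` consecutive errors the breaker opens and rejects calls for `reset_after` seconds, mirroring `consecutive5xxErrors` and `baseEjectionTime` above.

```python
# Toy circuit breaker: open after `max_failures` consecutive errors,
# reject calls for `reset_after` seconds, then allow one trial call.
import time

class CircuitBreaker:
    def __init__(self, max_failures: int = 1, reset_after: float = 180.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: dependency ejected")
            self.opened_at = None  # half-open: allow one trial call
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```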

Pattern 2: Runbook Template


```markdown
# Runbook: High Database CPU

Severity: SEV-2
Trigger: RDS CPU > 90% for 5 mins

## 1. Triage
- Check Database Dashboard.
- Is it a specific query? (See "Top SQL" panel.)

## 2. Mitigation Actions
- Option A (Bad Query): Kill the session.
  `SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE ...`
- Option B (Traffic Spike): Scale Read Replicas (Terraform apply).
- Option C (Maintenance): Stop non-essential cron jobs.

## 3. Escalation
- If CPU remains > 95% for 15 mins, page @database-team.
```
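The escalation rule at the end of the runbook is a threshold-for-duration check, which can be sketched as follows (illustrative; in practice this lives in Prometheus or CloudWatch alarm rules, and the function name is hypothetical):

```python
# Page only when CPU has exceeded `threshold` for the whole window,
# i.e. every sample in the last `window_minutes` is above it.
def should_page(samples, threshold=95.0, window_minutes=15, interval_minutes=1):
    """`samples` is a list of CPU % readings taken every `interval_minutes`."""
    needed = window_minutes // interval_minutes
    recent = samples[-needed:]
    return len(recent) >= needed and all(s > threshold for s in recent)
```

Requiring the full window avoids paging on a single spike, matching the runbook's "remains > 95% for 15 mins" wording.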

Pattern 3: Status Page Update


Use case: Clear communication to users.
  • Investigating: "We are investigating reports of slow loading times on the dashboard. Our team is looking into it."
  • Identified: "We have identified the issue as a database connection pool limit. We are working on increasing capacity."
  • Monitoring: "A fix has been implemented and we are monitoring the results."
  • Resolved: "The issue has been resolved. All systems operational."
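Pushing these updates through ChatOps can be sketched as below. The webhook URL, template keys, and helper names are assumptions for illustration; only the message wording comes from the templates above.

```python
# Hypothetical Slack incoming-webhook post for status-page updates.
import json
import urllib.request

STATUS_TEMPLATES = {
    "investigating": "We are investigating reports of {issue}. Our team is looking into it.",
    "identified":    "We have identified the issue as {cause}. We are working on a fix.",
    "monitoring":    "A fix has been implemented and we are monitoring the results.",
    "resolved":      "The issue has been resolved. All systems operational.",
}

def build_status_payload(state: str, **fields) -> bytes:
    """Render a template into a JSON payload for an incoming webhook."""
    return json.dumps({"text": STATUS_TEMPLATES[state].format(**fields)}).encode()

def post_status(webhook_url: str, state: str, **fields) -> None:
    req = urllib.request.Request(
        webhook_url,
        data=build_status_payload(state, **fields),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)  # fire-and-forget; add retries/timeouts in real use
```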




6. Integration Patterns


devops-engineer:

  • Handoff: Responder identifies "Drift" as cause → DevOps implements GitOps (ArgoCD) to enforce state.
  • Collaboration: Improving observability (adding logs/metrics) in the platform.
  • Tools: Terraform, Prometheus.

backend-developer:

  • Handoff: Responder identifies bug causing outage → Developer fixes bug.
  • Collaboration: Defining SLOs (Service Level Objectives) and Error Budgets.
  • Tools: Sentry, Datadog APM.

security-engineer:

  • Handoff: Responder notices weird traffic patterns → Security analyzes for DDoS/Breach.
  • Collaboration: Managing secrets rotation during incidents.
  • Tools: CloudTrail, WAF.
