incident-response

Incident Response

Correlate incidents with deployments, assess blast radius, and generate postmortem documents using Harness MCP.

Instructions

Step 1: Establish Scope

Confirm the affected service, environment, and incident details.
Call MCP tool: harness_list
Parameters:
  resource_type: "service"
  org_id: "<organization>"
  project_id: "<project>"

Step 2: Identify the Incident Response Task

Determine which workflow the user needs:
  1. Deployment-to-Incident Correlation -- Determine if a recent deployment caused the incident
  2. Blast Radius Assessment -- Map affected services and downstream impact
  3. Postmortem Generation -- Create a structured postmortem document

Step 3: Correlate Deployment to Incident

Gather from the user:
  • Affected service name and environment
  • Alert or incident name and start time
  • Observed symptoms (error rate spike, latency, outage)
Pull recent deployments:
Call MCP tool: harness_list
Parameters:
  resource_type: "execution"
  org_id: "<organization>"
  project_id: "<project>"
  status: "Success"
For each recent deployment, check:
  • Timing: Was the deployment within the incident correlation window (e.g., 2 hours before)?
  • Service match: Does the deployed service match or depend on the affected service?
  • Change content: What changed in the deployment (config, code, infrastructure)?
Build a deployment timeline:
  1. List all deployments to the affected environment in the last N hours
  2. Mark the incident start time on the timeline
  3. Identify the most likely causal deployment (closest before incident start)
  4. Check if a rollback was performed and whether it resolved the issue
Present findings with a confidence level: HIGH (a deployment matches on both timing and service), MEDIUM (timing matches but the deployed service differs), or LOW (no deployment correlation found).
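The timing check, service match, and confidence levels above can be sketched in a few lines of Python. This is a minimal illustration, not part of Harness MCP: the `Deployment` record, its field names, and the `related_services` argument are assumptions, and real execution records returned by harness_list may be shaped differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Deployment:
    # Hypothetical record shape; real execution fields may differ.
    service: str
    finished_at: datetime


def correlate(deployments, affected_service, incident_start,
              window=timedelta(hours=2), related_services=frozenset()):
    """Return (confidence, most likely causal deployment or None)."""
    # Keep only deployments that finished inside the window before the incident.
    candidates = [
        d for d in deployments
        if incident_start - window <= d.finished_at <= incident_start
    ]
    if not candidates:
        return "LOW", None
    # Order so the deployment closest before the incident comes first.
    candidates.sort(key=lambda d: d.finished_at, reverse=True)
    matching = [
        d for d in candidates
        if d.service == affected_service or d.service in related_services
    ]
    if matching:
        return "HIGH", matching[0]
    return "MEDIUM", candidates[0]


# Made-up sample data for illustration only.
incident_start = datetime(2024, 1, 1, 12, 0)
deployments = [
    Deployment("payment", datetime(2024, 1, 1, 11, 30)),
    Deployment("search", datetime(2024, 1, 1, 11, 50)),
    Deployment("payment", datetime(2024, 1, 1, 8, 0)),
]
confidence, suspect = correlate(deployments, "payment", incident_start)
print(confidence, suspect)
```

Passing the services that depend on the affected service as `related_services` lets a dependency deployment count as a service match, which mirrors the "match or depend on" check.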

Step 4: Assess Blast Radius

Gather from the user:
  • Failing service and failure type (outage, elevated error rate, high latency)
  • Current error rate or severity
  • Environment
Map the impact:
Call MCP tool: harness_status
Parameters:
  org_id: "<organization>"
  project_id: "<project>"
Assess:
  • Direct impact: The failing service's error rate, latency, and availability
  • Upstream callers: Services that call the failing service -- are they degrading?
  • Downstream dependencies: Services the failing service depends on -- are they healthy?
  • User impact: Estimate affected users based on traffic volume
  • Data integrity: Any risk of data corruption or inconsistency?
Classify severity:
  • Critical: User-facing outage, data loss risk, or multiple services affected
  • Major: Degraded performance affecting users, single service impacted
  • Minor: Internal service degraded, no user-facing impact
Recommend immediate actions based on blast radius:
  • If deployment-correlated: recommend rollback with expected resolution time
  • If infrastructure-related: recommend failover or scaling
  • If dependency-related: recommend circuit breaker activation or graceful degradation
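The severity classes and recommended actions above map naturally onto a small decision function. The sketch below is one possible encoding; the function names, the boolean inputs, and the "Continue investigating" fallback are assumptions made for illustration and do not come from Harness.

```python
def classify_severity(user_facing_outage, data_loss_risk,
                      services_affected, users_degraded):
    """Map observed impact onto the Critical / Major / Minor classes."""
    # Critical: user-facing outage, data loss risk, or multiple services affected.
    if user_facing_outage or data_loss_risk or services_affected > 1:
        return "Critical"
    # Major: degraded performance affecting users, single service impacted.
    if users_degraded:
        return "Major"
    # Minor: internal service degraded, no user-facing impact.
    return "Minor"


RECOMMENDED_ACTIONS = {
    "deployment": "Roll back and state the expected resolution time",
    "infrastructure": "Fail over or scale",
    "dependency": "Activate a circuit breaker or degrade gracefully",
}


def recommend_action(cause):
    # The fallback text is an assumption for causes outside the three listed.
    return RECOMMENDED_ACTIONS.get(cause, "Continue investigating")


print(classify_severity(False, False, 1, True), "->", recommend_action("deployment"))
```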

Step 5: Generate Postmortem

Gather from the user:
  • Service name and incident summary
  • Incident duration and environment
  • Resolution steps taken
Structure the postmortem:
1. Executive Summary -- What happened, customer impact, duration (2-3 sentences)
2. Timeline -- Build from Harness pipeline events and alert timestamps:
  • When was the issue first detected?
  • When did the on-call team engage?
  • What deployment or change triggered the regression?
  • When was mitigation applied and service restored?
Pull timeline data:
Call MCP tool: harness_list
Parameters:
  resource_type: "execution"
  org_id: "<organization>"
  project_id: "<project>"
3. Root Cause Analysis -- Which deployment or change triggered the incident and why
4. Impact Assessment -- Affected services, environments, and approximate user impact
5. Action Items -- Categorized as:
  • Immediate fixes (address the root cause)
  • Process improvements (prevent recurrence)
  • Monitoring improvements (detect faster)
6. Lessons Learned -- What went well, what didn't, and what was lucky
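As a rough sketch of how the six sections could be assembled into a document skeleton, the following Python renders a plain-text outline. The function name, the `TODO` placeholders, and the sample values are assumptions for illustration; timeline entries are assumed to be (ISO 8601 timestamp, description) pairs in one consistent format so that string sorting yields chronological order.

```python
SECTIONS = [
    "Executive Summary",
    "Timeline",
    "Root Cause Analysis",
    "Impact Assessment",
    "Action Items",
    "Lessons Learned",
]
ACTION_CATEGORIES = [
    "Immediate fixes",
    "Process improvements",
    "Monitoring improvements",
]


def render_postmortem(service, summary, timeline):
    """Return a plain-text postmortem skeleton with the six sections."""
    lines = [f"Postmortem: {service}", ""]
    for number, title in enumerate(SECTIONS, start=1):
        lines.append(f"{number}. {title}")
        if title == "Executive Summary":
            lines.append(f"  {summary}")
        elif title == "Timeline":
            # Sort so events appear in chronological order.
            lines.extend(f"  • {ts} {event}" for ts, event in sorted(timeline))
        elif title == "Action Items":
            lines.extend(f"  • {category}: TODO" for category in ACTION_CATEGORIES)
        else:
            lines.append("  TODO")
    return "\n".join(lines)


# Made-up sample values for illustration only.
doc = render_postmortem(
    "auth-service",
    "Logins failed for 40 minutes in production.",
    [
        ("2024-01-01T12:10Z", "Rollback completed"),
        ("2024-01-01T11:30Z", "Deployment finished"),
    ],
)
print(doc)
```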

Examples

  • "Our payment service is down -- was it a deployment?" -- Correlate incident with recent deployments and provide confidence level
  • "What is the blast radius of the checkout outage?" -- Map upstream/downstream services and estimate user impact
  • "Generate a postmortem for yesterday's auth-service incident" -- Create structured postmortem with timeline, RCA, and action items
  • "A Sev-1 just fired -- which deployment caused it?" -- Pull recent deployments and correlate with alert timing

Performance Notes

  • Deployment correlation is most accurate within a 2-hour window -- beyond that, causes other than the deployment become more likely.
  • Blast radius assessment requires an up-to-date service dependency map -- stale maps miss connections.
  • Postmortems should be generated within 48 hours while the incident is fresh in the team's memory.
  • Always include what went well in the postmortem -- blameless culture requires acknowledging good responses.

Troubleshooting

No Deployment Found in Correlation Window

  • Expand the search window to 4-6 hours -- some failures have delayed onset
  • Check for infrastructure changes (not just code deployments)
  • Look for config changes, feature flag toggles, or certificate expirations
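One way to apply the wider search is to step through progressively larger windows and stop at the first one that contains any change. The sketch below assumes change timestamps have already been collected from deployments, config changes, flag toggles, and similar sources; the function name and the 2/4/6-hour steps are illustrative choices.

```python
from datetime import datetime, timedelta


def find_with_expanding_window(change_times, incident_start,
                               windows_hours=(2, 4, 6)):
    """Return (window_hours, changes) for the first window containing a change."""
    for hours in windows_hours:
        earliest = incident_start - timedelta(hours=hours)
        hits = [t for t in change_times if earliest <= t <= incident_start]
        if hits:
            return hours, sorted(hits)
    # Nothing found even in the widest window.
    return None, []


# Made-up sample data for illustration only.
incident_start = datetime(2024, 1, 1, 12, 0)
changes = [datetime(2024, 1, 1, 7, 15), datetime(2024, 1, 1, 2, 0)]
window, found = find_with_expanding_window(changes, incident_start)
print(window, found)
```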

Blast Radius Assessment Missing Services

  • The service dependency map may be incomplete -- check for undocumented dependencies
  • Look for shared infrastructure (databases, message queues) that multiple services use
  • Check for external dependencies (third-party APIs, DNS, CDN)
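Treating shared infrastructure and external dependencies as nodes in the dependency map makes these hidden connections visible. The following sketch walks a hand-written map breadth-first to find every service that directly or transitively depends on a failing component. The map contents and the function name are hypothetical, and a real map would have to come from your own service catalog.

```python
from collections import deque

# Hypothetical map: each service lists what it depends on, including
# shared infrastructure and third-party APIs.
DEPENDS_ON = {
    "checkout": ["payment", "orders-db"],
    "payment": ["payment-gateway-api", "orders-db"],
    "storefront": ["checkout"],
    "reporting": ["orders-db"],
}


def upstream_callers(failing, depends_on):
    """Return every service that directly or transitively depends on `failing`."""
    # Invert the edges: dependency -> set of services that call it.
    callers = {}
    for service, deps in depends_on.items():
        for dep in deps:
            callers.setdefault(dep, set()).add(service)
    affected, queue = set(), deque([failing])
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, ()):
            if caller not in affected:
                affected.add(caller)
                queue.append(caller)
    return affected


print(sorted(upstream_callers("orders-db", DEPENDS_ON)))
```

In this sample map, a failure in the shared `orders-db` reaches `storefront` even though `storefront` never lists the database directly, which is the kind of connection an incomplete map would miss.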

Postmortem Missing Timeline Events

  • Pull from multiple sources: pipeline executions, alert history, and chat transcripts
  • Check if automated rollbacks occurred that may not be in the deployment history
  • Include infrastructure events (auto-scaling, node failures) alongside deployment events
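Combining events from several sources into one ordered timeline can be sketched as below. The source names and the (timestamp, description) pair shape are assumptions; each real source would need its own adapter to produce that shape.

```python
from datetime import datetime


def merge_timeline(**sources):
    """Merge per-source (timestamp, description) lists into one sorted timeline."""
    events = [
        (timestamp, source, description)
        for source, entries in sources.items()
        for timestamp, description in entries
    ]
    # Sort by timestamp only, so ties keep their source order.
    return sorted(events, key=lambda event: event[0])


# Made-up sample data for illustration only.
timeline = merge_timeline(
    pipeline=[(datetime(2024, 1, 1, 11, 30), "Deployment finished")],
    alerts=[(datetime(2024, 1, 1, 11, 42), "Error-rate alert fired")],
    infrastructure=[(datetime(2024, 1, 1, 11, 35), "Node failure")],
)
for timestamp, source, description in timeline:
    print(timestamp.isoformat(), source, description)
```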