incident-response

Incident Response

Correlate incidents with deployments, assess blast radius, and generate postmortem documents using Harness MCP.

Instructions

Step 1: Establish Scope

Confirm the affected service, environment, and incident details.

Call MCP tool: harness_list
Parameters:
  resource_type: "service"
  org_id: "<organization>"
  project_id: "<project>"
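
As a rough illustration of what the call above looks like on the wire, the sketch below builds a JSON-RPC `tools/call` request of the kind the Model Context Protocol defines. The helper name `build_tool_call` is hypothetical, and the argument names are copied from the parameters listed above; how a given client actually sends the request is not covered here.

```python
import json


def build_tool_call(name, arguments, request_id=1):
    """Build an MCP `tools/call` JSON-RPC request as a plain dict (hypothetical helper)."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


# Placeholders are kept exactly as in the step above; fill them in per project.
request = build_tool_call(
    "harness_list",
    {
        "resource_type": "service",
        "org_id": "<organization>",
        "project_id": "<project>",
    },
)
print(json.dumps(request, indent=2))
```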

Step 2: Identify the Incident Response Task

Determine which workflow the user needs:

Deployment-to-Incident Correlation -- Determine whether a recent deployment caused the incident
Blast Radius Assessment -- Map affected services and downstream impact
Postmortem Generation -- Create a structured postmortem document

Step 3: Correlate Deployment to Incident

Gather from the user:

Affected service name and environment
Alert or incident name and start time
Observed symptoms (error rate spike, latency, outage)

Pull recent deployments:

Call MCP tool: harness_list
Parameters:
  resource_type: "execution"
  org_id: "<organization>"
  project_id: "<project>"
  status: "Success"

For each recent deployment, check:

Timing: Did the deployment finish within the correlation window before the incident started (e.g., the preceding 2 hours)?
Service match: Does the deployed service match or depend on the affected service?
Change content: What changed in the deployment (config, code, infrastructure)?

Build a deployment timeline:

List all deployments to the affected environment in the last N hours
Mark the incident start time on the timeline
Identify the most likely causal deployment (closest before incident start)
Check if a rollback was performed and whether it resolved the issue
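
The timeline steps above can be sketched as follows. This is a minimal example, assuming each deployment has already been reduced to a dict with hypothetical `id` and `finished_at` fields; the actual shape of the execution records returned by harness_list may differ, and the sample data is invented for illustration.

```python
from datetime import datetime, timedelta


def build_timeline(deployments, incident_start, lookback_hours):
    """Return deployments from the last N hours before the incident onward, oldest first."""
    window_start = incident_start - timedelta(hours=lookback_hours)
    recent = [d for d in deployments if d["finished_at"] >= window_start]
    return sorted(recent, key=lambda d: d["finished_at"])


def likely_cause(timeline, incident_start):
    """Pick the deployment closest before the incident start, or None if there is none."""
    before = [d for d in timeline if d["finished_at"] <= incident_start]
    return before[-1] if before else None


# Invented sample data; field names are assumptions, not the Harness schema.
incident_start = datetime(2024, 5, 1, 14, 0)
deployments = [
    {"id": "exec-a", "finished_at": datetime(2024, 5, 1, 9, 0)},
    {"id": "exec-b", "finished_at": datetime(2024, 5, 1, 13, 40)},
    {"id": "exec-c", "finished_at": datetime(2024, 5, 1, 12, 15)},
    {"id": "exec-d", "finished_at": datetime(2024, 5, 1, 14, 25)},  # e.g. a rollback
]

timeline = build_timeline(deployments, incident_start, lookback_hours=4)
candidate = likely_cause(timeline, incident_start)
```

Deployments after the incident start stay on the timeline so that a rollback remains visible, but they are excluded when picking the candidate cause.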

Present findings with confidence level: HIGH (deployment matches timing + service), MEDIUM (timing matches but different service), LOW (no deployment correlation found).
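
One way to make the confidence levels reproducible is a small function like the one below. It is a sketch under stated assumptions: the function name, the `related_services` argument (services known to depend on the affected service), and the default 2-hour window are illustrative choices, not part of Harness MCP.

```python
from datetime import datetime, timedelta


def correlation_confidence(
    deploy_time,
    deployed_service,
    incident_start,
    affected_service,
    related_services=frozenset(),
    window=timedelta(hours=2),
):
    """Map timing and service match onto the HIGH / MEDIUM / LOW levels."""
    gap = incident_start - deploy_time
    in_window = timedelta(0) <= gap <= window
    if not in_window:
        return "LOW"  # no deployment inside the correlation window
    if deployed_service == affected_service or deployed_service in related_services:
        return "HIGH"  # timing and service both match
    return "MEDIUM"  # timing matches, but a different service was deployed


incident = datetime(2024, 5, 1, 14, 0)
same_service = correlation_confidence(datetime(2024, 5, 1, 13, 40), "payment", incident, "payment")
other_service = correlation_confidence(datetime(2024, 5, 1, 13, 40), "search", incident, "payment")
too_old = correlation_confidence(datetime(2024, 5, 1, 9, 0), "payment", incident, "payment")
```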

Step 4: Assess Blast Radius

Gather from the user:

Failing service and failure type (outage, elevated error rate, high latency)
Current error rate or severity
Environment

Map the impact:

Call MCP tool: harness_status
Parameters:
  org_id: "<organization>"
  project_id: "<project>"

Assess:

Direct impact: The failing service's error rate, latency, and availability
Upstream callers: Services that call the failing service -- are they degrading?
Downstream dependencies: Services the failing service depends on -- are they healthy?
User impact: Estimate affected users based on traffic volume
Data integrity: Any risk of data corruption or inconsistency?
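
Finding upstream callers amounts to walking a service dependency map backwards from the failing service. The sketch below assumes the map is available as a plain dict from each service to the services it calls; that structure and the service names are hypothetical, since the format of the dependency data is not specified here.

```python
from collections import deque


def upstream_callers(calls, failing_service):
    """Return every service that directly or transitively calls the failing service."""
    # Invert the map: callee -> set of direct callers.
    callers_of = {}
    for caller, callees in calls.items():
        for callee in callees:
            callers_of.setdefault(callee, set()).add(caller)

    # Breadth-first walk up the caller graph.
    seen = set()
    queue = deque([failing_service])
    while queue:
        current = queue.popleft()
        for caller in callers_of.get(current, ()):
            if caller not in seen and caller != failing_service:
                seen.add(caller)
                queue.append(caller)
    return seen


# Invented dependency map for illustration.
calls = {
    "web": ["checkout"],
    "checkout": ["payment"],
    "payment": ["ledger"],
    "reports": ["ledger"],
}
impacted = upstream_callers(calls, "payment")
```

Running the same walk on the non-inverted map gives the downstream dependencies to health-check.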

Classify severity:

Critical: User-facing outage, data loss risk, or multiple services affected
Major: Degraded performance affecting users, single service impacted
Minor: Internal service degraded, no user-facing impact
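
The severity rules above can be written down as a small decision function so that the classification is applied consistently. The argument names below are assumptions chosen for this sketch.

```python
def classify_severity(user_facing, outage, data_loss_risk, affected_services):
    """Apply the Critical / Major / Minor rules in order of severity."""
    if (user_facing and outage) or data_loss_risk or affected_services > 1:
        return "Critical"
    if user_facing:
        return "Major"  # degraded but not down, single service
    return "Minor"  # internal only, no user-facing impact


checkout_down = classify_severity(user_facing=True, outage=True, data_loss_risk=False, affected_services=1)
slow_search = classify_severity(user_facing=True, outage=False, data_loss_risk=False, affected_services=1)
batch_job = classify_severity(user_facing=False, outage=False, data_loss_risk=False, affected_services=1)
```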

Recommend immediate actions based on blast radius:

If deployment-correlated: recommend rollback with expected resolution time
If infrastructure-related: recommend failover or scaling
If dependency-related: recommend circuit breaker activation or graceful degradation

Step 5: Generate Postmortem

Gather from the user:

Service name and incident summary
Incident duration and environment
Resolution steps taken

Structure the postmortem:

1. Executive Summary -- What happened, customer impact, duration (2-3 sentences)

2. Timeline -- Build from Harness pipeline events and alert timestamps:

When was the issue first detected?
When did the on-call team engage?
What deployment or change triggered the regression?
When was mitigation applied and service restored?

Pull timeline data:

Call MCP tool: harness_list
Parameters:
  resource_type: "execution"
  org_id: "<organization>"
  project_id: "<project>"

3. Root Cause Analysis -- Which deployment or change triggered the incident and why

4. Impact Assessment -- Affected services, environments, and approximate user impact

5. Action Items -- Categorized as:

Immediate fixes (address the root cause)
Process improvements (prevent recurrence)
Monitoring improvements (detect faster)

6. Lessons Learned -- What went well, what went poorly, and where the team got lucky
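
A plain-text skeleton for the six sections can be generated with a few lines of code, leaving unfilled sections marked for follow-up. This is only a sketch; the `render_postmortem` helper and the "TBD" placeholder are illustrative choices.

```python
SECTIONS = [
    "Executive Summary",
    "Timeline",
    "Root Cause Analysis",
    "Impact Assessment",
    "Action Items",
    "Lessons Learned",
]


def render_postmortem(title, content):
    """Render the six sections in order, using 'TBD' for anything not yet written."""
    lines = [f"Postmortem: {title}", ""]
    for number, section in enumerate(SECTIONS, start=1):
        lines.append(f"{number}. {section}")
        lines.append(content.get(section, "TBD"))
        lines.append("")
    return "\n".join(lines)


draft = render_postmortem(
    "auth-service incident",
    {"Executive Summary": "Summary goes here (2-3 sentences)."},
)
print(draft)
```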

Examples

"Our payment service is down -- was it a deployment?" -- Correlate incident with recent deployments and provide confidence level
"What is the blast radius of the checkout outage?" -- Map upstream/downstream services and estimate user impact
"Generate a postmortem for yesterday's auth-service incident" -- Create structured postmortem with timeline, RCA, and action items
"A Sev-1 just fired -- which deployment caused it?" -- Pull recent deployments and correlate with alert timing

Performance Notes

Deployment correlation is most accurate within a 2-hour window -- beyond that, causes other than the deployment become more likely.
Blast radius assessment requires an up-to-date service dependency map -- stale maps miss connections.
Postmortems should be generated within 48 hours while the incident is fresh in the team's memory.
Always include what went well in the postmortem -- blameless culture requires acknowledging good responses.

Troubleshooting

No Deployment Found in Correlation Window

Expand the search window to 4-6 hours -- some failures have delayed onset
Check for infrastructure changes (not just code deployments)
Look for config changes, feature flag toggles, or certificate expirations
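
Widening the search can be done step by step, so that the narrowest window containing any change is reported. The sketch below assumes all change types (deployments, config changes, flag toggles) have been collected into one list of dicts with a hypothetical `at` timestamp; the window sizes are illustrative.

```python
from datetime import datetime, timedelta


def find_with_widening(changes, incident_start, windows_hours=(2, 4, 6)):
    """Try each window in turn; return (hours, matching changes) for the first hit."""
    for hours in windows_hours:
        start = incident_start - timedelta(hours=hours)
        hits = [c for c in changes if start <= c["at"] <= incident_start]
        if hits:
            return hours, hits
    return None, []


# Invented sample: a feature flag toggle 4.5 hours before the incident.
incident_start = datetime(2024, 5, 1, 14, 0)
changes = [{"id": "flag-toggle", "at": datetime(2024, 5, 1, 9, 30)}]
window_used, found = find_with_widening(changes, incident_start)
```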

Blast Radius Assessment Missing Services

The service dependency map may be incomplete -- check for undocumented dependencies
Look for shared infrastructure (databases, message queues) that multiple services use
Check for external dependencies (third-party APIs, DNS, CDN)

Postmortem Missing Timeline Events

Pull from multiple sources: pipeline executions, alert history, and chat transcripts
Check if automated rollbacks occurred that may not be in the deployment history
Include infrastructure events (auto-scaling, node failures) alongside deployment events
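
Combining events from several sources into one ordered timeline is a simple merge by timestamp. The sketch below assumes each source has been reduced to a list of `(timestamp, description)` pairs; the source names and sample events are invented for illustration.

```python
from datetime import datetime


def merge_timeline(**sources):
    """Merge (timestamp, description) pairs from named sources into one ordered list."""
    events = []
    for source_name, entries in sources.items():
        for at, description in entries:
            events.append((at, source_name, description))
    # Sort by timestamp only; events with equal timestamps keep their input order.
    return sorted(events, key=lambda event: event[0])


merged = merge_timeline(
    pipeline=[(datetime(2024, 5, 1, 13, 40), "deployment finished")],
    alerts=[(datetime(2024, 5, 1, 14, 0), "error rate alert fired")],
    chat=[(datetime(2024, 5, 1, 14, 5), "on-call acknowledged")],
    infra=[(datetime(2024, 5, 1, 13, 55), "node failure")],
)
for at, source, description in merged:
    print(at.isoformat(), source, description)
```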
