incident-response
Incident Response
Correlate incidents with deployments, assess blast radius, and generate postmortem documents using Harness MCP.
Instructions
Step 1: Establish Scope
Confirm the affected service, environment, and incident details.
Call MCP tool: harness_list
Parameters:
resource_type: "service"
org_id: "<organization>"
project_id: "<project>"
Step 2: Identify the Incident Response Task
Determine which workflow the user needs:
- Deployment-to-Incident Correlation -- Determine if a recent deployment caused the incident
- Blast Radius Assessment -- Map affected services and downstream impact
- Postmortem Generation -- Create a structured postmortem document
Step 3: Correlate Deployment to Incident
Gather from the user:
- Affected service name and environment
- Alert or incident name and start time
- Observed symptoms (error rate spike, latency, outage)
Pull recent deployments:
Call MCP tool: harness_list
Parameters:
resource_type: "execution"
org_id: "<organization>"
project_id: "<project>"
status: "Success"
For each recent deployment, check:
- Timing: Did the deployment finish within the incident correlation window (e.g., the 2 hours before the incident started)?
- Service match: Does the deployed service match or depend on the affected service?
- Change content: What changed in the deployment (config, code, infrastructure)?
Build a deployment timeline:
- List all deployments to the affected environment in the last N hours
- Mark the incident start time on the timeline
- Identify the most likely causal deployment (closest before incident start)
- Check if a rollback was performed and whether it resolved the issue
Present findings with a confidence level: HIGH (a deployment matches on both timing and service), MEDIUM (timing matches but the deployed service is different), LOW (no correlated deployment found).
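The timing and service checks above can be sketched as a small function. This is a minimal illustration and not part of Harness MCP: the `Deployment` record, the `correlate` function, and the `related_services` parameter are hypothetical names, and the sketch assumes deployment finish times have already been extracted from the execution list.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Deployment:
    # Hypothetical record built from execution data; not a Harness type.
    service: str
    finished_at: datetime


def correlate(deployments, affected_service, incident_start,
              window=timedelta(hours=2), related_services=()):
    """Return (confidence, deployment) for the most likely causal deployment."""
    # Keep only deployments that finished inside the window before the incident.
    in_window = [
        d for d in deployments
        if timedelta(0) <= incident_start - d.finished_at <= window
    ]
    if not in_window:
        return "LOW", None
    # Services that count as a match: the affected one plus any the caller
    # says are related to it through a dependency.
    related = {affected_service} | set(related_services)
    matching = [d for d in in_window if d.service in related]
    # Prefer service matches; otherwise fall back to timing-only candidates.
    pool = matching or in_window
    closest = max(pool, key=lambda d: d.finished_at)
    return ("HIGH" if matching else "MEDIUM"), closest
```

The closest deployment before the incident start is returned as the candidate, which mirrors the timeline rule above. It is a starting point for investigation, not proof of cause.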
Step 4: Assess Blast Radius
Gather from the user:
- Failing service and failure type (outage, elevated error rate, high latency)
- Current error rate or severity
- Environment
Map the impact:
Call MCP tool: harness_status
Parameters:
org_id: "<organization>"
project_id: "<project>"
Assess:
- Direct impact: The failing service's error rate, latency, and availability
- Upstream callers: Services that call the failing service -- are they degrading?
- Downstream dependencies: Services the failing service depends on -- are they healthy?
- User impact: Estimate affected users based on traffic volume
- Data integrity: Any risk of data corruption or inconsistency?
Classify severity:
- Critical: User-facing outage, data loss risk, or multiple services affected
- Major: Degraded performance affecting users, single service impacted
- Minor: Internal service degraded, no user-facing impact
Recommend immediate actions based on blast radius:
- If deployment-correlated: recommend rollback with expected resolution time
- If infrastructure-related: recommend failover or scaling
- If dependency-related: recommend circuit breaker activation or graceful degradation
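As a rough sketch of the assessment above, the upstream callers can be found by walking a service call graph in reverse, and the severity classes can be written as a simple rule. Both functions are hypothetical helpers. The `calls` map is an assumed input that you would build from your own dependency data, and the severity rule is one possible reading of the three classes.

```python
from collections import deque


def upstream_callers(calls, failing):
    """Return every service that reaches `failing` directly or transitively.

    `calls` maps a service name to the services it calls (assumed input).
    """
    # Invert the graph: for each service, who calls it?
    callers_of = {}
    for source, targets in calls.items():
        for target in targets:
            callers_of.setdefault(target, set()).add(source)
    seen, queue = set(), deque([failing])
    while queue:
        current = queue.popleft()
        for caller in callers_of.get(current, ()):
            if caller not in seen and caller != failing:
                seen.add(caller)
                queue.append(caller)
    return seen


def classify_severity(user_facing, data_loss_risk, services_affected, outage):
    """Map the assessment to Critical / Major / Minor."""
    if data_loss_risk or services_affected > 1 or (user_facing and outage):
        return "Critical"
    if user_facing:
        return "Major"
    return "Minor"
```

For example, with `calls = {"web": ["checkout"], "checkout": ["payments"]}`, a failure in `payments` reports both `checkout` and `web` as upstream callers to check for degradation.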
Step 5: Generate Postmortem
Gather from the user:
- Service name and incident summary
- Incident duration and environment
- Resolution steps taken
Structure the postmortem:
1. Executive Summary -- What happened, customer impact, duration (2-3 sentences)
2. Timeline -- Build from Harness pipeline events and alert timestamps:
- When was the issue first detected?
- When did the on-call team engage?
- What deployment or change triggered the regression?
- When was mitigation applied and service restored?
Pull timeline data:
Call MCP tool: harness_list
Parameters:
resource_type: "execution"
org_id: "<organization>"
project_id: "<project>"
3. Root Cause Analysis -- Which deployment or change triggered the incident and why
4. Impact Assessment -- Affected services, environments, and approximate user impact
5. Action Items -- Categorized as:
- Immediate fixes (address the root cause)
- Process improvements (prevent recurrence)
- Monitoring improvements (detect faster)
6. Lessons Learned -- What went well, what didn't, and where the team got lucky
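The six-part structure above can be turned into a plain-text skeleton that marks missing sections for follow-up. This is only a sketch: `render_postmortem` and the `TODO` placeholder are illustrative choices, not output produced by Harness.

```python
SECTIONS = [
    "Executive Summary",
    "Timeline",
    "Root Cause Analysis",
    "Impact Assessment",
    "Action Items",
    "Lessons Learned",
]


def render_postmortem(title, content):
    """Render a plain-text postmortem; `content` maps section name to a list of lines."""
    lines = [title, ""]
    for number, name in enumerate(SECTIONS, start=1):
        lines.append(f"{number}. {name}")
        # Sections with no content yet get a visible placeholder.
        body = content.get(name) or ["TODO"]
        lines.extend(f"   - {item}" for item in body)
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"
```

Calling `render_postmortem("auth-service incident", {"Timeline": ["10:02 alert fired"]})` fills in the timeline and leaves the other five sections marked `TODO`, which makes gaps obvious before the document is circulated.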
Examples
- "Our payment service is down -- was it a deployment?" -- Correlate incident with recent deployments and provide confidence level
- "What is the blast radius of the checkout outage?" -- Map upstream/downstream services and estimate user impact
- "Generate a postmortem for yesterday's auth-service incident" -- Create structured postmortem with timeline, RCA, and action items
- "A Sev-1 just fired -- which deployment caused it?" -- Pull recent deployments and correlate with alert timing
Performance Notes
- Deployment correlation is most accurate within a 2-hour window -- beyond that, causes other than the deployment become more likely.
- Blast radius assessment requires an up-to-date service dependency map -- stale maps miss connections.
- Postmortems should be generated within 48 hours while the incident is fresh in the team's memory.
- Always include what went well in the postmortem -- blameless culture requires acknowledging good responses.
Troubleshooting
No Deployment Found in Correlation Window
- Expand the search window to 4-6 hours -- some failures have delayed onset
- Check for infrastructure changes (not just code deployments)
- Look for config changes, feature flag toggles, or certificate expirations
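One way to apply the wider search is to step through progressively larger windows and stop at the first one that contains a change. The helper below is a hypothetical sketch. It assumes `change_times` already combines deployments, config changes, flag toggles, and other change timestamps gathered from your own sources.

```python
from datetime import datetime, timedelta


def find_with_widening(change_times, incident_start, windows_hours=(2, 4, 6)):
    """Return (window_hours, changes) for the smallest window containing a change."""
    for hours in windows_hours:
        window = timedelta(hours=hours)
        hits = sorted(
            t for t in change_times
            if timedelta(0) <= incident_start - t <= window
        )
        if hits:
            return hours, hits
    # Nothing found even in the widest window.
    return None, []
```

Reporting which window produced the match keeps the confidence assessment honest: a change found only at 6 hours is weaker evidence than one found at 2.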
Blast Radius Assessment Missing Services
- The service dependency map may be incomplete -- check for undocumented dependencies
- Look for shared infrastructure (databases, message queues) that multiple services use
- Check for external dependencies (third-party APIs, DNS, CDN)
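Shared infrastructure is easy to miss because it rarely shows up as a service-to-service call. A small sketch, assuming you can list the resources each service uses (the `uses` map and `shared_resources` name are illustrative):

```python
def shared_resources(uses):
    """Return resources used by more than one service.

    `uses` maps a service name to the resources it depends on
    (databases, queues, third-party APIs); assumed input.
    """
    users = {}
    for service, resources in uses.items():
        for resource in resources:
            users.setdefault(resource, set()).add(service)
    return {r: services for r, services in users.items() if len(services) > 1}
```

Any resource in the result that is also used by the failing service is a candidate path for impact to spread to the other listed services.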
Postmortem Missing Timeline Events
- Pull from multiple sources: pipeline executions, alert history, and chat transcripts
- Check if automated rollbacks occurred that may not be in the deployment history
- Include infrastructure events (auto-scaling, node failures) alongside deployment events
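Combining sources is mostly a matter of tagging each event with where it came from and sorting by time. A minimal sketch, assuming each source has already been reduced to `(timestamp, description)` pairs (the `merge_timeline` name and the source labels are illustrative):

```python
from datetime import datetime


def merge_timeline(sources):
    """Merge events from several sources into one chronological list.

    `sources` maps a source label to a list of (timestamp, description) pairs.
    Returns (timestamp, source, description) tuples sorted by time.
    """
    events = []
    for source, items in sources.items():
        for timestamp, description in items:
            events.append((timestamp, source, description))
    return sorted(events)
```

Keeping the source label on each entry makes it easier to spot gaps, for example a rollback that appears in chat but not in the pipeline history.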