incident-response

Original：🇺🇸 English

Translated

Incident response and analysis via Harness MCP. Correlate incidents with recent deployments, assess blast radius and downstream service impact, and generate comprehensive postmortem documents. Use when asked to investigate an incident, determine if a deployment caused an issue, assess blast radius, or create a postmortem. Do NOT use for pipeline debugging (use debug-pipeline instead) or SLO management (use manage-slos instead). Trigger phrases: incident, deployment correlation, blast radius, postmortem, root cause, service impact, outage analysis, rollback decision, incident timeline, deployment caused, which deploy.

7installs

Sourceharness/harness-skills

Added on2026-05-25

NPX Install

npx skill4agent add harness/harness-skills incident-response

SKILL.md Content

View Translation Comparison →

Incident Response

Correlate incidents with deployments, assess blast radius, and generate postmortem documents using Harness MCP.

Instructions

Step 1: Establish Scope

Confirm the affected service, environment, and incident details.

Call MCP tool: harness_list
Parameters:
  resource_type: "service"
  org_id: "<organization>"
  project_id: "<project>"

Step 2: Identify the Incident Response Task

Determine which workflow the user needs:

Deployment-to-Incident Correlation -- Determine if a recent deployment caused the incident
Blast Radius Assessment -- Map affected services and downstream impact
Postmortem Generation -- Create a structured postmortem document

Step 3: Correlate Deployment to Incident

Gather from the user:

Affected service name and environment
Alert or incident name and start time
Observed symptoms (error rate spike, latency, outage)

Pull recent deployments:

Call MCP tool: harness_list
Parameters:
  resource_type: "execution"
  org_id: "<organization>"
  project_id: "<project>"
  status: "Success"

For each recent deployment, check:

Timing: Was the deployment within the incident correlation window (e.g., 2 hours before)?
Service match: Does the deployed service match or depend on the affected service?
Change content: What changed in the deployment (config, code, infrastructure)?

Build a deployment timeline:

List all deployments to the affected environment in the last N hours
Mark the incident start time on the timeline
Identify the most likely causal deployment (closest before incident start)
Check if a rollback was performed and whether it resolved the issue

Present findings with confidence level: HIGH (deployment matches timing + service), MEDIUM (timing matches but different service), LOW (no deployment correlation found).

Step 4: Assess Blast Radius

Gather from the user:

Failing service and failure type (outage, elevated error rate, high latency)
Current error rate or severity
Environment

Map the impact:

Call MCP tool: harness_status
Parameters:
  org_id: "<organization>"
  project_id: "<project>"

Assess:

Direct impact: The failing service's error rate, latency, and availability
Upstream callers: Services that call the failing service -- are they degrading?
Downstream dependencies: Services the failing service depends on -- are they healthy?
User impact: Estimate affected users based on traffic volume
Data integrity: Any risk of data corruption or inconsistency?

Classify severity:

Critical: User-facing outage, data loss risk, or multiple services affected
Major: Degraded performance affecting users, single service impacted
Minor: Internal service degraded, no user-facing impact

Recommend immediate actions based on blast radius:

If deployment-correlated: recommend rollback with expected resolution time
If infrastructure-related: recommend failover or scaling
If dependency-related: recommend circuit breaker activation or graceful degradation

Step 5: Generate Postmortem

Gather from the user:

Service name and incident summary
Incident duration and environment
Resolution steps taken

Structure the postmortem:

1. Executive Summary -- What happened, customer impact, duration (2-3 sentences)

2. Timeline -- Build from Harness pipeline events and alert timestamps:

When was the issue first detected?
When did the on-call team engage?
What deployment or change triggered the regression?
When was mitigation applied and service restored?

Pull timeline data:

Call MCP tool: harness_list
Parameters:
  resource_type: "execution"
  org_id: "<organization>"
  project_id: "<project>"

3. Root Cause Analysis -- Which deployment or change triggered the incident and why

4. Impact Assessment -- Affected services, environments, and approximate user impact

5. Action Items -- Categorized as:

Immediate fixes (address the root cause)
Process improvements (prevent recurrence)
Monitoring improvements (detect faster)

6. Lessons Learned -- What went well, what didn't, and what was lucky

Examples

"Our payment service is down -- was it a deployment?" -- Correlate incident with recent deployments and provide confidence level
"What is the blast radius of the checkout outage?" -- Map upstream/downstream services and estimate user impact
"Generate a postmortem for yesterday's auth-service incident" -- Create structured postmortem with timeline, RCA, and action items
"A Sev-1 just fired -- which deployment caused it?" -- Pull recent deployments and correlate with alert timing

Performance Notes

Deployment correlation is most accurate within a 2-hour window -- beyond that, other factors become more likely.
Blast radius assessment requires an up-to-date service dependency map -- stale maps miss connections.
Postmortems should be generated within 48 hours while the incident is fresh in the team's memory.
Always include what went well in the postmortem -- blameless culture requires acknowledging good responses.

Troubleshooting

No Deployment Found in Correlation Window

Expand the search window to 4-6 hours -- some failures have delayed onset
Check for infrastructure changes (not just code deployments)
Look for config changes, feature flag toggles, or certificate expirations

Blast Radius Assessment Missing Services

The service dependency map may be incomplete -- check for undocumented dependencies
Look for shared infrastructure (databases, message queues) that multiple services use
Check for external dependencies (third-party APIs, DNS, CDN)

Postmortem Missing Timeline Events

Pull from multiple sources: pipeline executions, alert history, and chat transcripts
Check if automated rollbacks occurred that may not be in the deployment history
Include infrastructure events (auto-scaling, node failures) alongside deployment events

incident-response

NPX Install

Tags

SKILL.md Content

Incident Response

Instructions

Step 1: Establish Scope

Step 2: Identify the Incident Response Task

Step 3: Correlate Deployment to Incident

Step 4: Assess Blast Radius

Step 5: Generate Postmortem

Examples

Performance Notes

Troubleshooting

No Deployment Found in Correlation Window

Blast Radius Assessment Missing Services

Postmortem Missing Timeline Events