Incident Response
Correlate incidents with deployments, assess blast radius, and generate postmortem documents using Harness MCP.
Instructions
Step 1: Establish Scope
Confirm the affected service, environment, and incident details.
Call MCP tool: harness_list
Parameters:
resource_type: "service"
org_id: "<organization>"
project_id: "<project>"
Step 2: Identify the Incident Response Task
Determine which workflow the user needs:
- Deployment-to-Incident Correlation -- Determine if a recent deployment caused the incident
- Blast Radius Assessment -- Map affected services and downstream impact
- Postmortem Generation -- Create a structured postmortem document
Step 3: Correlate Deployment to Incident
Gather from the user:
- Affected service name and environment
- Alert or incident name and start time
- Observed symptoms (error rate spike, latency, outage)
Pull recent deployments:
Call MCP tool: harness_list
Parameters:
resource_type: "execution"
org_id: "<organization>"
project_id: "<project>"
status: "Success"
For each recent deployment, check:
- Timing: Was the deployment within the incident correlation window (e.g., 2 hours before)?
- Service match: Does the deployed service match or depend on the affected service?
- Change content: What changed in the deployment (config, code, infrastructure)?
Build a deployment timeline:
- List all deployments to the affected environment in the last N hours
- Mark the incident start time on the timeline
- Identify the most likely causal deployment (closest before incident start)
- Check if a rollback was performed and whether it resolved the issue
Present findings with confidence level: HIGH (deployment matches timing + service), MEDIUM (timing matches but different service), LOW (no deployment correlation found).
Step 4: Assess Blast Radius
Gather from the user:
- Failing service and failure type (outage, elevated error rate, high latency)
- Current error rate or severity
- Environment
Map the impact:
Call MCP tool: harness_status
Parameters:
org_id: "<organization>"
project_id: "<project>"
Assess:
- Direct impact: The failing service's error rate, latency, and availability
- Upstream callers: Services that call the failing service -- are they degrading?
- Downstream dependencies: Services the failing service depends on -- are they healthy?
- User impact: Estimate affected users based on traffic volume
- Data integrity: Any risk of data corruption or inconsistency?
Classify severity:
- Critical: User-facing outage, data loss risk, or multiple services affected
- Major: Degraded performance affecting users, single service impacted
- Minor: Internal service degraded, no user-facing impact
Recommend immediate actions based on blast radius:
- If deployment-correlated: recommend rollback with expected resolution time
- If infrastructure-related: recommend failover or scaling
- If dependency-related: recommend circuit breaker activation or graceful degradation
Step 5: Generate Postmortem
Gather from the user:
- Service name and incident summary
- Incident duration and environment
- Resolution steps taken
Structure the postmortem:
1. Executive Summary -- What happened, customer impact, duration (2-3 sentences)
2. Timeline -- Build from Harness pipeline events and alert timestamps:
- When was the issue first detected?
- When did the on-call team engage?
- What deployment or change triggered the regression?
- When was mitigation applied and service restored?
Pull timeline data:
Call MCP tool: harness_list
Parameters:
resource_type: "execution"
org_id: "<organization>"
project_id: "<project>"
3. Root Cause Analysis -- Which deployment or change triggered the incident and why
4. Impact Assessment -- Affected services, environments, and approximate user impact
5. Action Items -- Categorized as:
- Immediate fixes (address the root cause)
- Process improvements (prevent recurrence)
- Monitoring improvements (detect faster)
6. Lessons Learned -- What went well, what didn't, and what was lucky
Examples
- "Our payment service is down -- was it a deployment?" -- Correlate incident with recent deployments and provide confidence level
- "What is the blast radius of the checkout outage?" -- Map upstream/downstream services and estimate user impact
- "Generate a postmortem for yesterday's auth-service incident" -- Create structured postmortem with timeline, RCA, and action items
- "A Sev-1 just fired -- which deployment caused it?" -- Pull recent deployments and correlate with alert timing
Performance Notes
- Deployment correlation is most accurate within a 2-hour window -- beyond that, other factors become more likely.
- Blast radius assessment requires an up-to-date service dependency map -- stale maps miss connections.
- Postmortems should be generated within 48 hours while the incident is fresh in the team's memory.
- Always include what went well in the postmortem -- blameless culture requires acknowledging good responses.
Troubleshooting
No Deployment Found in Correlation Window
- Expand the search window to 4-6 hours -- some failures have delayed onset
- Check for infrastructure changes (not just code deployments)
- Look for config changes, feature flag toggles, or certificate expirations
Blast Radius Assessment Missing Services
- The service dependency map may be incomplete -- check for undocumented dependencies
- Look for shared infrastructure (databases, message queues) that multiple services use
- Check for external dependencies (third-party APIs, DNS, CDN)
Postmortem Missing Timeline Events
- Pull from multiple sources: pipeline executions, alert history, and chat transcripts
- Check if automated rollbacks occurred that may not be in the deployment history
- Include infrastructure events (auto-scaling, node failures) alongside deployment events