Prospective failure analysis using Gary Klein's pre-mortem technique. Assumes complete failure, then works backward to identify risks, leading indicators, and circuit breakers. Counters optimism bias by forcing systematic exploration of failure modes before they materialize. Use for project plans, architecture decisions, technology adoption, business strategy, or feature launches. Triggers on "리스크", "위험", "실패하면", "swing-mortem", "뭐가 잘못될 수 있어", "risk", "what could go wrong", "걱정되는 점", "failure modes", "리스크 분석", "위험 분석".
npx skill4agent add whynowlab/swing-skills swing-mortem

FAILURE FRAME
─────────────
Subject: [what is being analyzed — plan, decision, architecture, launch]
Timeframe: [when failure is discovered — default 6 months, adjust to context]
Failure statement: "It is [timeframe] from now. [Subject] has failed completely.
Not partially underperformed — completely failed. The team is conducting a
post-mortem. What went wrong?"

SCENARIO [N]: [Category] — [Title]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
What happened:
[2-4 sentence specific narrative of how this failure unfolded]
Why it was plausible:
[1-2 sentences on why this wasn't obvious beforehand]
Concrete consequence:
[Specific, measurable impact — revenue lost, users affected, time wasted, data compromised]

RISK MATRIX
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
| # | Category | Scenario | Likelihood | Impact | Priority |
|---|----------------|----------------------|------------|--------------|----------|
| 1 | Technical | [title] | H / M / L | Cat/Sev/Mod | [rank] |
| 2 | Organizational | [title] | H / M / L | Cat/Sev/Mod | [rank] |
| 3 | External | [title] | H / M / L | Cat/Sev/Mod | [rank] |
| 4 | Temporal | [title] | H / M / L | Cat/Sev/Mod | [rank] |
| 5 | Assumption | [title] | H / M / L | Cat/Sev/Mod | [rank] |

LEADING INDICATORS — Scenario [N]: [Title]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Indicator 1: [Name]
Measure: [What specifically to track]
Threshold: [At what value does this become a warning]
Where to observe: [Dashboard, log, metric, report, or manual check]
Lead time: [How far in advance of failure this signal appears]
Indicator 2: [Name]
Measure: [What specifically to track]
Threshold: [At what value does this become a warning]
Where to observe: [Dashboard, log, metric, report, or manual check]
Lead time: [How far in advance of failure this signal appears]

CIRCUIT BREAKER — Scenario [N]: [Title]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Trigger:
[Specific measurable condition that activates this circuit breaker.
Must be a concrete threshold, not "if things go wrong."]
Fallback:
[The alternative path. What do you switch to? Be specific about
the replacement approach, not just "find another way."]
Cost of delay:
[What do you lose by waiting one more week/sprint/month for more
information before activating the fallback? Quantify if possible.]
Decision owner:
[Who has authority to pull this trigger? Role, not name.]

PRE-MORTEM SUMMARY
━━━━━━━━━━━━━━━━━━
The highest risk to [subject] is [specific scenario from top priority].
You'll know it's happening when [most actionable leading indicator with
threshold]. Your escape hatch is [primary fallback from circuit breaker].
The cost of ignoring this: [concrete consequence]. The cost of acting
too early: [trade-off of the fallback]. Monitor [specific metric] starting
[when] to stay ahead of this risk.

## Pre-Mortem: [Subject]
### Failure Frame
> It is [timeframe] from now. [Subject] has failed completely. What went wrong?
### Failure Scenarios
#### Scenario 1: Technical — [Title]
**What happened:** [Narrative]
**Why plausible:** [Reasoning]
**Consequence:** [Specific impact]
#### Scenario 2: Organizational — [Title]
**What happened:** [Narrative]
**Why plausible:** [Reasoning]
**Consequence:** [Specific impact]
#### Scenario 3: External — [Title]
**What happened:** [Narrative]
**Why plausible:** [Reasoning]
**Consequence:** [Specific impact]
#### Scenario 4: Temporal — [Title]
**What happened:** [Narrative]
**Why plausible:** [Reasoning]
**Consequence:** [Specific impact]
#### Scenario 5: Assumption — [Title]
**What happened:** [Narrative]
**Why plausible:** [Reasoning]
**Consequence:** [Specific impact]
### Risk Matrix
| # | Category | Scenario | Likelihood | Impact | Priority |
|---|----------|----------|------------|--------|----------|
| 1 | Technical | ... | ... | ... | ... |
| 2 | Organizational | ... | ... | ... | ... |
| 3 | External | ... | ... | ... | ... |
| 4 | Temporal | ... | ... | ... | ... |
| 5 | Assumption | ... | ... | ... | ... |
### Deep Analysis (Top 3 Risks)
#### Risk [N]: [Title]
**Leading Indicators:**
1. **[Indicator name]** — Measure: [what], Threshold: [value], Where: [location], Lead time: [duration]
2. **[Indicator name]** — Measure: [what], Threshold: [value], Where: [location], Lead time: [duration]
**Circuit Breaker:**
- **Trigger:** [specific measurable condition]
- **Fallback:** [concrete alternative]
- **Cost of delay:** [what you lose by waiting]
- **Decision owner:** [role]
[Repeat for each top risk]
### Pre-Mortem Summary
The highest risk to [subject] is [X]. You'll know it's happening when [Y]. Your escape hatch is [Z]. The cost of ignoring this: [consequence]. The cost of acting too early: [trade-off]. Monitor [metric] starting [when].

Weak example — vague scenarios, no measurable indicators:

## Pre-Mortem: New Microservices Migration
### Failure Scenarios
#### Scenario 1: Technical — It didn't scale
**What happened:** The system couldn't handle the load.
**Why plausible:** Scaling is hard.
**Consequence:** Users were unhappy.
#### Scenario 2: Organizational — Communication issues
**What happened:** Teams didn't communicate well.
**Why plausible:** Communication is always a challenge.
**Consequence:** Things were delayed.
#### Scenario 3: External — Market changed
**What happened:** The market shifted.
**Why plausible:** Markets are unpredictable.
**Consequence:** Revenue was impacted.
### Leading Indicators
- Watch out for scaling issues
- Monitor team communication
- Keep an eye on the market
### Circuit Breakers
- If things go wrong, switch to plan B
- If communication breaks down, have more meetings

Strong example — specific narratives, concrete thresholds, named fallbacks:

## Pre-Mortem: Migrating Order Service from Monolith to Event-Driven Microservices
### Failure Frame
> It is 6 months from now. The microservices migration has failed completely.
> The team reverted to the monolith. What went wrong?
### Failure Scenarios
#### Scenario 1: Technical — Kafka consumer lag cascades into order loss
**What happened:** Under Black Friday load (12x normal), Kafka consumer groups
for the order-processing service fell behind by 4+ hours. The dead letter queue
filled its 10GB allocation. 2,340 orders were silently dropped because the
retry policy exhausted its 3 attempts while downstream inventory service was
backpressured. No alert fired because monitoring tracked consumer group status
(STABLE) rather than consumer lag (which had no threshold configured).
**Why plausible:** Load testing only covered 3x normal traffic. The failure
mode — consumer lag + DLQ overflow + silent drop — requires all three conditions
simultaneously, which wasn't in the test matrix.
**Consequence:** $180K in lost orders, 2,340 customer support tickets, and a
forced rollback to the monolith under production load — itself causing a 45-minute
outage during the revert.
#### Scenario 2: Organizational — Domain boundary mismatch between teams
**What happened:** The "order" bounded context was split between Team Alpha
(order creation) and Team Bravo (fulfillment). Both teams independently implemented
inventory reservation — Alpha with optimistic locking, Bravo with pessimistic
locking. The conflict wasn't discovered until integration testing in week 14 of
a 16-week timeline, requiring a 6-week redesign of the inventory domain model.
**Why plausible:** Domain-Driven Design workshops defined bounded contexts on
paper, but the actual code ownership didn't align. No cross-team code review
process existed for shared domain objects.
**Consequence:** 6-week schedule slip, team morale damage from rework, and the
inventory service shipped with a compatibility shim that became permanent
technical debt.
### Risk Matrix
| # | Category | Scenario | Likelihood | Impact | Priority |
|---|----------|----------|------------|--------|----------|
| 1 | Technical | Kafka consumer lag cascade | Medium | Catastrophic | P1 |
| 2 | Organizational | Domain boundary mismatch | High | Severe | P2 |
### Deep Analysis — Risk 1: Kafka Consumer Lag Cascade
**Leading Indicators:**
1. **Consumer lag growth rate** — Measure: max consumer lag across all
partitions for order-processing group. Threshold: >10,000 messages or
lag growing >500 msg/sec for 5 consecutive minutes. Where: Kafka
monitoring dashboard (Burrow or equivalent). Lead time: 2-4 hours
before DLQ overflow at projected rates.
2. **DLQ fill rate** — Measure: dead letter queue size as percentage of
allocated storage. Threshold: >30% capacity outside of known incident
windows. Where: Kafka topic metrics + PagerDuty alert. Lead time:
1-2 hours before overflow.
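The lag-growth indicator above can be sketched as a simple threshold check over sampled lag values. This is a minimal sketch, not production monitoring: the sample source is assumed to come from Burrow or JMX polling, and the class name is hypothetical.

```python
from collections import deque

# Thresholds taken from the indicator definition above.
ABS_LAG_THRESHOLD = 10_000   # messages
GROWTH_THRESHOLD = 500       # messages/sec
SUSTAIN_SECONDS = 5 * 60     # 5 consecutive minutes

class LagIndicator:
    """Tracks sampled max consumer lag and flags the warning condition."""

    def __init__(self):
        self.samples = deque()  # (timestamp_sec, max_lag) pairs

    def observe(self, timestamp, max_lag):
        """Record a sample; return True once the warning threshold is crossed."""
        self.samples.append((timestamp, max_lag))
        # Drop samples older than the 5-minute sustain window.
        while timestamp - self.samples[0][0] > SUSTAIN_SECONDS:
            self.samples.popleft()

        # Condition 1: absolute lag threshold.
        if max_lag > ABS_LAG_THRESHOLD:
            return True

        # Condition 2: average growth rate, once the window spans 5 full minutes.
        t0, lag0 = self.samples[0]
        if timestamp - t0 >= SUSTAIN_SECONDS:
            return (max_lag - lag0) / (timestamp - t0) > GROWTH_THRESHOLD
        return False
```

In practice the same check would live in the alerting layer (e.g. a Prometheus rule over Burrow's lag export) rather than application code; the point is that both conditions are concrete numbers, not judgment calls.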
**Circuit Breaker:**
- **Trigger:** Consumer lag exceeds 1 hour AND DLQ reaches 50% capacity
during any traffic event exceeding 5x baseline.
- **Fallback:** Activate synchronous HTTP fallback path for order processing
(already exists in monolith, needs a feature flag to route traffic).
Accept the latency penalty (800ms vs 200ms) to guarantee zero order loss.
- **Cost of delay:** Each hour of delay at Black Friday volumes risks ~$30K
in dropped orders. The fallback path adds 600ms latency but loses zero orders.
- **Decision owner:** On-call SRE lead with VP Engineering escalation.
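The trigger above combines three measurable conditions with AND. A minimal sketch of the evaluation, assuming hypothetical metric inputs and a plain dict standing in for a real feature-flag client:

```python
from dataclasses import dataclass

# Trigger thresholds from the circuit breaker above.
LAG_SECONDS_LIMIT = 3600          # consumer lag exceeds 1 hour
DLQ_CAPACITY_LIMIT = 0.50         # DLQ at 50% of its allocation
TRAFFIC_MULTIPLIER_LIMIT = 5.0    # only during >5x baseline events

@dataclass
class Metrics:
    lag_seconds: float         # how far behind consumers are, in time
    dlq_fill_ratio: float      # DLQ used / allocated (0.0-1.0)
    traffic_multiplier: float  # current traffic / baseline

def should_trip(m: Metrics) -> bool:
    """All three conditions must hold before the breaker activates."""
    return (
        m.lag_seconds > LAG_SECONDS_LIMIT
        and m.dlq_fill_ratio >= DLQ_CAPACITY_LIMIT
        and m.traffic_multiplier > TRAFFIC_MULTIPLIER_LIMIT
    )

def route_order(m: Metrics, flags: dict) -> str:
    """Flip the (hypothetical) fallback flag when the breaker trips;
    once tripped, traffic stays on the sync path until manually reset."""
    if should_trip(m):
        flags["orders.sync_http_fallback"] = True
    return "sync-http" if flags.get("orders.sync_http_fallback") else "kafka"
```

Note the breaker latches: the human decision owner resets it, not the code. That matches the template's intent — the automated part is detection, while deactivating the fallback remains a judgment call.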
### Pre-Mortem Summary
The highest risk to the order service migration is Kafka consumer lag cascading
into silent order loss under peak load. You'll know it's happening when consumer
lag exceeds 10,000 messages with a growth rate above 500 msg/sec for 5 minutes.
Your escape hatch is the synchronous HTTP fallback path behind a feature flag.
The cost of ignoring this: thousands of lost orders and a forced rollback under
production pressure. The cost of acting too early: 4x higher latency on order
processing during the traffic spike. Monitor Kafka consumer lag starting 2 weeks
before any projected traffic event exceeding 3x baseline.