swing-mortem
Pre-Mortem
Prospective failure analysis that defeats optimism bias by assuming failure first, then working backward to surface risks, early warnings, and escape hatches.
Based on Gary Klein's pre-mortem technique: instead of asking "Will this work?" (which triggers optimism bias), this skill forces the question: "It's 6 months from now and this has completely failed. What went wrong?"
Key distinction from swing-review:
- swing-review examines the CURRENT state: "what's wrong NOW?"
- swing-mortem examines the FUTURE: "what will go wrong LATER?"
- Adversarial review finds existing flaws. Pre-mortem anticipates flaws that don't exist yet.
Rules (Absolute)
- Never produce generic risks. Every failure scenario must name specific technologies, quantities, timelines, or conditions. "The database might not scale" is banned. "PostgreSQL connection pool exhaustion at >2,000 concurrent users due to long-running analytical queries holding connections for 30s+" is acceptable.
- Exactly 5 scenarios across 5 categories. One Technical, one Organizational, one External, one Temporal, one Assumption. No category may be skipped, no category may have more than one scenario.
- Leading indicators must be observable and measurable. "Watch out for problems" is banned. Every indicator must specify what to measure, what threshold signals danger, and where to observe it.
- Circuit breakers must include a specific trigger condition. "If things go wrong" is banned. Every trigger must be a measurable condition with a concrete threshold.
- The pre-mortem summary is MANDATORY. It is the BLUF (bottom line up front) of the analysis. It must appear at the end and synthesize the highest risk, its leading indicator, and its escape hatch in one paragraph.
- Assume complete failure. Not partial, not "underperformance." The premise is total failure. This extreme framing is what forces creative risk identification — do not soften it.
- Specificity over coverage. One deeply analyzed, plausible failure scenario per category is worth more than five shallow ones. Depth beats breadth.
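The "one scenario per category" rule above lends itself to a mechanical check. The following is a minimal sketch in Python, assuming each scenario is represented only by its category label; the function name and message wording are illustrative and not part of the skill itself.

```python
from collections import Counter

# The five categories named in the rules above.
REQUIRED_CATEGORIES = ("Technical", "Organizational", "External", "Temporal", "Assumption")


def category_violations(scenario_categories):
    """Return human-readable violations of the 'exactly one scenario per category' rule."""
    counts = Counter(scenario_categories)
    problems = []
    for category in REQUIRED_CATEGORIES:
        n = counts.get(category, 0)
        if n == 0:
            problems.append(f"missing category: {category}")
        elif n > 1:
            problems.append(f"{category} has {n} scenarios, expected 1")
    # Anything outside the five named categories is also a violation.
    for category in counts:
        if category not in REQUIRED_CATEGORIES:
            problems.append(f"unknown category: {category}")
    return problems
```

An empty list means the category rule is satisfied. The check says nothing about whether each scenario is specific enough, which still needs human judgment.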
Process
Execute these 6 phases sequentially. Do NOT skip phases.
Phase 1: Set the Failure Frame
Establish the temporal and contextual frame before any analysis.
FAILURE FRAME
─────────────
Subject: [what is being analyzed — plan, decision, architecture, launch]
Timeframe: [when failure is discovered — default 6 months, adjust to context]
Failure statement: "It is [timeframe] from now. [Subject] has failed completely.
Not partially underperformed — completely failed. The team is conducting a
post-mortem. What went wrong?"

If the subject is ambiguous or too broad, ask one clarifying question before proceeding. "Analyze our project" is too vague. "Analyze our migration from MongoDB to PostgreSQL for the user service" is actionable.
Before generating scenarios, gather context:
- If code/architecture exists, read relevant files to ground scenarios in reality
- If a project plan exists, examine timelines and dependencies
- If prior decisions are documented, review the rationale and constraints
Do not generate scenarios from imagination alone when concrete artifacts are available.
Phase 2: Failure Scenario Generation
Generate exactly 5 failure scenarios, one per category. Each scenario must be a specific, plausible narrative — not a generic risk label.
Category 1: Technical
The technology didn't work as expected. Name the specific technology, the specific failure mode, and the specific conditions under which it failed.
Category 2: Organizational
Team, process, or communication broke down. Name the specific team dynamics, handoff points, or process gaps that caused failure.
Category 3: External
Market shifted, competitor moved, regulation changed, or a dependency broke. Name the specific external force and its specific impact.
Category 4: Temporal
Timeline was wrong. Name what took longer (or what window was missed) and by how much, with the specific cascading consequence.
Category 5: Assumption
A core assumption turned out to be false. Name the specific assumption, why it seemed reasonable at the time, and what reality turned out to be.
Format each scenario as:
SCENARIO [N]: [Category] — [Title]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
What happened:
[2-4 sentence specific narrative of how this failure unfolded]
Why it was plausible:
[1-2 sentences on why this wasn't obvious beforehand]
Concrete consequence:
[Specific, measurable impact: revenue lost, users affected, time wasted, data compromised]

Phase 3: Likelihood x Impact Matrix
Rate each scenario and determine priority:
RISK MATRIX
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
| # | Category | Scenario | Likelihood | Impact | Priority |
|---|----------------|----------------------|------------|--------------|----------|
| 1 | Technical | [title] | H / M / L | Cat/Sev/Mod | [rank] |
| 2 | Organizational | [title] | H / M / L | Cat/Sev/Mod | [rank] |
| 3 | External | [title] | H / M / L | Cat/Sev/Mod | [rank] |
| 4 | Temporal | [title] | H / M / L | Cat/Sev/Mod | [rank] |
| 5 | Assumption | [title] | H / M / L | Cat/Sev/Mod | [rank] |

Likelihood: High (>50%), Medium (15-50%), Low (<15%)
Impact: Catastrophic (project killed, irreversible damage), Severe (major rework, significant loss), Moderate (setback, recoverable with effort)
Priority scoring:
- High + Catastrophic = P1
- High + Severe OR Medium + Catastrophic = P2
- Medium + Severe OR High + Moderate = P3
- Everything else = P4
Select the top 3 by priority for detailed analysis in Phases 4-5. In case of tie, prefer higher Likelihood.
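The scoring rules and the tie-break above can be written as a small lookup. This is a sketch under stated assumptions: scenarios are `(title, likelihood, impact)` tuples, ratings use the full words High/Medium/Low and Catastrophic/Severe/Moderate, and the names `priority` and `top_three` are hypothetical helpers rather than part of the skill.

```python
# Lower number means earlier in the tie-break order (higher likelihood wins).
LIKELIHOOD_ORDER = {"High": 0, "Medium": 1, "Low": 2}

# The explicitly listed combinations; everything else falls through to P4.
PRIORITY_TABLE = {
    ("High", "Catastrophic"): 1,
    ("High", "Severe"): 2,
    ("Medium", "Catastrophic"): 2,
    ("Medium", "Severe"): 3,
    ("High", "Moderate"): 3,
}


def priority(likelihood, impact):
    """Map a (likelihood, impact) rating to a priority level 1-4 (P1 is most urgent)."""
    return PRIORITY_TABLE.get((likelihood, impact), 4)


def top_three(scenarios):
    """Pick the three highest-priority scenarios; ties go to the higher likelihood.

    `scenarios` is a list of (title, likelihood, impact) tuples.
    """
    ranked = sorted(
        scenarios,
        key=lambda s: (priority(s[1], s[2]), LIKELIHOOD_ORDER[s[1]]),
    )
    return ranked[:3]
```

Read literally, "Everything else = P4" places a Low-likelihood, Catastrophic-impact scenario in P4, so `priority("Low", "Catastrophic")` returns 4. If that is not the intent, the table is the one place to adjust.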
Phase 4: Leading Indicators
For each of the top 3 risks, identify 2-3 early warning signals. These are observable, measurable conditions that would indicate the failure mode is beginning to materialize — before it's too late.
LEADING INDICATORS — Scenario [N]: [Title]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Indicator 1: [Name]
Measure: [What specifically to track]
Threshold: [At what value does this become a warning]
Where to observe: [Dashboard, log, metric, report, or manual check]
Lead time: [How far in advance of failure this signal appears]
Indicator 2: [Name]
Measure: [What specifically to track]
Threshold: [At what value does this become a warning]
Where to observe: [Dashboard, log, metric, report, or manual check]
Lead time: [How far in advance of failure this signal appears]

Every indicator must pass the "intern test": could a new team member, given this description alone, determine whether the threshold has been crossed? If not, make it more specific.
Phase 5: Circuit Breakers
For each of the top 3 risks, define the decision framework for when and how to change course.
CIRCUIT BREAKER — Scenario [N]: [Title]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Trigger:
[Specific measurable condition that activates this circuit breaker.
Must be a concrete threshold, not "if things go wrong."]
Fallback:
[The alternative path. What do you switch to? Be specific about
the replacement approach, not just "find another way."]
Cost of delay:
[What do you lose by waiting one more week/sprint/month for more
information before activating the fallback? Quantify if possible.]
Decision owner:
[Who has authority to pull this trigger? Role, not name.]

Phase 6: Pre-Mortem Summary (BLUF)
Synthesize the entire analysis into one paragraph. This is the most important output — the reader should be able to read ONLY this paragraph and walk away with the critical insight.
Format:
PRE-MORTEM SUMMARY
━━━━━━━━━━━━━━━━━━
The highest risk to [subject] is [specific scenario from top priority].
You'll know it's happening when [most actionable leading indicator with
threshold]. Your escape hatch is [primary fallback from circuit breaker].
The cost of ignoring this: [concrete consequence]. The cost of acting
too early: [trade-off of the fallback]. Monitor [specific metric] starting
[when] to stay ahead of this risk.

Output Format
Pre-Mortem: [Subject]
Failure Frame
It is [timeframe] from now. [Subject] has failed completely. What went wrong?
Failure Scenarios
Scenario 1: Technical — [Title]
What happened: [Narrative]
Why plausible: [Reasoning]
Consequence: [Specific impact]
Scenario 2: Organizational — [Title]
What happened: [Narrative]
Why plausible: [Reasoning]
Consequence: [Specific impact]
Scenario 3: External — [Title]
What happened: [Narrative]
Why plausible: [Reasoning]
Consequence: [Specific impact]
Scenario 4: Temporal — [Title]
What happened: [Narrative]
Why plausible: [Reasoning]
Consequence: [Specific impact]
Scenario 5: Assumption — [Title]
What happened: [Narrative]
Why plausible: [Reasoning]
Consequence: [Specific impact]
Risk Matrix
| # | Category | Scenario | Likelihood | Impact | Priority |
|---|---|---|---|---|---|
| 1 | Technical | ... | ... | ... | ... |
| 2 | Organizational | ... | ... | ... | ... |
| 3 | External | ... | ... | ... | ... |
| 4 | Temporal | ... | ... | ... | ... |
| 5 | Assumption | ... | ... | ... | ... |
Deep Analysis (Top 3 Risks)
Risk [N]: [Title]
Leading Indicators:
- [Indicator name] — Measure: [what], Threshold: [value], Where: [location], Lead time: [duration]
- [Indicator name] — Measure: [what], Threshold: [value], Where: [location], Lead time: [duration]
Circuit Breaker:
- Trigger: [specific measurable condition]
- Fallback: [concrete alternative]
- Cost of delay: [what you lose by waiting]
- Decision owner: [role]
[Repeat for each top risk]
Pre-Mortem Summary
The highest risk to [subject] is [X]. You'll know it's happening when [Y]. Your escape hatch is [Z]. The cost of ignoring this: [consequence]. The cost of acting too early: [trade-off]. Monitor [metric] starting [when].

Quality Calibration
BAD Pre-Mortem (Don't Do This)
Pre-Mortem: New Microservices Migration
Failure Scenarios
Scenario 1: Technical — It didn't scale
What happened: The system couldn't handle the load.
Why plausible: Scaling is hard.
Consequence: Users were unhappy.
Scenario 2: Organizational — Communication issues
What happened: Teams didn't communicate well.
Why plausible: Communication is always a challenge.
Consequence: Things were delayed.
Scenario 3: External — Market changed
What happened: The market shifted.
Why plausible: Markets are unpredictable.
Consequence: Revenue was impacted.
Leading Indicators
- Watch out for scaling issues
- Monitor team communication
- Keep an eye on the market
Circuit Breakers
- If things go wrong, switch to plan B
- If communication breaks down, have more meetings
**Why this is bad:**
- Every scenario is generic — could apply to literally any project
- No specific technologies, numbers, or conditions named
- "It didn't scale" is not a scenario — it's a category label
- Leading indicators are unmeasurable platitudes ("watch out for", "keep an eye on")
- Circuit breakers have no trigger thresholds and no concrete fallbacks
- "Have more meetings" is not a circuit breaker — it's a coping mechanism
- No risk matrix, no priority ranking, no pre-mortem summary

GOOD Pre-Mortem (Do This)
Pre-Mortem: Migrating Order Service from Monolith to Event-Driven Microservices
Failure Frame
It is 6 months from now. The microservices migration has failed completely. The team reverted to the monolith. What went wrong?
Failure Scenarios
Scenario 1: Technical — Kafka consumer lag cascades into order loss
What happened: Under Black Friday load (12x normal), Kafka consumer groups
for the order-processing service fell behind by 4+ hours. The dead letter queue
filled its 10GB allocation. 2,340 orders were silently dropped because the
retry policy exhausted its 3 attempts while downstream inventory service was
backpressured. No alert fired because monitoring tracked consumer group status
(STABLE) rather than consumer lag (which had no threshold configured).
Why plausible: Load testing only covered 3x normal traffic. The failure
mode — consumer lag + DLQ overflow + silent drop — requires all three conditions
simultaneously, which wasn't in the test matrix.
Consequence: $180K in lost orders, 2,340 customer support tickets, and a
forced rollback to the monolith under production load — itself causing a 45-minute
outage during the revert.
Scenario 2: Organizational — Domain boundary mismatch between teams
What happened: The "order" bounded context was split between Team Alpha
(order creation) and Team Bravo (fulfillment). Both teams independently implemented
inventory reservation — Alpha with optimistic locking, Bravo with pessimistic
locking. The conflict wasn't discovered until integration testing in week 14 of
a 16-week timeline, requiring a 6-week redesign of the inventory domain model.
Why plausible: Domain-Driven Design workshops defined bounded contexts on
paper, but the actual code ownership didn't align. No cross-team code review
process existed for shared domain objects.
Consequence: 6-week schedule slip, team morale damage from rework, and the
inventory service shipped with a compatibility shim that became permanent
technical debt.
Risk Matrix
| # | Category | Scenario | Likelihood | Impact | Priority |
|---|---|---|---|---|---|
| 1 | Technical | Kafka consumer lag cascade | Medium | Catastrophic | P2 |
| 2 | Organizational | Domain boundary mismatch | High | Severe | P2 |
Deep Analysis — Risk 1: Kafka Consumer Lag Cascade
Leading Indicators:
- Consumer lag growth rate — Measure: max consumer lag across all partitions for order-processing group. Threshold: >10,000 messages or lag growing >500 msg/sec for 5 consecutive minutes. Where: Kafka monitoring dashboard (Burrow or equivalent). Lead time: 2-4 hours before DLQ overflow at projected rates.
- DLQ fill rate — Measure: dead letter queue size as percentage of allocated storage. Threshold: >30% capacity outside of known incident windows. Where: Kafka topic metrics + PagerDuty alert. Lead time: 1-2 hours before overflow.
Circuit Breaker:
- Trigger: Consumer lag exceeds 1 hour AND DLQ reaches 50% capacity during any traffic event exceeding 5x baseline.
- Fallback: Activate synchronous HTTP fallback path for order processing (already exists in monolith, needs a feature flag to route traffic). Accept the latency penalty (800ms vs 200ms) to guarantee zero order loss.
- Cost of delay: Each hour of delay at Black Friday volumes risks ~$30K in dropped orders. The fallback path adds 600ms latency but loses zero orders.
- Decision owner: On-call SRE lead with VP Engineering escalation.
Pre-Mortem Summary
The highest risk to the order service migration is Kafka consumer lag cascading
into silent order loss under peak load. You'll know it's happening when consumer
lag exceeds 10,000 messages or grows faster than 500 msg/sec for 5 consecutive minutes.
Your escape hatch is the synchronous HTTP fallback path behind a feature flag.
The cost of ignoring this: thousands of lost orders and a forced rollback under
production pressure. The cost of acting too early: 4x higher latency on order
processing during the traffic spike. Monitor Kafka consumer lag starting 2 weeks
before any projected traffic event exceeding 3x baseline.
**Why this is good:**
- Scenarios name specific technologies (Kafka, DLQ, optimistic/pessimistic locking)
- Quantities are concrete (12x load, 2,340 orders, $180K, 10GB, 45 minutes)
- Leading indicators have precise thresholds (>10,000 messages, >500 msg/sec, >30% capacity)
- Circuit breaker has a measurable trigger, a concrete fallback, and quantified trade-offs
- Pre-mortem summary is a self-contained briefing that an executive could act on
- Failure narratives are causal chains, not labels: each explains HOW the failure unfolded
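Because the consumer-lag indicator and the circuit-breaker trigger in the good example are stated as concrete thresholds, they can be checked in code. The sketch below is illustrative only. It assumes lag is sampled once per minute, that the 5-minute window applies to the growth rate rather than to the absolute lag, and that lag in seconds, DLQ fill fraction, and traffic multiple are already supplied by whatever monitoring is in place. No Kafka or Burrow API is used, and every name here is hypothetical.

```python
from dataclasses import dataclass

LAG_LIMIT_MESSAGES = 10_000   # absolute lag threshold from the example
GROWTH_LIMIT_PER_SEC = 500    # lag growth threshold from the example
SUSTAINED_MINUTES = 5         # growth must persist this long


def growth_sustained(lag_samples, limit_per_sec=GROWTH_LIMIT_PER_SEC, minutes=SUSTAINED_MINUTES):
    """True if lag grew faster than the limit in each of the last `minutes` intervals.

    `lag_samples` holds max consumer lag in messages, one sample per minute, oldest first.
    """
    # Growth per second between consecutive one-minute samples.
    rates = [(b - a) / 60 for a, b in zip(lag_samples, lag_samples[1:])]
    recent = rates[-minutes:]
    return len(recent) == minutes and all(r > limit_per_sec for r in recent)


def lag_indicator_breached(lag_samples):
    """Leading indicator: absolute lag above the limit OR sustained growth."""
    if not lag_samples:
        return False
    return lag_samples[-1] > LAG_LIMIT_MESSAGES or growth_sustained(lag_samples)


@dataclass
class Snapshot:
    lag_seconds: float        # how far behind the consumer group is, in seconds
    dlq_fill_fraction: float  # dead letter queue usage, 0.0 to 1.0
    traffic_multiple: float   # current traffic divided by baseline traffic


def circuit_breaker_tripped(s):
    """All three conditions of the example trigger must hold at the same time."""
    return s.lag_seconds > 3600 and s.dlq_fill_fraction >= 0.5 and s.traffic_multiple > 5
```

Writing a threshold this way is one form of the "intern test". Whether a threshold has been crossed becomes a yes-or-no answer that does not depend on who is reading the dashboard.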
When to Use
- Before committing to a project plan or timeline
- Before making an irreversible architecture decision
- Before adopting a new technology in production
- Before a major feature launch or product pivot
- Before entering a new market or customer segment
- When the team is highly confident and nobody is raising concerns
- When stakes are high and the cost of failure exceeds the cost of analysis
- When someone says "what's the worst that could happen?"
When NOT to Use
- For existing flaws in current code (use swing-review, which examines what's wrong NOW)
- For comparing technology options (use swing-research, which gathers facts, not failure scenarios)
- For generating creative alternatives (use swing-options, a different cognitive mode)
- For routine code review (use engineering:code-review)
- When the decision is trivially reversible (low stakes don't justify the analysis)
- When the team needs encouragement, not caution (read the room)
Integration Notes
- With swing-clarify: Run swing-clarify first on ambiguous requests before invoking this skill. Clarified scope produces better results.
- With swing-review: Complementary, not overlapping. Run swing-mortem first to anticipate future risks, then swing-review on the current implementation to find existing flaws. Together they cover temporal risk: future (swing-mortem) + present (adversarial).
- With swing-research: When a swing-mortem scenario depends on uncertain facts (e.g., "will Kafka handle this load?"), invoke swing-research to verify the factual basis before rating likelihood.
- With swing-options: After swing-mortem reveals high-priority risks, use swing-options to generate alternative approaches that avoid the top failure modes entirely.
- With deep-dive-analyzer: For complex systems, run deep-dive-analyzer first to understand the full architecture, then swing-mortem to identify where it could fail. Understanding precedes risk analysis.
- With skill-composer: Chain as deep-dive-analyzer → swing-mortem → swing-options for a full "understand → anticipate failure → generate alternatives" pipeline.
- With orchestrator strategy team: Pre-mortem output feeds directly into the strategy team's Devil's Advocate agent for additional stress-testing of the risk assessment itself.