systematic-debugging
Systematic Debugging
Overview
Debugging is investigation, not experimentation. This skill enforces a rigorous 4-phase process — root cause investigation, pattern analysis, hypothesis testing, and architecture questioning — that prevents shotgun debugging and ensures every fix is understood before it is applied.
Announce at start: "I'm using the systematic-debugging skill to investigate this issue."
Core Principle
┌─────────────────────────────────────────────────────────────────┐
│ HARD-GATE: NEVER GUESS. NEVER SHOTGUN DEBUG. │
│ NEVER CHANGE CODE WITHOUT UNDERSTANDING WHY IT IS BROKEN. │
│ │
│ You are a detective gathering evidence, not a gambler trying │
│ random fixes. If you are changing code without understanding │
│ the root cause, STOP immediately. │
└─────────────────────────────────────────────────────────────────┘
Phase 1: Root Cause Investigation
Goal: Understand exactly WHAT is happening, not what you think is happening.
Actions
- Read the error message carefully. The entire message. Every line. Including the stack trace.
- Reproduce the bug. If you cannot reproduce it, you cannot fix it. Find the exact steps.
- Gather evidence. Collect:
  - Full error message and stack trace
  - Input that triggers the bug
  - Expected behavior vs actual behavior
  - Environment details (versions, config, OS)
- Check recent changes. What changed since this last worked?
  - Recent commits (`git log`, `git diff`)
  - Dependency updates
  - Configuration changes
  - Environment changes
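The evidence listed above can be collected in one place rather than from memory. The following is a minimal sketch, assuming a Python codebase; `capture_evidence` and the buggy `average` function are hypothetical names introduced only for illustration.

```python
import platform
import sys
import traceback


def capture_evidence(func, *args, expected=None):
    """Run func(*args) and record what actually happened, not what we assume."""
    evidence = {
        "input": repr(args),
        "expected": repr(expected),
        "environment": {
            "python": sys.version.split()[0],
            "os": platform.system(),
        },
    }
    try:
        evidence["actual"] = repr(func(*args))
        evidence["traceback"] = None
    except Exception:
        # Keep the full, untruncated traceback text.
        evidence["actual"] = None
        evidence["traceback"] = traceback.format_exc()
    return evidence


def average(values):  # hypothetical buggy function under investigation
    return sum(values) / len(values)


report = capture_evidence(average, [], expected=0.0)
print(report["traceback"])
```

Storing the triggering input next to the traceback also gives you the exact reproduction step that Phase 1 asks for.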
Evidence Gathering Checklist
- Full error message captured (not truncated)
- Stack trace read from bottom to top
- Bug reproduced reliably with specific steps
- Expected vs actual behavior documented
- Recent changes reviewed (`git log --oneline -20`)
- Relevant logs examined
STOP — HARD-GATE: Do NOT proceed to Phase 2 until:
- You can reproduce the bug consistently
- You have the full error message and stack trace
- You know what changed recently
- You can describe the bug precisely (not vaguely)
Phase 2: Pattern Analysis
Goal: Narrow down WHERE the problem lives and WHEN it occurs.
Actions
- Find working examples. Does this feature work in other contexts? With other inputs? In other environments?
- Compare working vs broken. What is different between the case that works and the case that does not?
- Check dependencies. Are all required services/libraries/configs present and correct?
- Isolate the scope. Can you reproduce with a minimal example? Strip away everything non-essential.
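Isolating the scope can be partly mechanical. The sketch below greedily removes one element at a time from a failing input and keeps each removal only if the failure still reproduces. It is a simplified illustration, not a full delta-debugging implementation; `shrink` and `still_fails` are hypothetical names, and the "fails when a None is present" rule is an assumed bug.

```python
def shrink(items, still_fails):
    """Greedily drop elements while the failure still reproduces."""
    current = list(items)
    changed = True
    while changed:
        changed = False
        for i in range(len(current)):
            candidate = current[:i] + current[i + 1:]
            if still_fails(candidate):
                # The element at index i was not needed to trigger the bug.
                current = candidate
                changed = True
                break
    return current


def still_fails(records):  # hypothetical reproduction check
    # Assumed bug for illustration: processing breaks whenever a None is present.
    return any(r is None for r in records)


minimal = shrink([3, "a", None, 7.5, {"k": 1}], still_fails)
print(minimal)
```

Whatever survives the shrinking is the minimal example: everything non-essential has been stripped away.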
Comparison Matrix
Fill this out to identify the pattern:
| Factor | Working Case | Broken Case | Different? |
|---|---|---|---|
| Input data | | | |
| Environment | | | |
| Configuration | | | |
| Dependencies | | | |
| Timing/order | | | |
| User/permissions | | | |
| State/context | | | |
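When the working and broken cases can each be described as key/value facts, the "Different?" column can be computed rather than eyeballed. This is a minimal sketch; the factor values are invented for illustration and `comparison_matrix` is a hypothetical helper.

```python
def comparison_matrix(working, broken):
    """Return rows of (factor, working value, broken value, different?)."""
    factors = sorted(set(working) | set(broken))
    return [
        (f, working.get(f), broken.get(f), working.get(f) != broken.get(f))
        for f in factors
    ]


# Hypothetical observations, for illustration only.
working = {
    "Environment": "staging",
    "Configuration": "timeout=30",
    "Dependencies": "libfoo 1.4",
}
broken = {
    "Environment": "production",
    "Configuration": "timeout=30",
    "Dependencies": "libfoo 2.0",
}

rows = comparison_matrix(working, broken)
for factor, ok, bad, differs in rows:
    print(f"| {factor} | {ok} | {bad} | {'yes' if differs else 'no'} |")

# Only the factors that differ are candidates for a Phase 3 hypothesis.
suspects = [factor for factor, _, _, differs in rows if differs]
```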
STOP — HARD-GATE: Do NOT proceed to Phase 3 until:
- You have identified at least one working case for comparison
- You have compared working vs broken and identified differences
- You have isolated the scope to the smallest reproducible case
- Dependencies have been verified (versions, availability, config)
Phase 3: Hypothesis and Testing
Goal: Form ONE specific, testable hypothesis and verify it with the smallest possible change.
Actions
1. Form ONE hypothesis. Based on evidence from Phases 1-2, what is the single most likely cause?
   - State it explicitly: "The bug occurs because [specific cause]"
   - If you cannot state it specifically, go back to Phase 1 or 2
2. Design a minimal test. What is the smallest change to confirm or deny this hypothesis?
   - Prefer adding a test case over modifying production code
   - Prefer logging/assertions over code changes
   - Prefer reverting a change over writing new code
3. Apply the change and test.
   - Make ONLY the change needed to test the hypothesis
   - Run the test suite
   - Observe the result
4. Evaluate.
   - If CONFIRMED: proceed with the fix, write a regression test
   - If DENIED: record what you learned, form a new hypothesis, return to step 1
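As one illustration of "prefer logging/assertions over code changes", a hypothesis such as "the API returns null instead of an array" can be tested with a single assertion against a captured payload, leaving production logic untouched. The payload shape and the function names below are assumptions made for this sketch.

```python
def check_items_shape(response):
    """Assert the hypothesis-relevant shape without changing production logic."""
    items = response.get("items")
    assert isinstance(items, list), (
        f"expected 'items' to be a list, got {type(items).__name__}: {items!r}"
    )
    return items


def evaluate_hypothesis(response):
    """Return CONFIRMED if the assertion fires, DENIED otherwise."""
    try:
        check_items_shape(response)
    except AssertionError as exc:
        return "CONFIRMED", str(exc)
    return "DENIED", "items had the expected shape"


# Hypothetical payload captured during Phase 1.
result, detail = evaluate_hypothesis({"items": None})
print(result, detail)
```

If the result is CONFIRMED, the same payload can become the input of the regression test that accompanies the fix.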
Hypothesis Log Template
Hypothesis #1: [description]
Test: [what you did]
Result: CONFIRMED / DENIED
Learning: [what this taught you]
Hypothesis #2: ...
Decision Table: Hypothesis Testing Approach
| Hypothesis Type | Testing Method | Example |
|---|---|---|
| Recent code change caused it | | "The bug was introduced in commit abc123" |
| Data shape mismatch | Add logging/assertion | "The API returns null instead of array" |
| Race condition | Add timing logs or serialize | "Request B completes before request A" |
| Configuration error | Compare configs across environments | "Production uses different DB host" |
| Dependency version issue | Lock to known-good version | "Library 2.0 changed the API surface" |
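For a hypothesis of the form "the bug was introduced in commit abc123", one common way to locate the commit is bisection, which is the idea behind `git bisect`. The sketch below shows only the search logic over an in-memory list; the commit IDs and the `is_bad` check are hypothetical, and it assumes the bug stays present in every commit after it first appears.

```python
def find_first_bad(commits, is_bad):
    """Binary-search an oldest-to-newest commit list for the first bad commit.

    Assumes commits[0] is known good, commits[-1] is known bad, and that once
    the bug appears it stays present in every later commit.
    """
    low, high = 0, len(commits) - 1  # low is known good, high is known bad
    while high - low > 1:
        mid = (low + high) // 2
        if is_bad(commits[mid]):
            high = mid
        else:
            low = mid
    return commits[high]


# Hypothetical history; in practice is_bad would check out the commit
# and run the reproduction steps from Phase 1.
history = ["a1", "b2", "c3", "d4", "e5", "f6"]
broken_from = history.index("d4")
first_bad = find_first_bad(history, lambda c: history.index(c) >= broken_from)
print(first_bad)
```

Each probe is a single, minimal test, so the search fits the one-change-at-a-time rule of this phase.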
STOP — HARD-GATE: Do NOT proceed to Phase 4 unless:
- You have tested at least 3 hypotheses and ALL were denied
- Each hypothesis was specific and testable
- Each test was minimal (one change at a time)
- You recorded learnings from each failed hypothesis
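The hypothesis log and the Phase 4 gate can be kept honest with a small data structure. This is a sketch of one possible encoding; the class and method names are hypothetical and the gate covers only the conditions that can be checked mechanically.

```python
from dataclasses import dataclass, field


@dataclass
class Hypothesis:
    description: str
    test: str
    result: str  # "CONFIRMED" or "DENIED"
    learning: str


@dataclass
class HypothesisLog:
    entries: list = field(default_factory=list)

    def record(self, description, test, result, learning):
        if result not in ("CONFIRMED", "DENIED"):
            raise ValueError("result must be CONFIRMED or DENIED")
        self.entries.append(Hypothesis(description, test, result, learning))

    def may_enter_phase_4(self):
        """Gate: at least 3 hypotheses tested, all denied, each with a learning."""
        return (
            len(self.entries) >= 3
            and all(e.result == "DENIED" for e in self.entries)
            and all(e.learning.strip() for e in self.entries)
        )


log = HypothesisLog()
log.record("Cache returns stale entry", "Disabled cache", "DENIED", "Cache is not involved")
log.record("Timeout too short", "Raised timeout", "DENIED", "Failure is not timing related")
gate_after_two = log.may_enter_phase_4()
log.record("Wrong DB host in config", "Compared configs", "DENIED", "Configs are identical")
gate_after_three = log.may_enter_phase_4()
```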
Phase 4: Architecture Questioning
Goal: If 3+ hypotheses have failed, the problem may be structural. Step back and question assumptions.
This phase is triggered ONLY after Phase 3 has been attempted at least 3 times without success.
Actions
- Question your assumptions. What have you been assuming is true that might not be?
  - Is the data shaped the way you think it is?
  - Is the control flow what you expect?
  - Are the types what you think they are?
  - Is the API contract what you assumed?
- Question the design. Is the current approach fundamentally flawed?
  - Is there a race condition in the design?
  - Is there a state management problem?
  - Is there an incorrect abstraction?
  - Are responsibilities misplaced?
- Consider redesign. Sometimes the fix is not a patch but a restructuring.
  - Can you simplify the design to eliminate the bug class entirely?
  - Is there a pattern that handles this case better?
  - Should you replace rather than fix?
- Seek external input. If you are stuck:
  - Explain the problem to someone else (rubber duck debugging)
  - Search for known issues in dependencies
  - Check if others have encountered similar problems
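Questions such as "are the types what you think they are?" can be turned into explicit checks. Running them produces the written list of questioned assumptions that this phase calls for. The payload and the three assumptions below are hypothetical, and `review_assumptions` is an illustrative helper.

```python
def review_assumptions(assumptions):
    """Evaluate each (description, check) pair and return a written record."""
    record = []
    for description, check in assumptions:
        try:
            holds = bool(check())
        except Exception as exc:
            # A check that cannot even run is itself evidence of a wrong assumption.
            holds = False
            description = f"{description} (check raised {type(exc).__name__})"
        record.append((description, "holds" if holds else "VIOLATED"))
    return record


payload = {"user_id": "42", "tags": None}  # hypothetical captured data

assumptions = [
    ("user_id is an int", lambda: isinstance(payload["user_id"], int)),
    ("tags is a list", lambda: isinstance(payload["tags"], list)),
    ("payload has a user_id key", lambda: "user_id" in payload),
]

record = review_assumptions(assumptions)
for description, status in record:
    print(f"{status:8} {description}")
```

Each violated assumption is a candidate for the new hypothesis if the decision is to patch rather than redesign.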
STOP — HARD-GATE: Do NOT continue without:
- Written list of assumptions that were questioned
- Explicit decision: patch the current design OR redesign
- If redesigning: a plan before implementing
- If patching: a new hypothesis informed by the assumption review
Debugging Decision Flowchart
Error encountered
    |
    v
Can you reproduce it?
    |
    +-- NO --> Gather more information (logs, user reports, monitoring)
    |          Try different inputs, environments, timing
    |          Do NOT proceed until reproducible
    |
    +-- YES -> Read the FULL error message and stack trace
                   |
                   v
               Is the cause obvious from the error?
                   |
                   +-- YES -> Form hypothesis, test it (Phase 3)
                   |          Still write a regression test
                   |
                   +-- NO --> Complete Phase 1 evidence gathering
                                  |
                                  v
                              Find working case for comparison (Phase 2)
                                  |
                                  v
                              Identify differences
                                  |
                                  v
                              Form and test hypotheses (Phase 3)
                                  |
                                  +-- Fixed --> Write regression test, verify
                                  |
                                  +-- 3+ failed hypotheses --> Phase 4
Red Flags Table
| Red Flag | What It Means | Action |
|---|---|---|
| Changing code without understanding the bug | Shotgun debugging | Go back to Phase 1 |
| Fix works but you do not know why | Accidental fix, likely to regress | Investigate until you understand |
| Same bug keeps coming back | Root cause not addressed | Go to Phase 4, question design |
| Fix causes new bugs elsewhere | Unexpected coupling | Map dependencies before proceeding |
| "It works on my machine" | Environment difference | Go to Phase 2, comparison matrix |
| Fix requires more than 20 lines | Might be a design issue | Go to Phase 4 |
| Debugging for 30+ minutes | Tunnel vision | Take a break, re-read evidence from Phase 1 |
| Reading the same code repeatedly | Missing something fundamental | Get a fresh perspective, explain aloud |
| Multiple causes seem equally likely | Insufficient investigation | Go back to Phase 1, gather more evidence |
Anti-Patterns / Common Mistakes
| Anti-Pattern | Why It Is Wrong | Correct Approach |
|---|---|---|
| Changing random things to see if bug goes away | Wastes time, introduces new bugs | Form a hypothesis first |
| Adding try/catch to suppress the error | Hides the real problem | Fix the root cause |
| Rewriting the feature from scratch | Nuclear option is rarely needed | Isolate and fix the specific issue |
| Blaming the framework/library without evidence | Usually your code is wrong | Prove the framework bug with minimal repro |
| Skipping the regression test after fixing | Bug will return | Write the test, always |
| Fixing symptoms instead of root causes | Patches accumulate, system degrades | Trace to the actual cause |
| Debugging for 45+ minutes without stepping back | Tunnel vision reduces effectiveness | Take a break, re-read Phase 1 evidence |
| Ignoring error messages or stack traces | The answer is often in the error | Read every line of the error |
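The "try/catch to suppress the error" row is easiest to see side by side. In this sketch the root cause is assumed to be unvalidated input reaching a port parser; both functions are hypothetical, and the right fix in a real codebase depends on what the investigation actually found.

```python
def parse_port_suppressed(raw):
    """Anti-pattern: swallow the error and return a default."""
    try:
        return int(raw)
    except Exception:
        return 0  # the real problem (bad input) is now invisible


def parse_port(raw):
    """Root-cause version: reject bad input loudly, with context."""
    text = str(raw).strip()
    if not text.isdigit():
        raise ValueError(f"port must be a decimal integer, got {raw!r}")
    port = int(text)
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return port


print(parse_port_suppressed("80a"))  # silently yields a bogus port
print(parse_port("8080"))
```

The suppressed version moves the failure somewhere else, far from its cause; the validating version fails at the point where the evidence is still available.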
Integration Points
| Skill | Relationship |
|---|---|
| | Every bug fix MUST include a regression test (RED-GREEN cycle) |
| | After fixing a bug, verify with fresh evidence |
| | When debugging during task execution, pause task, complete debugging, resume |
| | Review the fix for completeness and side effects |
| | Record new debugging patterns in learned-patterns.md |
| | Verify fix does not break acceptance criteria |
Quick Reference: What NOT To Do
- Do NOT change random things and see if the bug goes away
- Do NOT add try/catch to suppress the error
- Do NOT rewrite the feature from scratch as a first resort
- Do NOT blame the framework/library without evidence
- Do NOT skip writing a regression test after fixing
- Do NOT fix symptoms instead of root causes
- Do NOT debug for more than 45 minutes without stepping back
- Do NOT ignore error messages or stack traces
Skill Type
RIGID — The 4-phase process is mandatory and must be followed in order. Each phase has a HARD-GATE that must be satisfied before proceeding. Never change code without understanding why it is broken.