systematic-debugging

Systematic Debugging

系统化调试

Overview

概述

Debugging is investigation, not experimentation. This skill enforces a rigorous 4-phase process — root cause investigation, pattern analysis, hypothesis testing, and architecture questioning — that prevents shotgun debugging and ensures every fix is understood before it is applied.

Announce at start: "I'm using the systematic-debugging skill to investigate this issue."

调试是调查工作，而非实验。本技能强制执行一套严谨的4阶段流程——根本原因调查、模式分析、假设验证、架构质疑，可避免散弹式调试，确保每次修复在实施前都被充分理解。

开始前声明： "我将使用系统化调试技能来调查此问题。"

Core Principle

核心原则

┌─────────────────────────────────────────────────────────────────┐
│  HARD-GATE: NEVER GUESS. NEVER SHOTGUN DEBUG.                  │
│  NEVER CHANGE CODE WITHOUT UNDERSTANDING WHY IT IS BROKEN.     │
│                                                                 │
│  You are a detective gathering evidence, not a gambler trying   │
│  random fixes. If you are changing code without understanding   │
│  the root cause, STOP immediately.                             │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│  硬门槛：绝不猜测。绝不进行散弹式调试。                         │
│  在未理解代码损坏原因前，绝不修改代码。                          │
│                                                                 │
│  你是收集证据的侦探，不是尝试随机修复的赌徒。如果你在不了解根本   │
│  原因的情况下修改代码，请立即停止。                              │
└─────────────────────────────────────────────────────────────────┘

Phase 1: Root Cause Investigation

第一阶段：根本原因调查

Goal: Understand exactly WHAT is happening, not what you think is happening.

目标： 准确了解实际发生了什么，而非你认为发生了什么。

Actions

执行动作

Read the error message carefully. The entire message. Every line. Including the stack trace.
Reproduce the bug. If you cannot reproduce it, you cannot fix it. Find the exact steps.
Gather evidence. Collect:
- Full error message and stack trace
- Input that triggers the bug
- Expected behavior vs actual behavior
- Environment details (versions, config, OS)
Check recent changes. What changed since this last worked?
- Recent commits (`git log`, `git diff`)
- Dependency updates
- Configuration changes
- Environment changes

仔细阅读错误信息。 完整的信息，每一行都要读，包括堆栈跟踪。
复现Bug。 如果你无法复现，就无法修复它。找到精确的复现步骤。
收集证据。 收集以下内容：
- 完整错误信息和堆栈跟踪
- 触发Bug的输入
- 预期行为与实际行为的差异
- 环境详情（版本、配置、操作系统）
检查近期变更。 上次正常运行之后有哪些改动？
- 近期提交（`git log`、`git diff`）
- 依赖更新
- 配置变更
- 环境变更
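The evidence list above can be captured programmatically. The following is a minimal sketch, not a prescribed tool: `capture_evidence` and `parse_ratio` are hypothetical names introduced for illustration, and the recorded fields simply mirror the bullets above (full error, stack trace, triggering input, environment).

```python
import platform
import sys
import traceback


def capture_evidence(func, *args, **kwargs):
    """Run func; if it raises, return a dict of evidence, otherwise None."""
    try:
        func(*args, **kwargs)
    except Exception as exc:
        return {
            "error_type": type(exc).__name__,
            "message": str(exc),
            "stack_trace": traceback.format_exc(),  # full trace, not truncated
            "input": {"args": args, "kwargs": kwargs},
            "python_version": sys.version,
            "os": platform.platform(),
        }
    return None


def parse_ratio(numerator, denominator):
    # Hypothetical buggy function, used only to trigger an error.
    return numerator / denominator


evidence = capture_evidence(parse_ratio, 1, 0)
print(evidence["error_type"])
print(evidence["stack_trace"])
```

Saving the record alongside the exact reproduction steps keeps the Phase 1 evidence available for re-reading later.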

Evidence Gathering Checklist

证据收集检查清单

Full error message captured (not truncated)
Stack trace read from bottom to top
Bug reproduced reliably with specific steps
Expected vs actual behavior documented
Recent changes reviewed (`git log --oneline -20`)
Relevant logs examined

已捕获完整错误信息（未截断）
已从下到上阅读堆栈跟踪
可通过特定步骤稳定复现Bug
已记录预期行为与实际行为的差异
已审核近期变更（`git log --oneline -20`）
已检查相关日志

STOP — HARD-GATE: Do NOT proceed to Phase 2 until:

停止 — 硬门槛：满足以下条件前，禁止进入第二阶段：

You can reproduce the bug consistently
You have the full error message and stack trace
You know what changed recently
You can describe the bug precisely (not vaguely)

你可以稳定复现该Bug
你已获取完整错误信息和堆栈跟踪
你了解近期发生了哪些变更
你可以精确描述Bug（而非模糊描述）

Phase 2: Pattern Analysis

第二阶段：模式分析

Goal: Narrow down WHERE the problem lives and WHEN it occurs.

目标： 缩小问题范围，明确问题发生的位置和触发时机。

Actions

执行动作

Find working examples. Does this feature work in other contexts? With other inputs? In other environments?
Compare working vs broken. What is different between the case that works and the case that does not?
Check dependencies. Are all required services/libraries/configs present and correct?
Isolate the scope. Can you reproduce with a minimal example? Strip away everything non-essential.

找到正常运行的示例。 该功能在其他上下文、其他输入、其他环境下是否能正常运行？
对比正常与故障场景。 正常运行的案例和故障案例之间有什么差异？
检查依赖。 所有必需的服务/库/配置是否都存在且正确？
隔离范围。 你能用最小示例复现问题吗？剥离所有非必要内容。

Comparison Matrix

对比矩阵

Fill this out to identify the pattern:

Factor	Working Case	Broken Case	Different?
Input data
Environment
Configuration
Dependencies
Timing/order
User/permissions
State/context

填写下表以识别模式：

因素	正常场景	故障场景	是否存在差异？
输入数据
环境
配置
依赖
时序/执行顺序
用户/权限
状态/上下文
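One way to fill in the comparison matrix is to record each case as a dictionary of factors and compute the "Different?" column mechanically. This is a sketch under assumptions: `compare_cases` is a hypothetical helper, and the factor names and values are made up for illustration.

```python
def compare_cases(working, broken):
    """Return rows of (factor, working value, broken value, differs?)."""
    rows = []
    for factor in sorted(set(working) | set(broken)):
        working_value = working.get(factor)
        broken_value = broken.get(factor)
        rows.append((factor, working_value, broken_value, working_value != broken_value))
    return rows


# Illustrative values only; real cases come from your own evidence.
working_case = {"input": "[]", "environment": "staging", "db_host": "db-1"}
broken_case = {"input": "[]", "environment": "production", "db_host": "db-2"}

differences = [row[0] for row in compare_cases(working_case, broken_case) if row[3]]
print(differences)  # ['db_host', 'environment']
```

Factors that differ become candidates for the single hypothesis formed in Phase 3; factors that match can be ruled out.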

STOP — HARD-GATE: Do NOT proceed to Phase 3 until:

停止 — 硬门槛：满足以下条件前，禁止进入第三阶段：

You have identified at least one working case for comparison
You have compared working vs broken and identified differences
You have isolated the scope to the smallest reproducible case
Dependencies have been verified (versions, availability, config)

你已找到至少一个正常运行的案例用于对比
你已对比正常与故障场景并识别出差异
你已将范围隔离到最小可复现案例
依赖已验证（版本、可用性、配置）

Phase 3: Hypothesis and Testing

第三阶段：假设与测试

Goal: Form ONE specific, testable hypothesis and verify it with the smallest possible change.

目标： 形成一个具体的、可测试的假设，并用最小的变更验证它。

Actions

执行动作

Form ONE hypothesis. Based on evidence from Phases 1-2, what is the single most likely cause?
- State it explicitly: "The bug occurs because [specific cause]"
- If you cannot state it specifically, go back to Phase 1 or 2
Design a minimal test. What is the smallest change to confirm or deny this hypothesis?
- Prefer adding a test case over modifying production code
- Prefer logging/assertions over code changes
- Prefer reverting a change over writing new code
Apply the change and test.
- Make ONLY the change needed to test the hypothesis
- Run the test suite
- Observe the result
Evaluate.
- If CONFIRMED: proceed with the fix, write a regression test
- If DENIED: record what you learned, form a new hypothesis, return to step 1

形成一个假设。 基于第一、二阶段收集的证据，最可能的单一原因是什么？
- 明确表述："Bug发生的原因是[具体原因]"
- 如果你无法具体表述，返回第一或第二阶段
设计最小测试。 验证或推翻该假设所需的最小变更是什么？
- 优先添加测试用例，而非修改生产代码
- 优先添加日志/断言，而非修改代码
- 优先回滚变更，而非编写新代码
应用变更并测试。
- 仅做出测试假设所需的变更
- 运行测试套件
- 观察结果
评估结果。
- 如果假设被验证：继续实施修复，编写回归测试用例
- 如果假设被推翻：记录你学到的内容，形成新的假设，返回第一步
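The steps above can be sketched with a small example. All names here (`fetch_items`, `hypothesis_holds`, `fetch_items_fixed`) are hypothetical, and the bug is invented for illustration: the hypothesis is stated explicitly, checked without touching production code, and only then followed by a fix and a regression test.

```python
def fetch_items(response):
    """Hypothetical function under investigation."""
    return response.get("items")


def hypothesis_holds(response):
    """Hypothesis: the bug occurs because 'items' is missing,
    so fetch_items returns None instead of a list."""
    return fetch_items(response) is None


def fetch_items_fixed(response):
    """Fix applied only after the hypothesis was confirmed."""
    return response.get("items") or []


def test_missing_items_regression():
    # The input that triggered the bug must now yield an empty list.
    assert fetch_items_fixed({}) == []
    # Existing behavior for well-formed input must be unchanged.
    assert fetch_items_fixed({"items": [1]}) == [1]


print(hypothesis_holds({}))  # True: the hypothesis is confirmed for this input
test_missing_items_regression()
```

The check in `hypothesis_holds` changes nothing in the code under test, which follows the preference for assertions and test cases over production changes.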

Hypothesis Log Template

假设日志模板

Hypothesis #1: [description]
Test: [what you did]
Result: CONFIRMED / DENIED
Learning: [what this taught you]

Hypothesis #2: ...

假设 #1: [描述]
测试动作: [你执行的操作]
结果: 验证通过 / 推翻
收获: [本次测试的结论]

假设 #2: ...
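The log template can also be kept as a small data structure, which makes the Phase 4 gate (at least 3 hypotheses tested, all denied) easy to check. This is a minimal sketch; the class and method names are assumptions, not part of the skill.

```python
from dataclasses import dataclass, field


@dataclass
class Hypothesis:
    description: str
    test: str
    confirmed: bool
    learning: str


@dataclass
class HypothesisLog:
    entries: list = field(default_factory=list)

    def record(self, description, test, confirmed, learning):
        self.entries.append(Hypothesis(description, test, confirmed, learning))

    def should_escalate(self):
        """True only when 3 or more hypotheses were tested and all were denied."""
        all_denied = all(not entry.confirmed for entry in self.entries)
        return len(self.entries) >= 3 and all_denied


log = HypothesisLog()
log.record("Cache returns stale data", "Disabled cache", False, "Cache is not involved")
log.record("Config differs in production", "Compared configs", False, "Configs match")
print(log.should_escalate())  # False: only two hypotheses tested so far
log.record("Library upgrade changed API", "Pinned old version", False, "Bug persists")
print(log.should_escalate())  # True: three tested, all denied
```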

Decision Table: Hypothesis Testing Approach

决策表：假设测试方法

Hypothesis Type	Testing Method	Example
Recent code change caused it	`git bisect` or revert commit	"The bug was introduced in commit abc123"
Data shape mismatch	Add logging/assertion	"The API returns null instead of array"
Race condition	Add timing logs or serialize	"Request B completes before request A"
Configuration error	Compare configs across environments	"Production uses different DB host"
Dependency version issue	Lock to known-good version	"Library 2.0 changed the API surface"

假设类型	测试方法	示例
近期代码变更导致	`git bisect` 或回滚提交	"Bug是在abc123提交中引入的"
数据结构不匹配	添加日志/断言	"API返回null而非数组"
竞态条件	添加时序日志或序列化执行	"请求B比请求A先完成"
配置错误	跨环境对比配置	"生产环境使用了不同的数据库主机"
依赖版本问题	锁定到已知正常的版本	"2.0版本的库修改了API接口"
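For the first row of the table, it helps to see the idea behind `git bisect`: a binary search over history for the first bad commit. The sketch below illustrates that idea only and is not git's implementation. It assumes commits are ordered oldest to newest, the first is known good, the last is known bad, and every commit after the first bad one is also bad. The commit ids and the `is_bad` check are hypothetical.

```python
def first_bad_commit(commits, is_bad):
    """Binary search for the first bad commit.

    Assumes commits[0] is good, commits[-1] is bad, and badness is
    monotonic (once a commit is bad, all later commits are bad).
    """
    low, high = 0, len(commits) - 1  # low: known good, high: known bad
    while high - low > 1:
        mid = (low + high) // 2
        if is_bad(commits[mid]):
            high = mid
        else:
            low = mid
    return commits[high]


# Hypothetical history; in practice is_bad would run the reproduction steps.
history = ["a1", "b2", "c3", "d4", "e5", "f6"]
bad_commits = {"d4", "e5", "f6"}
print(first_bad_commit(history, lambda commit: commit in bad_commits))  # d4
```

Each check halves the remaining range, so even a long history needs only a handful of reproduction runs.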

STOP — HARD-GATE: Do NOT proceed to Phase 4 unless:

停止 — 硬门槛：满足以下全部条件前，禁止进入第四阶段：

You have tested at least 3 hypotheses and ALL were denied
Each hypothesis was specific and testable
Each test was minimal (one change at a time)
You recorded learnings from each failed hypothesis

你已测试至少3个假设，且全部被推翻
每个假设都是具体且可测试的
每个测试都是最小化的（每次仅修改一处）
你已记录每个失败假设带来的收获

Phase 4: Architecture Questioning

第四阶段：架构质疑

Goal: If 3+ hypotheses have failed, the problem may be structural. Step back and question assumptions.

This phase is triggered ONLY after Phase 3 has been attempted at least 3 times without success.

目标： 如果3个及以上假设都失败，问题可能是结构性的。退一步，质疑所有假设。

本阶段仅在第三阶段至少尝试3次仍未成功时触发。

Actions

执行动作

Question your assumptions. What have you been assuming is true that might not be?
- Is the data shaped the way you think it is?
- Is the control flow what you expect?
- Are the types what you think they are?
- Is the API contract what you assumed?
Question the design. Is the current approach fundamentally flawed?
- Is there a race condition in the design?
- Is there a state management problem?
- Is there an incorrect abstraction?
- Are responsibilities misplaced?
Consider redesign. Sometimes the fix is not a patch but a restructuring.
- Can you simplify the design to eliminate the bug class entirely?
- Is there a pattern that handles this case better?
- Should you replace rather than fix?
Seek external input. If you are stuck:
- Explain the problem to someone else (rubber duck debugging)
- Search for known issues in dependencies
- Check if others have encountered similar problems

质疑你的假设。 哪些你默认成立的事情可能并不正确？
- 数据结构和你想的一样吗？
- 控制流符合你的预期吗？
- 类型和你想的一致吗？
- API契约和你假设的一样吗？
质疑设计。 当前的方案是否存在根本性缺陷？
- 设计中是否存在竞态条件？
- 是否存在状态管理问题？
- 是否存在错误的抽象？
- 职责划分是否错位？
考虑重新设计。 有时修复不是打补丁，而是重构。
- 你能否简化设计，从根源上消除这类Bug？
- 是否有更适合该场景的模式？
- 你是否应该替换而非修复现有实现？
寻求外部输入。 如果你卡住了：
- 向其他人解释问题（小黄鸭调试法）
- 搜索依赖中的已知问题
- 查看是否有其他人遇到过类似问题

STOP — HARD-GATE: Do NOT continue without:

停止 — 硬门槛：满足以下条件前，禁止继续：

Written list of assumptions that were questioned
Explicit decision: patch the current design OR redesign
If redesigning: a plan before implementing
If patching: a new hypothesis informed by the assumption review

已列出所有被质疑的假设的书面清单
已做出明确决策：为当前设计打补丁，或重新设计
如果是重新设计：实施前已有明确方案
如果是打补丁：已有基于假设评审得出的新假设

Debugging Decision Flowchart

调试决策流程图

Error encountered
    |
    v
Can you reproduce it?
    |
    +-- NO --> Gather more information (logs, user reports, monitoring)
    |          Try different inputs, environments, timing
    |          Do NOT proceed until reproducible
    |
    +-- YES -> Read the FULL error message and stack trace
               |
               v
         Is the cause obvious from the error?
               |
               +-- YES -> Form hypothesis, test it (Phase 3)
               |          Still write a regression test
               |
               +-- NO --> Complete Phase 1 evidence gathering
                          |
                          v
                    Find working case for comparison (Phase 2)
                          |
                          v
                    Identify differences
                          |
                          v
                    Form and test hypotheses (Phase 3)
                          |
                          +-- Fixed --> Write regression test, verify
                          |
                          +-- 3+ failed hypotheses --> Phase 4

遇到错误
    |
    v
你能复现它吗？
    |
    +-- 否 --> 收集更多信息（日志、用户反馈、监控）
    |          尝试不同的输入、环境、时序
    |          可复现前禁止继续推进
    |
    +-- 是 --> 阅读完整错误信息和堆栈跟踪
               |
               v
         从错误中能明显看出原因吗？
               |
               +-- 是 --> 形成假设，测试（第三阶段）
               |          仍需编写回归测试用例
               |
               +-- 否 --> 完成第一阶段的证据收集
                          |
                          v
                    找到用于对比的正常案例（第二阶段）
                          |
                          v
                    识别差异
                          |
                          v
                    形成并测试假设（第三阶段）
                          |
                          +-- 已修复 --> 编写回归测试，验证
                          |
                          +-- 3个及以上假设失败 --> 进入第四阶段

Red Flags Table

危险信号表

Red Flag	What It Means	Action
Changing code without understanding the bug	Shotgun debugging	Go back to Phase 1
Fix works but you do not know why	Accidental fix, likely to regress	Investigate until you understand
Same bug keeps coming back	Root cause not addressed	Go to Phase 4, question design
Fix causes new bugs elsewhere	Unexpected coupling	Map dependencies before proceeding
"It works on my machine"	Environment difference	Go to Phase 2, comparison matrix
Fix requires more than 20 lines	Might be a design issue	Go to Phase 4
Debugging for 30+ minutes	Tunnel vision	Take a break, re-read evidence from Phase 1
Reading the same code repeatedly	Missing something fundamental	Get a fresh perspective, explain aloud
Multiple causes seem equally likely	Insufficient investigation	Go back to Phase 1, gather more evidence

危险信号	含义	应对动作
不理解Bug就修改代码	散弹式调试	返回第一阶段
修复生效但你不知道原因	意外修复，很可能复发	继续调查直到完全理解
相同Bug反复出现	根本原因未解决	进入第四阶段，质疑设计
修复导致其他地方出现新Bug	非预期的耦合	继续前先梳理依赖关系
"我本地运行正常"	环境差异	进入第二阶段，填写对比矩阵
修复需要超过20行代码	可能存在设计问题	进入第四阶段
调试超过30分钟	管状视野（思维受限）	休息一下，重读第一阶段的证据
反复阅读同一段代码	遗漏了一些根本性的内容	换个视角，大声解释问题
多个原因看起来可能性相同	调查不充分	返回第一阶段，收集更多证据

Anti-Patterns / Common Mistakes

反模式/常见错误

Anti-Pattern	Why It Is Wrong	Correct Approach
Changing random things to see if bug goes away	Wastes time, introduces new bugs	Form a hypothesis first
Adding try/catch to suppress the error	Hides the real problem	Fix the root cause
Rewriting the feature from scratch	Nuclear option is rarely needed	Isolate and fix the specific issue
Blaming the framework/library without evidence	Usually your code is wrong	Prove the framework bug with minimal repro
Skipping the regression test after fixing	Bug will return	Write the test, always
Fixing symptoms instead of root causes	Patches accumulate, system degrades	Trace to the actual cause
Debugging for 45+ minutes without stepping back	Tunnel vision reduces effectiveness	Take a break, re-read Phase 1 evidence
Ignoring error messages or stack traces	The answer is often in the error	Read every line of the error

反模式	错误原因	正确做法
随机修改代码看Bug是否消失	浪费时间，引入新Bug	先形成假设
添加try/catch抑制错误	掩盖真正的问题	修复根本原因
从头重写整个功能	很少需要用到这种极端方案	隔离并修复具体问题
无证据指责框架/库有问题	通常是你的代码写错了	用最小复现示例证明框架存在Bug
修复后跳过回归测试	Bug会再次出现	始终编写测试用例
修复症状而非根本原因	补丁越积越多，系统逐渐腐化	追踪到真正的原因
调试45分钟以上都没有退一步梳理	管状视野会降低效率	休息一下，重读第一阶段的证据
忽略错误信息或堆栈跟踪	答案通常就在错误信息里	阅读错误的每一行内容
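The "try/catch to suppress the error" row is easiest to see in code. The example below is a hypothetical illustration in Python (`try`/`except`). The function names are invented, and which fix counts as the root cause depends on the actual bug; here it is assumed that empty input should never reach the function.

```python
def average_suppressed(values):
    # Anti-pattern: the exception is swallowed, so the caller silently
    # receives 0 and the real problem (empty input) stays hidden.
    try:
        return sum(values) / len(values)
    except Exception:
        return 0


def average(values):
    # Root cause made explicit: empty input is rejected with a clear
    # error that points at the caller that produced it.
    if not values:
        raise ValueError("average() requires at least one value")
    return sum(values) / len(values)


print(average_suppressed([]))  # 0, which looks like a valid result
print(average([2, 4]))         # 3.0
```

With the second version, the stack trace leads to the code that passed an empty list, which is where the investigation should continue.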

Integration Points

集成点

Skill	Relationship
`test-driven-development`	Every bug fix MUST include a regression test (RED-GREEN cycle)
`verification-before-completion`	After fixing a bug, verify with fresh evidence
`resilient-execution`	When debugging during task execution, pause task, complete debugging, resume
`code-review`	Review the fix for completeness and side effects
`self-learning`	Record new debugging patterns in learned-patterns.md
`acceptance-testing`	Verify fix does not break acceptance criteria

技能	关联关系
`test-driven-development`	每个Bug修复都必须包含回归测试用例（RED-GREEN周期）
`verification-before-completion`	修复Bug后，用新的证据验证
`resilient-execution`	任务执行过程中需要调试时，暂停任务，完成调试后再恢复
`code-review`	评审修复的完整性和副作用
`self-learning`	在learned-patterns.md中记录新的调试模式
`acceptance-testing`	验证修复不会破坏验收标准

Quick Reference: What NOT To Do

快速参考：禁止行为

Do NOT change random things and see if the bug goes away
Do NOT add try/catch to suppress the error
Do NOT rewrite the feature from scratch as a first resort
Do NOT blame the framework/library without evidence
Do NOT skip writing a regression test after fixing
Do NOT fix symptoms instead of root causes
Do NOT debug for more than 45 minutes without stepping back
Do NOT ignore error messages or stack traces

禁止随机修改代码看Bug是否消失
禁止添加try/catch抑制错误
禁止第一选择就是从头重写整个功能
禁止无证据指责框架/库有问题
禁止修复后跳过编写回归测试
禁止修复症状而非根本原因
禁止调试超过45分钟都不退一步梳理
禁止忽略错误信息或堆栈跟踪

Skill Type

技能类型

RIGID — The 4-phase process is mandatory and must be followed in order. Each phase has a HARD-GATE that must be satisfied before proceeding. Never change code without understanding why it is broken.

刚性 — 4阶段流程是强制性的，必须按顺序执行。每个阶段都有硬门槛，必须满足才能进入下一阶段。在未理解代码损坏原因前，绝不修改代码。