systematic-debugging

Systematic Debugging


Overview


Debugging is investigation, not experimentation. This skill enforces a rigorous 4-phase process — root cause investigation, pattern analysis, hypothesis testing, and architecture questioning — that prevents shotgun debugging and ensures every fix is understood before it is applied.
Announce at start: "I'm using the systematic-debugging skill to investigate this issue."


Core Principle


┌─────────────────────────────────────────────────────────────────┐
│  HARD-GATE: NEVER GUESS. NEVER SHOTGUN DEBUG.                  │
│  NEVER CHANGE CODE WITHOUT UNDERSTANDING WHY IT IS BROKEN.     │
│                                                                 │
│  You are a detective gathering evidence, not a gambler trying   │
│  random fixes. If you are changing code without understanding   │
│  the root cause, STOP immediately.                             │
└─────────────────────────────────────────────────────────────────┘


Phase 1: Root Cause Investigation

Goal: Understand exactly WHAT is happening, not what you think is happening.

Actions

  1. Read the error message carefully. The entire message. Every line. Including the stack trace.
  2. Reproduce the bug. If you cannot reproduce it, you cannot fix it. Find the exact steps.
  3. Gather evidence. Collect:
    • Full error message and stack trace
    • Input that triggers the bug
    • Expected behavior vs actual behavior
    • Environment details (versions, config, OS)
  4. Check recent changes. What changed since this last worked?
    • Recent commits (git log, git diff)
    • Dependency updates
    • Configuration changes
    • Environment changes
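
As a rough illustration of the evidence list in step 3, the Python sketch below captures the full stack trace, the triggering input, and basic environment details in one record. The names `collect_evidence` and `parse_port` are hypothetical and exist only for this example; a real project would record whichever environment details matter to it.

```python
import platform
import sys
import traceback


def collect_evidence(func, *args, **kwargs):
    """Run func; if it raises, return a dict of debugging evidence, else None."""
    try:
        func(*args, **kwargs)
    except Exception as exc:
        return {
            "error_type": type(exc).__name__,
            "message": str(exc),
            # Full, untruncated stack trace as text.
            "stack_trace": traceback.format_exc(),
            # The exact input that triggered the failure.
            "input": {"args": args, "kwargs": kwargs},
            # Minimal environment details; extend as needed.
            "python_version": sys.version.split()[0],
            "os": platform.platform(),
        }
    return None


def parse_port(value):
    """Hypothetical function under investigation."""
    return int(value)


evidence = collect_evidence(parse_port, "80a")
print(evidence["error_type"], "-", evidence["message"])
print(evidence["stack_trace"])
```

Keeping the record as plain data makes it easy to paste into a bug report or compare against a later run.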

Evidence Gathering Checklist


  • Full error message captured (not truncated)
  • Stack trace read from bottom to top
  • Bug reproduced reliably with specific steps
  • Expected vs actual behavior documented
  • Recent changes reviewed (git log --oneline -20)
  • Relevant logs examined

STOP — HARD-GATE: Do NOT proceed to Phase 2 until:


  • You can reproduce the bug consistently
  • You have the full error message and stack trace
  • You know what changed recently
  • You can describe the bug precisely (not vaguely)


Phase 2: Pattern Analysis

Goal: Narrow down WHERE the problem lives and WHEN it occurs.

Actions

  1. Find working examples. Does this feature work in other contexts? With other inputs? In other environments?
  2. Compare working vs broken. What is different between the case that works and the case that does not?
  3. Check dependencies. Are all required services/libraries/configs present and correct?
  4. Isolate the scope. Can you reproduce with a minimal example? Strip away everything non-essential.
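
One way to approach step 4 is to shrink a failing input mechanically. The sketch below is a simple greedy reduction, not a full delta-debugging algorithm: it drops one element at a time and keeps the smaller input whenever the failure still reproduces. The `still_fails` predicate and the sample data are assumptions made up for illustration.

```python
def minimize(items, still_fails):
    """Greedily remove elements while the failure still reproduces."""
    current = list(items)
    changed = True
    while changed:
        changed = False
        for i in range(len(current)):
            candidate = current[:i] + current[i + 1:]
            if still_fails(candidate):
                # The element at index i was not needed to trigger the bug.
                current = candidate
                changed = True
                break
    return current


# Hypothetical bug: it only appears when both 3 and 7 are in the input.
def still_fails(values):
    return 3 in values and 7 in values


print(minimize(range(10), still_fails))  # [3, 7]
```

The result is the smallest input this greedy pass can find; it is a starting point for a minimal reproduction, not a proof of minimality.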

Comparison Matrix


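Fill this out to identify the pattern:

| Factor | Working Case | Broken Case | Different? |
|---|---|---|---|
| Input data | | | |
| Environment | | | |
| Configuration | | | |
| Dependencies | | | |
| Timing/order | | | |
| User/permissions | | | |
| State/context | | | |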
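
The comparison matrix can also be filled in programmatically when both cases can be described as key/value data. The following is a minimal sketch under that assumption; `compare_cases` and the sample values are hypothetical.

```python
def compare_cases(working, broken):
    """Return (factor, working value, broken value, different?) rows."""
    rows = []
    for factor in sorted(set(working) | set(broken)):
        w = working.get(factor)
        b = broken.get(factor)
        rows.append((factor, w, b, w != b))
    return rows


# Made-up sample values for illustration only.
working = {"Input data": "10 rows", "Environment": "staging", "Configuration": "cache=on"}
broken = {"Input data": "10 rows", "Environment": "production", "Configuration": "cache=on"}

for factor, w, b, different in compare_cases(working, broken):
    print(f"{factor:15} {w:12} {b:12} {'YES' if different else 'no'}")
```

Only the rows marked as different are candidates for the hypothesis formed in Phase 3.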

STOP — HARD-GATE: Do NOT proceed to Phase 3 until:


  • You have identified at least one working case for comparison
  • You have compared working vs broken and identified differences
  • You have isolated the scope to the smallest reproducible case
  • Dependencies have been verified (versions, availability, config)


Phase 3: Hypothesis and Testing

Goal: Form ONE specific, testable hypothesis and verify it with the smallest possible change.

Actions

  1. Form ONE hypothesis. Based on evidence from Phases 1-2, what is the single most likely cause?
    • State it explicitly: "The bug occurs because [specific cause]"
    • If you cannot state it specifically, go back to Phase 1 or 2
  2. Design a minimal test. What is the smallest change to confirm or deny this hypothesis?
    • Prefer adding a test case over modifying production code
    • Prefer logging/assertions over code changes
    • Prefer reverting a change over writing new code
  3. Apply the change and test.
    • Make ONLY the change needed to test the hypothesis
    • Run the test suite
    • Observe the result
  4. Evaluate.
    • If CONFIRMED: proceed with the fix, write a regression test
    • If DENIED: record what you learned, form a new hypothesis, return to step 1
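
To make "prefer adding a test case over modifying production code" concrete, here is a small sketch using Python's `unittest`. The function `normalize_items` is a hypothetical stand-in for the code under investigation, and the hypothesis text is borrowed from the data-shape example used in this skill.

```python
import unittest


def normalize_items(payload):
    """Hypothetical function under investigation (stands in for production code)."""
    items = payload.get("items")
    return [str(item) for item in items]


class HypothesisTest(unittest.TestCase):
    def test_null_items_triggers_the_failure(self):
        # Hypothesis: "The bug occurs because the API returns null instead of an array."
        # If that is the cause, a payload with items=None must raise TypeError.
        with self.assertRaises(TypeError):
            normalize_items({"items": None})

    def test_array_items_still_work(self):
        # Control case: the expected data shape is handled correctly.
        self.assertEqual(normalize_items({"items": [1, 2]}), ["1", "2"])
```

Run it with `python -m unittest`. If the first test passes, the hypothesis is confirmed without touching production code; once the fix is in, the assertion can be inverted to serve as the regression test.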

Hypothesis Log Template


Hypothesis #1: [description]
Test: [what you did]
Result: CONFIRMED / DENIED
Learning: [what this taught you]

Hypothesis #2: ...

Decision Table: Hypothesis Testing Approach


| Hypothesis Type | Testing Method | Example |
|---|---|---|
| Recent code change caused it | git bisect or revert commit | "The bug was introduced in commit abc123" |
| Data shape mismatch | Add logging/assertion | "The API returns null instead of array" |
| Race condition | Add timing logs or serialize | "Request B completes before request A" |
| Configuration error | Compare configs across environments | "Production uses different DB host" |
| Dependency version issue | Lock to known-good version | "Library 2.0 changed the API surface" |
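
For the "recent code change" row, the idea behind bisecting can be sketched as a binary search over an ordered commit list. This is an illustration of the search strategy only, not how git implements `git bisect`; the commit IDs and the `is_bad` check are invented for the example.

```python
def first_bad_commit(commits, is_bad):
    """Find the first bad commit.

    Assumes commits are ordered oldest to newest, commits[0] is known good,
    and commits[-1] is known bad.
    """
    lo, hi = 0, len(commits) - 1  # lo is always good, hi is always bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]


# Hypothetical history in which the bug appears at "d4" and persists afterwards.
commits = ["a1", "b2", "c3", "d4", "e5", "f6"]


def is_bad(commit):
    return commits.index(commit) >= 3


print(first_bad_commit(commits, is_bad))  # d4
```

Each check halves the remaining range, so even a long history needs only a handful of test runs.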

STOP — HARD-GATE: Do NOT proceed to Phase 4 unless:


  • You have tested at least 3 hypotheses and ALL were denied
  • Each hypothesis was specific and testable
  • Each test was minimal (one change at a time)
  • You recorded learnings from each failed hypothesis

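
The hypothesis log and the Phase 4 gate can be kept together in a small data structure. The sketch below is one possible shape, with hypothetical class and method names; the threshold of 3 comes from the gate above.

```python
from dataclasses import dataclass, field


@dataclass
class Hypothesis:
    description: str
    test: str
    confirmed: bool
    learning: str


@dataclass
class HypothesisLog:
    entries: list = field(default_factory=list)

    def record(self, description, test, confirmed, learning):
        self.entries.append(Hypothesis(description, test, confirmed, learning))

    def ready_for_phase_4(self, threshold=3):
        """True when at least `threshold` hypotheses were tested and all were denied."""
        return len(self.entries) >= threshold and not any(
            entry.confirmed for entry in self.entries
        )


log = HypothesisLog()
log.record("Stale cache entry", "Cleared cache and retried", False, "Cache is not involved")
log.record("Wrong DB host", "Compared configs", False, "Configs are identical")
print(log.ready_for_phase_4())  # False: only two hypotheses tested so far
```

Requiring a `learning` value for every entry mirrors the gate: a denied hypothesis without a recorded lesson does not count as progress.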

Phase 4: Architecture Questioning

Goal: If 3+ hypotheses have failed, the problem may be structural. Step back and question assumptions.
This phase is triggered ONLY after Phase 3 has been attempted at least 3 times without success.

Actions

  1. Question your assumptions. What have you been assuming is true that might not be?
    • Is the data shaped the way you think it is?
    • Is the control flow what you expect?
    • Are the types what you think they are?
    • Is the API contract what you assumed?
  2. Question the design. Is the current approach fundamentally flawed?
    • Is there a race condition in the design?
    • Is there a state management problem?
    • Is there an incorrect abstraction?
    • Are responsibilities misplaced?
  3. Consider redesign. Sometimes the fix is not a patch but a restructuring.
    • Can you simplify the design to eliminate the bug class entirely?
    • Is there a pattern that handles this case better?
    • Should you replace rather than fix?
  4. Seek external input. If you are stuck:
    • Explain the problem to someone else (rubber duck debugging)
    • Search for known issues in dependencies
    • Check if others have encountered similar problems

STOP — HARD-GATE: Do NOT continue without:


  • Written list of assumptions that were questioned
  • Explicit decision: patch the current design OR redesign
  • If redesigning: a plan before implementing
  • If patching: a new hypothesis informed by the assumption review


Debugging Decision Flowchart


Error encountered
    |
    v
Can you reproduce it?
    |
    +-- NO --> Gather more information (logs, user reports, monitoring)
    |          Try different inputs, environments, timing
    |          Do NOT proceed until reproducible
    |
    +-- YES -> Read the FULL error message and stack trace
               |
               v
         Is the cause obvious from the error?
               |
               +-- YES -> Form hypothesis, test it (Phase 3)
               |          Still write a regression test
               |
               +-- NO --> Complete Phase 1 evidence gathering
                          |
                          v
                    Find working case for comparison (Phase 2)
                          |
                          v
                    Identify differences
                          |
                          v
                    Form and test hypotheses (Phase 3)
                          |
                          +-- Fixed --> Write regression test, verify
                          |
                          +-- 3+ failed hypotheses --> Phase 4

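
The flowchart can be read as a small decision function. The following is one possible encoding, with a hypothetical function name and return strings; checking the failed-hypothesis count before the "cause obvious" branch is an assumption of this sketch, not something the flowchart states.

```python
def next_step(reproducible, cause_obvious, failed_hypotheses=0):
    """Return the next debugging step for the current situation."""
    if not reproducible:
        return "Gather more information; do not proceed until reproducible"
    if failed_hypotheses >= 3:
        return "Phase 4: question assumptions and design"
    if cause_obvious:
        return "Phase 3: form and test a hypothesis, then write a regression test"
    return "Phase 1 evidence gathering, then Phase 2 comparison, then Phase 3"


print(next_step(reproducible=False, cause_obvious=False))
print(next_step(reproducible=True, cause_obvious=True))
print(next_step(reproducible=True, cause_obvious=False, failed_hypotheses=3))
```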

Red Flags Table


| Red Flag | What It Means | Action |
|---|---|---|
| Changing code without understanding the bug | Shotgun debugging | Go back to Phase 1 |
| Fix works but you do not know why | Accidental fix, likely to regress | Investigate until you understand |
| Same bug keeps coming back | Root cause not addressed | Go to Phase 4, question design |
| Fix causes new bugs elsewhere | Unexpected coupling | Map dependencies before proceeding |
| "It works on my machine" | Environment difference | Go to Phase 2, comparison matrix |
| Fix requires more than 20 lines | Might be a design issue | Go to Phase 4 |
| Debugging for 30+ minutes | Tunnel vision | Take a break, re-read evidence from Phase 1 |
| Reading the same code repeatedly | Missing something fundamental | Get a fresh perspective, explain aloud |
| Multiple causes seem equally likely | Insufficient investigation | Go back to Phase 1, gather more evidence |


Anti-Patterns / Common Mistakes


| Anti-Pattern | Why It Is Wrong | Correct Approach |
|---|---|---|
| Changing random things to see if bug goes away | Wastes time, introduces new bugs | Form a hypothesis first |
| Adding try/catch to suppress the error | Hides the real problem | Fix the root cause |
| Rewriting the feature from scratch | Nuclear option is rarely needed | Isolate and fix the specific issue |
| Blaming the framework/library without evidence | Usually your code is wrong | Prove the framework bug with minimal repro |
| Skipping the regression test after fixing | Bug will return | Write the test, always |
| Fixing symptoms instead of root causes | Patches accumulate, system degrades | Trace to the actual cause |
| Debugging for 45+ minutes without stepping back | Tunnel vision reduces effectiveness | Take a break, re-read Phase 1 evidence |
| Ignoring error messages or stack traces | The answer is often in the error | Read every line of the error |

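
The "suppress the error" anti-pattern is easiest to see side by side with an alternative that keeps the evidence visible. Both functions below are hypothetical; the second is a sketch of surfacing the real problem so its cause can be traced, not a universal fix.

```python
def load_timeout_suppressed(config):
    """Anti-pattern: any failure is swallowed and replaced by a default."""
    try:
        return int(config["timeout"])
    except Exception:
        return 30  # silently hides a missing key or a malformed value


def load_timeout(config):
    """Surface the problem with enough context to trace its cause."""
    if "timeout" not in config:
        raise KeyError("config is missing required key 'timeout'")
    raw = config["timeout"]
    try:
        return int(raw)
    except ValueError as exc:
        raise ValueError(f"timeout must be an integer, got {raw!r}") from exc


bad_config = {"timeout": "3O"}  # letter O instead of zero
print(load_timeout_suppressed(bad_config))  # 30, and the typo goes unnoticed
try:
    load_timeout(bad_config)
except ValueError as exc:
    print(exc)  # timeout must be an integer, got '3O'
```

With the second version the malformed value shows up in the error message, which points the investigation at whatever produced the config rather than at the code reading it.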

Integration Points


| Skill | Relationship |
|---|---|
| test-driven-development | Every bug fix MUST include a regression test (RED-GREEN cycle) |
| verification-before-completion | After fixing a bug, verify with fresh evidence |
| resilient-execution | When debugging during task execution, pause task, complete debugging, resume |
| code-review | Review the fix for completeness and side effects |
| self-learning | Record new debugging patterns in learned-patterns.md |
| acceptance-testing | Verify fix does not break acceptance criteria |


Quick Reference: What NOT To Do


  1. Do NOT change random things and see if the bug goes away
  2. Do NOT add try/catch to suppress the error
  3. Do NOT rewrite the feature from scratch as a first resort
  4. Do NOT blame the framework/library without evidence
  5. Do NOT skip writing a regression test after fixing
  6. Do NOT fix symptoms instead of root causes
  7. Do NOT debug for more than 45 minutes without stepping back
  8. Do NOT ignore error messages or stack traces


Skill Type


RIGID — The 4-phase process is mandatory and must be followed in order. Each phase has a HARD-GATE that must be satisfied before proceeding. Never change code without understanding why it is broken.