systematic-debugging

Systematic Debugging

Overview

Random fixes waste time and create new bugs. Quick patches mask underlying issues.

Core principle: ALWAYS find root cause before attempting fixes. Symptom fixes are failure.

Violating the letter of this process is violating the spirit of debugging.

The Iron Law

NO FIXES WITHOUT ROOT CAUSE INVESTIGATION FIRST

If you haven't completed Phase 1, you cannot propose fixes.

When to Use

Use for ANY technical issue:

Test failures
Bugs in production
Unexpected behavior
Performance problems
Build failures
Integration issues

Use this ESPECIALLY when:

Under time pressure (emergencies make guessing tempting)
"Just one quick fix" seems obvious
You've already tried multiple fixes
Previous fix didn't work
You don't fully understand the issue

Don't skip when:

Issue seems simple (simple bugs have root causes too)
You're in a hurry (rushing guarantees rework)
Someone wants it fixed NOW (systematic is faster than thrashing)

The Four Phases

You MUST complete each phase before proceeding to the next.

Phase 1: Root Cause Investigation

BEFORE attempting ANY fix:

1. Read Error Messages Carefully

Don't skip past errors or warnings
They often contain the exact solution
Read stack traces completely
Note line numbers, file paths, error codes

Action: Use `read_file` on the relevant source files. Use `search_files` to find the error string in the codebase.
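As a rough illustration of reading a stack trace completely, the sketch below pulls every frame (file path, line number, function) and the final error line out of a standard CPython traceback. The helper name `summarize_traceback` and the sample trace are hypothetical and exist only for this example.

```python
import re

# Matches the frame lines CPython prints in a standard traceback.
FRAME_RE = re.compile(r'File "(?P<path>[^"]+)", line (?P<line>\d+), in (?P<func>\S+)')


def summarize_traceback(text):
    """Return (frames, error): every (path, line, func) frame plus the last line."""
    frames = [(m["path"], int(m["line"]), m["func"]) for m in FRAME_RE.finditer(text)]
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    error = lines[-1].strip() if lines else ""
    return frames, error


# Hypothetical traceback used only to demonstrate the helper.
SAMPLE = '''Traceback (most recent call last):
  File "app/main.py", line 12, in handle
    total = compute(order)
  File "app/billing.py", line 40, in compute
    return order["price"] * order["qty"]
KeyError: 'qty'
'''

frames, error = summarize_traceback(SAMPLE)
print(frames[-1])  # ('app/billing.py', 40, 'compute')
print(error)       # KeyError: 'qty'
```

The innermost frame tells you where the error surfaced; the earlier frames are the starting points for tracing where the bad value came from.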

2. Reproduce Consistently

Can you trigger it reliably?
What are the exact steps?
Does it happen every time?
If not reproducible → gather more data, don't guess

Action: Use the `terminal` tool to run the failing test or trigger the bug:

```bash
# Run specific failing test
pytest tests/test_module.py::test_name -v

# Run with verbose output
pytest tests/test_module.py -v --tb=long
```
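To answer "does it happen every time?" with data rather than a guess, one option is to call the failing operation repeatedly and measure how often it fails. The sketch below assumes the bug can be triggered through a Python callable; `reproduction_rate` and `flaky_operation` are hypothetical names used only for illustration.

```python
import itertools


def reproduction_rate(trigger, attempts=20):
    """Call trigger() `attempts` times and return the fraction of calls that raise."""
    failures = 0
    for _ in range(attempts):
        try:
            trigger()
        except Exception:
            failures += 1
    return failures / attempts


# Hypothetical stand-in for an intermittent bug: fails on every third call.
_calls = itertools.count(1)


def flaky_operation():
    if next(_calls) % 3 == 0:
        raise RuntimeError("intermittent failure")


rate = reproduction_rate(flaky_operation, attempts=30)
print(f"failed {rate:.0%} of attempts")  # failed 33% of attempts
```

A rate of 100% means the bug is reliably reproducible. Anything lower suggests a dependency on timing, ordering, or state, which is a reason to gather more data before forming a hypothesis.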

3. Check Recent Changes

What changed that could cause this?
Git diff, recent commits
New dependencies, config changes

Action:

```bash
# Recent commits
git log --oneline -10

# Uncommitted changes
git diff

# Changes in specific file
git log -p --follow src/problematic_file.py | head -100
```

4. Gather Evidence in Multi-Component Systems

WHEN system has multiple components (API → service → database, CI → build → deploy):

BEFORE proposing fixes, add diagnostic instrumentation:

For EACH component boundary:

Log what data enters the component
Log what data exits the component
Verify environment/config propagation
Check state at each layer

Run once to gather evidence showing WHERE it breaks. THEN analyze evidence to identify the failing component. THEN investigate that specific component.
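One way to instrument component boundaries in Python is a small logging decorator, sketched below. The three layers (`api_handler`, `service_layer`, `database_lookup`) are hypothetical stand-ins for real components, and `log_boundary` is an illustrative helper, not part of any library.

```python
import functools
import logging

log = logging.getLogger("boundary")


def log_boundary(component):
    """Log what enters and exits a component, plus any exception it raises."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            log.info("%s IN  args=%r kwargs=%r", component, args, kwargs)
            try:
                result = func(*args, **kwargs)
            except Exception:
                log.exception("%s RAISED", component)
                raise
            log.info("%s OUT result=%r", component, result)
            return result
        return wrapper
    return decorator


# Hypothetical three-layer system: API -> service -> database.
@log_boundary("api")
def api_handler(payload):
    return service_layer(payload["user_id"])


@log_boundary("service")
def service_layer(user_id):
    return database_lookup(user_id)


@log_boundary("database")
def database_lookup(user_id):
    return {"id": user_id, "name": "example"}


logging.basicConfig(level=logging.INFO, format="%(message)s")
api_handler({"user_id": 7})
```

A single run prints an IN/OUT pair per layer. The first boundary where the logged data stops matching expectations is the component to investigate.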

5. Trace Data Flow

WHEN error is deep in the call stack:

Where does the bad value originate?
What called this function with the bad value?
Keep tracing upstream until you find the source
Fix at the source, not at the symptom

Action: Use `search_files` to trace references:

```python
# Find where the function is called
search_files("function_name(", path="src/", file_glob="*.py")

# Find where the variable is set
search_files("variable_name\s*=", path="src/", file_glob="*.py")
```

Phase 1 Completion Checklist

Phase 2: Pattern Analysis

Find the pattern before fixing:

1. Find Working Examples

Locate similar working code in the same codebase
What works that's similar to what's broken?

Action: Use `search_files` to find comparable patterns:

```python
search_files("similar_pattern", path="src/", file_glob="*.py")
```

2. Compare Against References

If implementing a pattern, read the reference implementation COMPLETELY
Don't skim — read every line
Understand the pattern fully before applying

3. Identify Differences

What's different between working and broken?
List every difference, however small
Don't assume "that can't matter"
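Listing every difference is easier to do mechanically than by eye. The sketch below uses the standard-library `difflib` module to diff a working text against a broken one; the two config snippets and the helper name `list_differences` are made up for illustration.

```python
import difflib


def list_differences(working, broken):
    """Return only the added/removed lines between two texts."""
    diff = list(difflib.unified_diff(
        working.splitlines(),
        broken.splitlines(),
        fromfile="working",
        tofile="broken",
        lineterm="",
        n=0,  # no context lines: show changes only
    ))
    # The first two entries are the '---' / '+++' file headers; skip them.
    return [ln for ln in diff[2:] if ln.startswith(("+", "-"))]


# Hypothetical config files: one from a working service, one from a broken one.
working_cfg = "timeout = 30\nretries = 3\nverify_ssl = true"
broken_cfg = "timeout = 30\nretries = 3\nverify_ssl = false\nproxy = http://localhost:8080"

for line in list_differences(working_cfg, broken_cfg):
    print(line)
```

Every line in the output is a candidate cause, including the ones that look harmless.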

4. Understand Dependencies

What other components does this need?
What settings, config, environment?
What assumptions does it make?

Phase 3: Hypothesis and Testing

Scientific method:

1. Form a Single Hypothesis

State clearly: "I think X is the root cause because Y"
Write it down
Be specific, not vague

2. Test Minimally

Make the SMALLEST possible change to test the hypothesis
One variable at a time
Don't fix multiple things at once
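"One variable at a time" can be sketched as follows, assuming the working and broken setups can each be described as a dictionary of settings and that a check function reports whether a given configuration works. All names here (`isolate_breaking_keys`, `still_works`, the sample settings) are hypothetical.

```python
def isolate_breaking_keys(working, broken, still_works):
    """Apply each differing setting to the working config on its own.

    Returns the keys whose change, applied alone, makes the check fail.
    """
    culprits = []
    for key in sorted(set(working) | set(broken)):
        if working.get(key) == broken.get(key):
            continue  # identical in both configs, so not a variable to test
        candidate = dict(working)  # start from the known-good state every time
        if key in broken:
            candidate[key] = broken[key]
        else:
            candidate.pop(key, None)
        if not still_works(candidate):
            culprits.append(key)
    return culprits


# Hypothetical settings and check: the system needs a timeout of at least 10.
working = {"timeout": 30, "retries": 3, "verify_ssl": True}
broken = {"timeout": 5, "retries": 3, "verify_ssl": False}

print(isolate_breaking_keys(working, broken, lambda cfg: cfg["timeout"] >= 10))
# ['timeout']
```

Because each candidate starts from the known-good configuration, a failure can be attributed to exactly one change. This only finds single-variable causes; if nothing is reported, the cause may be an interaction between changes and needs a new hypothesis.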

3. Verify Before Continuing

Did it work? → Phase 4
Didn't work? → Form NEW hypothesis
DON'T add more fixes on top

4. When You Don't Know

Say "I don't understand X"
Don't pretend to know
Ask the user for help
Research more

Phase 4: Implementation

Fix the root cause, not the symptom:

1. Create Failing Test Case

Simplest possible reproduction
Automated test if possible
MUST have before fixing
Use the `test-driven-development` skill

2. Implement Single Fix

Address the root cause identified
ONE change at a time
No "while I'm here" improvements
No bundled refactoring

3. Verify Fix

```bash
# Run the specific regression test
pytest tests/test_module.py::test_regression -v

# Run full suite to confirm no regressions
pytest tests/ -q
```

4. If Fix Doesn't Work — The Rule of Three

STOP.
Count: How many fixes have you tried?
If < 3: Return to Phase 1, re-analyze with new information
If ≥ 3: STOP and question the architecture (step 5 below)
DON'T attempt Fix #4 without architectural discussion
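The decision rule above is simple enough to write down as code. This is only a sketch of the rule as stated; the function name `next_step` and the returned wording are illustrative.

```python
def next_step(failed_fix_attempts):
    """Map the number of failed fixes to the action the Rule of Three prescribes."""
    if failed_fix_attempts < 0:
        raise ValueError("failed_fix_attempts cannot be negative")
    if failed_fix_attempts >= 3:
        return "STOP: question the architecture and discuss with the user"
    return "Return to Phase 1 and re-analyze with the new information"


for attempts in (1, 2, 3):
    print(attempts, "->", next_step(attempts))
```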

5. If 3+ Fixes Failed: Question Architecture

Pattern indicating an architectural problem:

Each fix reveals new shared state/coupling in a different place
Fixes require "massive refactoring" to implement
Each fix creates new symptoms elsewhere

STOP and question fundamentals:

Is this pattern fundamentally sound?
Are we "sticking with it through sheer inertia"?
Should we refactor the architecture vs. continue fixing symptoms?

Discuss with the user before attempting more fixes.

This is NOT a failed hypothesis — this is a wrong architecture.

Red Flags — STOP and Follow Process

If you catch yourself thinking:

"Quick fix for now, investigate later"
"Just try changing X and see if it works"
"Add multiple changes, run tests"
"Skip the test, I'll manually verify"
"It's probably X, let me fix that"
"I don't fully understand but this might work"
"Pattern says X but I'll adapt it differently"
"Here are the main problems: [lists fixes without investigation]"
Proposing solutions before tracing data flow
"One more fix attempt" (when already tried 2+)
Each fix reveals a new problem in a different place

ALL of these mean: STOP. Return to Phase 1.

If 3+ fixes failed: Question the architecture (Phase 4 step 5).

Common Rationalizations

| Excuse | Reality |
| --- | --- |
| "Issue is simple, don't need process" | Simple issues have root causes too. Process is fast for simple bugs. |
| "Emergency, no time for process" | Systematic debugging is FASTER than guess-and-check thrashing. |
| "Just try this first, then investigate" | First fix sets the pattern. Do it right from the start. |
| "I'll write test after confirming fix works" | Untested fixes don't stick. Test first proves it. |
| "Multiple fixes at once saves time" | Can't isolate what worked. Causes new bugs. |
| "Reference too long, I'll adapt the pattern" | Partial understanding guarantees bugs. Read it completely. |
| "I see the problem, let me fix it" | Seeing symptoms ≠ understanding root cause. |
| "One more fix attempt" (after 2+ failures) | 3+ failures = architectural problem. Question the pattern, don't fix again. |

Quick Reference

| Phase | Key Activities | Success Criteria |
| --- | --- | --- |
| 1. Root Cause | Read errors, reproduce, check changes, gather evidence, trace data flow | Understand WHAT and WHY |
| 2. Pattern | Find working examples, compare, identify differences | Know what's different |
| 3. Hypothesis | Form theory, test minimally, one variable at a time | Confirmed or new hypothesis |
| 4. Implementation | Create regression test, fix root cause, verify | Bug resolved, all tests pass |

Hermes Agent Integration

Investigation Tools

Use these Hermes tools during Phase 1:

`search_files`: Find error strings, trace function calls, locate patterns
`read_file`: Read source code with line numbers for precise analysis
`terminal`: Run tests, check git history, reproduce bugs
`web_search` / `web_extract`: Research error messages, library docs

With delegate_task

For complex multi-component debugging, dispatch investigation subagents:

```python
delegate_task(
    goal="Investigate why [specific test/behavior] fails",
    context="""
    Follow systematic-debugging skill:
    1. Read the error message carefully
    2. Reproduce the issue
    3. Trace the data flow to find root cause
    4. Report findings; do NOT fix yet

    Error: [paste full error]
    File: [path to failing code]
    Test command: [exact command]
    """,
    toolsets=['terminal', 'file']
)
```

With test-driven-development

When fixing bugs:

Write a test that reproduces the bug (RED)
Debug systematically to find root cause
Fix the root cause (GREEN)
The test proves the fix and prevents regression

Real-World Impact

From debugging sessions:

Systematic approach: 15-30 minutes to fix
Random fixes approach: 2-3 hours of thrashing
First-time fix rate: 95% vs 40%
New bugs introduced: Near zero vs common

No shortcuts. No guessing. Systematic always wins.
