systematic-debugging
Systematic Debugging
Overview
Random fixes waste time and create new bugs. Quick patches mask underlying issues.
Core principle: ALWAYS find root cause before attempting fixes. Symptom fixes are failure.
Violating the letter of this process is violating the spirit of debugging.
The Iron Law
NO FIXES WITHOUT ROOT CAUSE INVESTIGATION FIRST
If you haven't completed Phase 1, you cannot propose fixes.
When to Use
Use for ANY technical issue:
- Test failures
- Bugs in production
- Unexpected behavior
- Performance problems
- Build failures
- Integration issues
Use this ESPECIALLY when:
- Under time pressure (emergencies make guessing tempting)
- "Just one quick fix" seems obvious
- You've already tried multiple fixes
- Previous fix didn't work
- You don't fully understand the issue
Don't skip when:
- Issue seems simple (simple bugs have root causes too)
- You're in a hurry (rushing guarantees rework)
- Someone wants it fixed NOW (systematic is faster than thrashing)
The Four Phases
You MUST complete each phase before proceeding to the next.
Phase 1: Root Cause Investigation
BEFORE attempting ANY fix:
1. Read Error Messages Carefully
- Don't skip past errors or warnings
- They often contain the exact solution
- Read stack traces completely
- Note line numbers, file paths, error codes
Action: Use `read_file` on the relevant source files. Use `search_files` to find the error string in the codebase.
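As a minimal sketch of "read stack traces completely", the snippet below uses Python's standard `traceback` module to list the file path, line number, and function name of every frame in an exception. The `load_config` function and its missing `"port"` key are hypothetical, chosen only to produce a traceback.

```python
import traceback


def summarize_exception(exc: BaseException) -> list[tuple[str, int, str]]:
    """Return (file path, line number, function name) per frame, outermost first."""
    frames = traceback.extract_tb(exc.__traceback__)
    return [(frame.filename, frame.lineno, frame.name) for frame in frames]


def load_config(raw: dict) -> int:
    # Hypothetical failing code: the "port" key is missing from the input.
    return int(raw["port"])


try:
    load_config({})
except KeyError as exc:
    error_type = type(exc).__name__
    frames = summarize_exception(exc)
    print(f"{error_type}: {exc}")
    for path, line, func in frames:
        print(f"  {path}:{line} in {func}")
```

The last frame is where the error was raised; the frames before it show how execution got there, which is what later steps trace upstream.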
2. Reproduce Consistently
- Can you trigger it reliably?
- What are the exact steps?
- Does it happen every time?
- If not reproducible → gather more data, don't guess
Action: Use the `terminal` tool to run the failing test or trigger the bug:
```bash
# Run specific failing test
pytest tests/test_module.py::test_name -v

# Run with verbose output
pytest tests/test_module.py -v --tb=long
```
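To answer "does it happen every time?" with data rather than a guess, one option is to run the trigger repeatedly and measure how often it fails. This is a sketch only: `reproduction_rate` and the two example triggers are hypothetical names, not part of any tool mentioned here.

```python
import itertools
from typing import Callable


def reproduction_rate(trigger: Callable[[], None], runs: int = 20) -> float:
    """Run `trigger` repeatedly and return the fraction of runs that raised."""
    failures = 0
    for _ in range(runs):
        try:
            trigger()
        except Exception:
            failures += 1
    return failures / runs


def always_fails() -> None:
    # Hypothetical deterministic bug.
    raise ValueError("deterministic bug")


_calls = itertools.count(1)


def fails_every_third_call() -> None:
    # Hypothetical intermittent bug: only every third call raises.
    if next(_calls) % 3 == 0:
        raise TimeoutError("intermittent bug")
```

A rate of 1.0 means the bug is reliably reproducible. Anything lower suggests gathering more data (timing, ordering, environment) before forming a hypothesis.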
3. Check Recent Changes
- What changed that could cause this?
- Git diff, recent commits
- New dependencies, config changes
Action:
```bash
# Recent commits
git log --oneline -10

# Uncommitted changes
git diff

# Changes in specific file
git log -p --follow src/problematic_file.py | head -100
```
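When several recent commits are suspects, a binary search over the commit list narrows down the first bad one; this is the idea behind `git bisect`. The sketch below shows only the search logic. It assumes the oldest commit is known good, the newest is known bad, and every commit after the first bad one also fails. `first_bad_commit` and the `is_bad` callback are hypothetical.

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Binary-search an oldest-to-newest commit list for the first failing commit."""
    low, high = 0, len(commits) - 1  # commits[low] is good, commits[high] is bad
    while high - low > 1:
        mid = (low + high) // 2
        if is_bad(commits[mid]):
            high = mid  # the first bad commit is at mid or earlier
        else:
            low = mid  # the first bad commit is after mid
    return commits[high]
```

In practice `is_bad` would check out the commit and run the failing test; here it is left abstract so the search itself stays visible.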
4. Gather Evidence in Multi-Component Systems
WHEN system has multiple components (API → service → database, CI → build → deploy):
BEFORE proposing fixes, add diagnostic instrumentation:
For EACH component boundary:
- Log what data enters the component
- Log what data exits the component
- Verify environment/config propagation
- Check state at each layer
Run once to gather evidence showing WHERE it breaks.
THEN analyze evidence to identify the failing component.
THEN investigate that specific component.
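The boundary instrumentation described above can be sketched as a logging decorator. Everything here is illustrative: the `log_boundary` decorator and the three-layer pipeline (`handle_request`, `build_record`, `save`) are hypothetical stand-ins for real components.

```python
import functools
import logging
from typing import Any, Callable

logging.basicConfig(level=logging.DEBUG, format="%(message)s")
log = logging.getLogger("boundary")


def log_boundary(component: str) -> Callable:
    """Log what enters and exits a component so the failing layer is visible."""
    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            log.debug("[%s] in:  args=%r kwargs=%r", component, args, kwargs)
            result = func(*args, **kwargs)
            log.debug("[%s] out: %r", component, result)
            return result
        return wrapper
    return decorator


@log_boundary("database")
def save(record: dict) -> bool:
    return "id" in record


@log_boundary("service")
def build_record(payload: dict) -> dict:
    # Bug for illustration: the "id" field is dropped at this layer.
    return {"name": payload.get("name")}


@log_boundary("api")
def handle_request(payload: dict) -> bool:
    return save(build_record(payload))
```

Calling `handle_request({"id": 1, "name": "x"})` once shows `id` entering the service layer and missing from its output, which points the investigation at `build_record` rather than at the database.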
5. Trace Data Flow
WHEN error is deep in the call stack:
- Where does the bad value originate?
- What called this function with the bad value?
- Keep tracing upstream until you find the source
- Fix at the source, not at the symptom
Action: Use `search_files` to trace references:
```python
# Find where the function is called
search_files("function_name(", path="src/", file_glob="*.py")

# Find where the variable is set
search_files("variable_name\s*=", path="src/", file_glob="*.py")
```
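Static search can be complemented at runtime by recording the call stack at the point where the bad value is detected. The sketch below uses the standard `traceback.extract_stack`; the `apply_coupon`, `checkout`, and `charge` functions are hypothetical and exist only to create a call chain.

```python
import traceback


def caller_chain() -> list[str]:
    """Return the function names on the current call stack, outermost first."""
    stack = traceback.extract_stack()
    # Drop the last entry, which is caller_chain itself.
    return [frame.name for frame in stack[:-1]]


observed: list[str] = []


def charge(amount: float) -> None:
    if amount < 0:
        # Symptom site: record who passed the bad value instead of patching it here.
        observed.extend(caller_chain())


def checkout(cart_total: float, discount: float) -> None:
    charge(cart_total - discount)


def apply_coupon() -> None:
    # Source of the bad value in this example: the discount exceeds the total.
    checkout(cart_total=10.0, discount=25.0)


apply_coupon()
print(" -> ".join(observed[-3:]))
```

The negative amount surfaces in `charge`, but the chain shows it originates in `apply_coupon`, which is where the fix belongs.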
Phase 1 Completion Checklist
- Error messages fully read and understood
- Issue reproduced consistently
- Recent changes identified and reviewed
- Evidence gathered (logs, state, data flow)
- Problem isolated to specific component/code
- Root cause hypothesis formed
STOP: Do not proceed to Phase 2 until you understand WHY it's happening.
Phase 2: Pattern Analysis
Find the pattern before fixing:
1. Find Working Examples
- Locate similar working code in the same codebase
- What works that's similar to what's broken?
Action: Use `search_files` to find comparable patterns:
```python
search_files("similar_pattern", path="src/", file_glob="*.py")
```
2. Compare Against References
- If implementing a pattern, read the reference implementation COMPLETELY
- Don't skim — read every line
- Understand the pattern fully before applying
3. Identify Differences
- What's different between working and broken?
- List every difference, however small
- Don't assume "that can't matter"
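Listing every difference is easier to do mechanically than by eye. As a sketch, the standard `difflib.ndiff` can compare a working and a broken version line by line; the two config snippets and the `list_differences` helper are hypothetical.

```python
import difflib

# Hypothetical configs: the broken one differs only by a trailing slash.
working_config = """timeout = 30
retries = 3
base_url = https://api.example.com/v1"""

broken_config = """timeout = 30
retries = 3
base_url = https://api.example.com/v1/"""


def list_differences(working: str, broken: str) -> list[str]:
    """Return only the removed ('- ') and added ('+ ') lines between two texts."""
    diff = difflib.ndiff(working.splitlines(), broken.splitlines())
    return [line for line in diff if line.startswith(("- ", "+ "))]


differences = list_differences(working_config, broken_config)
for line in differences:
    print(line)
```

A one-character difference like the trailing slash here is exactly the kind of detail that is easy to dismiss as "that can't matter".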
4. Understand Dependencies
- What other components does this need?
- What settings, config, environment?
- What assumptions does it make?
Phase 3: Hypothesis and Testing
Scientific method:
1. Form a Single Hypothesis
- State clearly: "I think X is the root cause because Y"
- Write it down
- Be specific, not vague
2. Test Minimally
- Make the SMALLEST possible change to test the hypothesis
- One variable at a time
- Don't fix multiple things at once
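"One variable at a time" can be sketched as follows: start from a known-working configuration, apply each difference from the broken one on its own, and record which single change makes the check fail. The `isolate_failing_keys` helper and the example configs are hypothetical.

```python
from typing import Any, Callable, Mapping

_MISSING = object()  # marks a key that is absent from one of the configs


def isolate_failing_keys(
    working: Mapping[str, Any],
    broken: Mapping[str, Any],
    fails: Callable[[dict], bool],
) -> list[str]:
    """Apply each difference to the working config alone and report which ones fail."""
    culprits = []
    for key in sorted(set(working) | set(broken)):
        new_value = broken.get(key, _MISSING)
        if working.get(key, _MISSING) == new_value:
            continue  # not a difference, nothing to test
        candidate = dict(working)
        if new_value is _MISSING:
            del candidate[key]
        else:
            candidate[key] = new_value
        if fails(candidate):
            culprits.append(key)
    return culprits


working = {"pool_size": 10, "timeout": 30, "tls": True}
broken = {"pool_size": 0, "timeout": 5, "tls": True}
culprits = isolate_failing_keys(working, broken, lambda cfg: cfg["pool_size"] == 0)
```

Here two settings changed, but only `pool_size` reproduces the failure by itself, so the hypothesis can name one cause instead of two.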
3. Verify Before Continuing
- Did it work? → Phase 4
- Didn't work? → Form NEW hypothesis
- DON'T add more fixes on top
4. When You Don't Know
- Say "I don't understand X"
- Don't pretend to know
- Ask the user for help
- Research more
Phase 4: Implementation
Fix the root cause, not the symptom:
1. Create Failing Test Case
- Simplest possible reproduction
- Automated test if possible
- MUST have before fixing
- Use the `test-driven-development` skill
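A failing test case can be as small as one assertion that reproduces the report. In the sketch below, `parse_duration` is a hypothetical function whose bug was returning minutes unconverted; the test fails against that buggy version and passes once the root cause is fixed, as shown.

```python
def parse_duration(text: str) -> int:
    """Hypothetical function under repair: convert '90s' or '2m' to seconds."""
    value, unit = int(text[:-1]), text[-1]
    if unit == "s":
        return value
    if unit == "m":
        return value * 60  # the hypothetical bug returned `value` here
    raise ValueError(f"unknown unit: {unit!r}")


def test_minutes_are_converted_to_seconds() -> None:
    # Simplest reproduction of the reported bug: "2m" came back as 2.
    assert parse_duration("2m") == 120
```

Because the function name starts with `test_`, pytest would collect it from a test file, so it can be run with the same `pytest path::test_name -v` form used elsewhere in this document.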
2. Implement Single Fix
- Address the root cause identified
- ONE change at a time
- No "while I'm here" improvements
- No bundled refactoring
3. Verify Fix
```bash
# Run the specific regression test
pytest tests/test_module.py::test_regression -v

# Run full suite (no regressions)
pytest tests/ -q
```
4. If Fix Doesn't Work: The Rule of Three
- STOP.
- Count: How many fixes have you tried?
- If < 3: Return to Phase 1, re-analyze with new information
- If ≥ 3: STOP and question the architecture (step 5 below)
- DON'T attempt Fix #4 without architectural discussion
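The decision rule above is simple enough to write down explicitly, which makes it harder to talk oneself into "one more fix". This is a sketch; `next_step` and `MAX_FIX_ATTEMPTS` are hypothetical names.

```python
MAX_FIX_ATTEMPTS = 3


def next_step(failed_fixes: int) -> str:
    """Map the number of failed fix attempts to the action the Rule of Three requires."""
    if failed_fixes < 0:
        raise ValueError("failed_fixes cannot be negative")
    if failed_fixes < MAX_FIX_ATTEMPTS:
        return "Return to Phase 1 and re-analyze with the new information"
    return "STOP: question the architecture and discuss before any further fix"
```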
5. If 3+ Fixes Failed: Question Architecture
Pattern indicating an architectural problem:
- Each fix reveals new shared state/coupling in a different place
- Fixes require "massive refactoring" to implement
- Each fix creates new symptoms elsewhere
STOP and question fundamentals:
- Is this pattern fundamentally sound?
- Are we "sticking with it through sheer inertia"?
- Should we refactor the architecture vs. continue fixing symptoms?
Discuss with the user before attempting more fixes.
This is NOT a failed hypothesis — this is a wrong architecture.
Red Flags: STOP and Follow Process
If you catch yourself thinking:
- "Quick fix for now, investigate later"
- "Just try changing X and see if it works"
- "Add multiple changes, run tests"
- "Skip the test, I'll manually verify"
- "It's probably X, let me fix that"
- "I don't fully understand but this might work"
- "Pattern says X but I'll adapt it differently"
- "Here are the main problems: [lists fixes without investigation]"
- Proposing solutions before tracing data flow
- "One more fix attempt" (when already tried 2+)
- Each fix reveals a new problem in a different place
ALL of these mean: STOP. Return to Phase 1.
If 3+ fixes failed: Question the architecture (Phase 4 step 5).
Common Rationalizations
| Excuse | Reality |
|---|---|
| "Issue is simple, don't need process" | Simple issues have root causes too. Process is fast for simple bugs. |
| "Emergency, no time for process" | Systematic debugging is FASTER than guess-and-check thrashing. |
| "Just try this first, then investigate" | First fix sets the pattern. Do it right from the start. |
| "I'll write test after confirming fix works" | Untested fixes don't stick. Test first proves it. |
| "Multiple fixes at once saves time" | Can't isolate what worked. Causes new bugs. |
| "Reference too long, I'll adapt the pattern" | Partial understanding guarantees bugs. Read it completely. |
| "I see the problem, let me fix it" | Seeing symptoms ≠ understanding root cause. |
| "One more fix attempt" (after 2+ failures) | 3+ failures = architectural problem. Question the pattern, don't fix again. |
Quick Reference
| Phase | Key Activities | Success Criteria |
|---|---|---|
| 1. Root Cause | Read errors, reproduce, check changes, gather evidence, trace data flow | Understand WHAT and WHY |
| 2. Pattern | Find working examples, compare, identify differences | Know what's different |
| 3. Hypothesis | Form theory, test minimally, one variable at a time | Confirmed or new hypothesis |
| 4. Implementation | Create regression test, fix root cause, verify | Bug resolved, all tests pass |
Hermes Agent Integration
Investigation Tools
Use these Hermes tools during Phase 1:
- `search_files`: Find error strings, trace function calls, locate patterns
- `read_file`: Read source code with line numbers for precise analysis
- `terminal`: Run tests, check git history, reproduce bugs
- `web_search` / `web_extract`: Research error messages, library docs
With delegate_task
For complex multi-component debugging, dispatch investigation subagents:
```python
delegate_task(
    goal="Investigate why [specific test/behavior] fails",
    context="""
    Follow systematic-debugging skill:
    1. Read the error message carefully
    2. Reproduce the issue
    3. Trace the data flow to find root cause
    4. Report findings; do NOT fix yet

    Error: [paste full error]
    File: [path to failing code]
    Test command: [exact command]
    """,
    toolsets=['terminal', 'file']
)
```
With test-driven-development
When fixing bugs:
- Write a test that reproduces the bug (RED)
- Debug systematically to find root cause
- Fix the root cause (GREEN)
- The test proves the fix and prevents regression
Real-World Impact
From debugging sessions:
- Systematic approach: 15-30 minutes to fix
- Random fixes approach: 2-3 hours of thrashing
- First-time fix rate: 95% (systematic) vs 40% (random fixes)
- New bugs introduced: near zero (systematic) vs common (random fixes)
No shortcuts. No guessing. Systematic always wins.