systematic-debugging

Systematic Debugging

Overview

Random fixes waste time and create new bugs. Quick patches mask underlying issues.
Core principle: ALWAYS find root cause before attempting fixes. Symptom fixes are failure.
Violating the letter of this process is violating the spirit of debugging.

The Iron Law

NO FIXES WITHOUT ROOT CAUSE INVESTIGATION FIRST
If you haven't completed Phase 1, you cannot propose fixes.

When to Use

Use for ANY technical issue:
  • Test failures
  • Bugs in production
  • Unexpected behavior
  • Performance problems
  • Build failures
  • Integration issues
Use this ESPECIALLY when:
  • Under time pressure (emergencies make guessing tempting)
  • "Just one quick fix" seems obvious
  • You've already tried multiple fixes
  • Previous fix didn't work
  • You don't fully understand the issue
Don't skip when:
  • Issue seems simple (simple bugs have root causes too)
  • You're in a hurry (rushing guarantees rework)
  • Someone wants it fixed NOW (systematic is faster than thrashing)

The Four Phases

You MUST complete each phase before proceeding to the next.

Phase 1: Root Cause Investigation

BEFORE attempting ANY fix:

1. Read Error Messages Carefully

  • Don't skip past errors or warnings
  • They often contain the exact solution
  • Read stack traces completely
  • Note line numbers, file paths, error codes
Action: Use read_file on the relevant source files. Use search_files to find the error string in the codebase.
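One way to act on "note line numbers, file paths, error codes" is to pull the frames out of a traceback mechanically. The sketch below is a minimal illustration for standard CPython tracebacks; the helper name extract_frames and the sample traceback are assumptions made up for this example, not part of the skill.

```python
import re

# Matches the standard CPython frame line: File "path", line N, in func
FRAME_RE = re.compile(r'File "(?P<path>[^"]+)", line (?P<line>\d+), in (?P<func>\S+)')


def extract_frames(traceback_text: str) -> list[tuple[str, int, str]]:
    """Return (path, line, function) for every frame, outermost first."""
    return [
        (m.group("path"), int(m.group("line")), m.group("func"))
        for m in FRAME_RE.finditer(traceback_text)
    ]


# Hypothetical traceback used only to exercise the helper.
SAMPLE = '''Traceback (most recent call last):
  File "app/main.py", line 12, in handle
    total = compute(order)
  File "app/pricing.py", line 40, in compute
    return order["qty"] * order["price"]
KeyError: 'price'
'''

frames = extract_frames(SAMPLE)
innermost = frames[-1]  # the frame where the exception was raised
print(innermost)
```

The innermost frame is where the error surfaced, which is the file to open first with read_file. The outer frames show how execution got there.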

2. Reproduce Consistently

  • Can you trigger it reliably?
  • What are the exact steps?
  • Does it happen every time?
  • If not reproducible → gather more data, don't guess
Action: Use the terminal tool to run the failing test or trigger the bug:
bash
# Run specific failing test
pytest tests/test_module.py::test_name -v

# Run with verbose output
pytest tests/test_module.py -v --tb=long
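To answer "does it happen every time?" with data rather than an impression, the failing command can be run several times and the failures counted. This is a minimal sketch; count_failures is a hypothetical helper, and the always-failing command is a stand-in for the real test command.

```python
import subprocess
import sys


def count_failures(cmd: list[str], runs: int = 5) -> int:
    """Run cmd `runs` times and return how many runs exited non-zero."""
    failures = 0
    for _ in range(runs):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            failures += 1
    return failures


# Stand-in for the real failing command, for example:
# ["pytest", "tests/test_module.py::test_name", "-q"]
always_fails = [sys.executable, "-c", "import sys; sys.exit(1)"]
failed = count_failures(always_fails, runs=3)
print(f"{failed}/3 runs failed")
```

A count between zero and the number of runs suggests an intermittent failure, which is a signal to gather more data (timing, ordering, shared state) before forming a hypothesis.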

3. Check Recent Changes

  • What changed that could cause this?
  • Git diff, recent commits
  • New dependencies, config changes
Action:
bash
# Recent commits
git log --oneline -10

# Uncommitted changes
git diff

# Changes in specific file
git log -p --follow src/problematic_file.py | head -100
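When the list of recent commits is long, the first bad commit can be found by binary search, which is the idea that git bisect automates. The sketch below shows only the search logic under stated assumptions: first_bad is a hypothetical helper, the commit names are made up, and in practice is_bad would check out the commit and run the failing test.

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Binary-search commits (oldest to newest) for the first bad one.

    Assumes the oldest commit is good, the newest is bad, and that once a
    commit is bad every later commit is bad too.
    """
    lo, hi = 0, len(commits) - 1  # commits[lo] is good, commits[hi] is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]


# Made-up history; the predicate pretends the bug arrived in "d4".
history = ["a1", "b2", "c3", "d4", "e5", "f6"]
culprit = first_bad(history, is_bad=lambda c: c >= "d4")
print(culprit)
```

Finding the commit is still investigation, not a fix: the diff of that commit is the evidence to read next.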

4. Gather Evidence in Multi-Component Systems

WHEN system has multiple components (API → service → database, CI → build → deploy):
BEFORE proposing fixes, add diagnostic instrumentation:
For EACH component boundary:
  • Log what data enters the component
  • Log what data exits the component
  • Verify environment/config propagation
  • Check state at each layer
Run once to gather evidence showing WHERE it breaks. THEN analyze evidence to identify the failing component. THEN investigate that specific component.
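The boundary logging described above could be sketched in Python as a decorator that records what enters and exits each component. The decorator name log_boundary and the two toy components are assumptions for illustration; a real system would log at its own API, service, and database boundaries.

```python
import functools
import logging

logging.basicConfig(level=logging.DEBUG, format="%(message)s")
log = logging.getLogger("boundary")


def log_boundary(component: str):
    """Log what enters and exits a component, including exceptions."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            log.debug("[%s] IN  args=%r kwargs=%r", component, args, kwargs)
            try:
                result = func(*args, **kwargs)
            except Exception as exc:
                log.debug("[%s] ERR %r", component, exc)
                raise
            log.debug("[%s] OUT %r", component, result)
            return result
        return wrapper
    return decorator


# Hypothetical two-layer system: an API handler calling a service.
@log_boundary("api")
def handle_request(payload: dict) -> dict:
    return save_order(payload["order"])


@log_boundary("service")
def save_order(order: dict) -> dict:
    return {"id": 1, **order}


response = handle_request({"order": {"sku": "A-1", "qty": 2}})
```

A single run then shows the last boundary where the data was still correct and the first where it was not, which narrows the investigation to one component.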

5. Trace Data Flow

WHEN error is deep in the call stack:
  • Where does the bad value originate?
  • What called this function with the bad value?
  • Keep tracing upstream until you find the source
  • Fix at the source, not at the symptom
Action: Use search_files to trace references:
python
# Find where the function is called
search_files("function_name(", path="src/", file_glob="*.py")

# Find where the variable is set
search_files("variable_name\s*=", path="src/", file_glob="*.py")
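Alongside searching the source, the running program can report who passed the bad value. The following is a minimal sketch using the standard traceback module; the helper callers and the three toy functions are hypothetical and exist only to show the idea.

```python
import traceback


def callers(limit: int = 5) -> list[str]:
    """Return the names of the innermost `limit` functions on the stack, outermost first."""
    stack = traceback.extract_stack()[:-1]  # drop the frame for callers() itself
    return [frame.name for frame in stack[-limit:]]


def load_config() -> dict:
    return {"timeout": None}  # the bad value originates here


def build_client() -> float:
    cfg = load_config()
    return connect(cfg["timeout"])  # ...and is passed along here


def connect(timeout) -> float:
    if timeout is None:  # the symptom shows up here
        chain = " -> ".join(callers(3))
        raise ValueError(f"timeout is None; call chain: {chain}")
    return float(timeout)


try:
    build_client()
except ValueError as exc:
    message = str(exc)
    print(message)
```

Note that the chain names build_client but not load_config, because load_config has already returned by the time connect runs. The call chain says where to look next; reading build_client is what leads to the source of the value.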

Phase 1 Completion Checklist

  • Error messages fully read and understood
  • Issue reproduced consistently
  • Recent changes identified and reviewed
  • Evidence gathered (logs, state, data flow)
  • Problem isolated to specific component/code
  • Root cause hypothesis formed
STOP: Do not proceed to Phase 2 until you understand WHY it's happening.

Phase 2: Pattern Analysis

Find the pattern before fixing:

1. Find Working Examples

  • Locate similar working code in the same codebase
  • What works that's similar to what's broken?
Action: Use search_files to find comparable patterns:
python
search_files("similar_pattern", path="src/", file_glob="*.py")

2. Compare Against References

  • If implementing a pattern, read the reference implementation COMPLETELY
  • Don't skim — read every line
  • Understand the pattern fully before applying

3. Identify Differences

  • What's different between working and broken?
  • List every difference, however small
  • Don't assume "that can't matter"
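Listing every difference is easier to do mechanically than by eye. As a minimal sketch, assuming the working and broken cases can each be described as a flat dict of settings, the hypothetical helper diff_settings below reports every key whose value differs, including keys present on only one side.

```python
def diff_settings(working: dict, broken: dict) -> dict:
    """Return {key: (working_value, broken_value)} for every key that differs."""
    missing = object()  # sentinel so a key absent on one side still counts
    differences = {}
    for key in sorted(working.keys() | broken.keys()):
        w, b = working.get(key, missing), broken.get(key, missing)
        if w != b:
            differences[key] = (
                "<absent>" if w is missing else w,
                "<absent>" if b is missing else b,
            )
    return differences


# Made-up settings for a working and a broken deployment.
working = {"timeout": 30, "retries": 3, "tls": True}
broken = {"timeout": 30, "retries": 0, "region": "eu"}
for key, (w, b) in diff_settings(working, broken).items():
    print(f"{key}: working={w!r} broken={b!r}")
```

Every reported key is a candidate, so none should be dismissed before it has been checked against the hypothesis.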

4. Understand Dependencies

  • What other components does this need?
  • What settings, config, environment?
  • What assumptions does it make?

Phase 3: Hypothesis and Testing

Scientific method:

1. Form a Single Hypothesis

  • State clearly: "I think X is the root cause because Y"
  • Write it down
  • Be specific, not vague

2. Test Minimally

  • Make the SMALLEST possible change to test the hypothesis
  • One variable at a time
  • Don't fix multiple things at once

3. Verify Before Continuing

  • Did it work? → Phase 4
  • Didn't work? → Form NEW hypothesis
  • DON'T add more fixes on top

4. When You Don't Know

  • Say "I don't understand X"
  • Don't pretend to know
  • Ask the user for help
  • Research more

Phase 4: Implementation

Fix the root cause, not the symptom:

1. Create Failing Test Case

  • Simplest possible reproduction
  • Automated test if possible
  • MUST have before fixing
  • Use the test-driven-development skill
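A failing test case can be as small as one assertion that reproduces the reported input. The sketch below uses a made-up bug (an average function that raised ZeroDivisionError on an empty list) and shows the code after the hypothetical fix together with the regression test that was written first; the function and test names are illustrative only.

```python
def average(values: list[float]) -> float:
    """Mean of values; the hypothetical fix returns 0.0 for an empty list."""
    if not values:
        return 0.0
    return sum(values) / len(values)


def test_average_of_empty_list_is_zero():
    # Before the fix this raised ZeroDivisionError, which was the reported bug.
    assert average([]) == 0.0


def test_average_still_works_for_normal_input():
    # Guards against the fix changing behavior for ordinary input.
    assert average([2.0, 4.0]) == 3.0
```

Written before the fix, the first test fails for the same reason the bug report describes, which confirms the reproduction. After the fix it passes and stays in the suite as a regression guard.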

2. Implement Single Fix

  • Address the root cause identified
  • ONE change at a time
  • No "while I'm here" improvements
  • No bundled refactoring

3. Verify Fix

bash
# Run the specific regression test
pytest tests/test_module.py::test_regression -v

# Run full suite to confirm no regressions
pytest tests/ -q

4. If Fix Doesn't Work — The Rule of Three

  • STOP.
  • Count: How many fixes have you tried?
  • If < 3: Return to Phase 1, re-analyze with new information
  • If ≥ 3: STOP and question the architecture (step 5 below)
  • DON'T attempt Fix #4 without architectural discussion
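The counting rule above is simple enough to write down as code, which can help when the attempts are spread over a long session. This is a minimal sketch; the FixLog class and its return strings are assumptions for illustration, not a Hermes API.

```python
from dataclasses import dataclass, field


@dataclass
class FixLog:
    """Track failed fix attempts and decide the next step."""
    max_attempts: int = 3
    failed: list[str] = field(default_factory=list)

    def record_failure(self, description: str) -> str:
        """Record one failed fix and return what the process says to do next."""
        self.failed.append(description)
        if len(self.failed) < self.max_attempts:
            return "return to Phase 1"
        return "stop and discuss architecture"


fix_log = FixLog()
steps = [fix_log.record_failure(f"attempt {n}") for n in (1, 2, 3)]
print(steps)
```

Keeping the descriptions of the failed attempts also gives the architectural discussion something concrete to start from.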

5. If 3+ Fixes Failed: Question Architecture

Pattern indicating an architectural problem:
  • Each fix reveals new shared state/coupling in a different place
  • Fixes require "massive refactoring" to implement
  • Each fix creates new symptoms elsewhere
STOP and question fundamentals:
  • Is this pattern fundamentally sound?
  • Are we "sticking with it through sheer inertia"?
  • Should we refactor the architecture vs. continue fixing symptoms?
Discuss with the user before attempting more fixes.
This is NOT a failed hypothesis — this is a wrong architecture.

Red Flags — STOP and Follow Process

If you catch yourself thinking:
  • "Quick fix for now, investigate later"
  • "Just try changing X and see if it works"
  • "Add multiple changes, run tests"
  • "Skip the test, I'll manually verify"
  • "It's probably X, let me fix that"
  • "I don't fully understand but this might work"
  • "Pattern says X but I'll adapt it differently"
  • "Here are the main problems: [lists fixes without investigation]"
  • Proposing solutions before tracing data flow
  • "One more fix attempt" (when already tried 2+)
  • Each fix reveals a new problem in a different place
ALL of these mean: STOP. Return to Phase 1.
If 3+ fixes failed: Question the architecture (Phase 4 step 5).

Common Rationalizations

| Excuse | Reality |
| --- | --- |
| "Issue is simple, don't need process" | Simple issues have root causes too. Process is fast for simple bugs. |
| "Emergency, no time for process" | Systematic debugging is FASTER than guess-and-check thrashing. |
| "Just try this first, then investigate" | First fix sets the pattern. Do it right from the start. |
| "I'll write test after confirming fix works" | Untested fixes don't stick. Test first proves it. |
| "Multiple fixes at once saves time" | Can't isolate what worked. Causes new bugs. |
| "Reference too long, I'll adapt the pattern" | Partial understanding guarantees bugs. Read it completely. |
| "I see the problem, let me fix it" | Seeing symptoms ≠ understanding root cause. |
| "One more fix attempt" (after 2+ failures) | 3+ failures = architectural problem. Question the pattern, don't fix again. |

Quick Reference

| Phase | Key Activities | Success Criteria |
| --- | --- | --- |
| 1. Root Cause | Read errors, reproduce, check changes, gather evidence, trace data flow | Understand WHAT and WHY |
| 2. Pattern | Find working examples, compare, identify differences | Know what's different |
| 3. Hypothesis | Form theory, test minimally, one variable at a time | Confirmed or new hypothesis |
| 4. Implementation | Create regression test, fix root cause, verify | Bug resolved, all tests pass |

Hermes Agent Integration

Investigation Tools

Use these Hermes tools during Phase 1:
  • search_files: Find error strings, trace function calls, locate patterns
  • read_file: Read source code with line numbers for precise analysis
  • terminal: Run tests, check git history, reproduce bugs
  • web_search / web_extract: Research error messages, library docs

With delegate_task

For complex multi-component debugging, dispatch investigation subagents:
python
delegate_task(
    goal="Investigate why [specific test/behavior] fails",
    context="""
    Follow systematic-debugging skill:
    1. Read the error message carefully
    2. Reproduce the issue
    3. Trace the data flow to find root cause
    4. Report findings — do NOT fix yet

    Error: [paste full error]
    File: [path to failing code]
    Test command: [exact command]
    """,
    toolsets=['terminal', 'file']
)

With test-driven-development

When fixing bugs:
  1. Write a test that reproduces the bug (RED)
  2. Debug systematically to find root cause
  3. Fix the root cause (GREEN)
  4. The test proves the fix and prevents regression

Real-World Impact

From debugging sessions:
  • Systematic approach: 15-30 minutes to fix
  • Random fixes approach: 2-3 hours of thrashing
  • First-time fix rate: 95% vs 40%
  • New bugs introduced: Near zero vs common
No shortcuts. No guessing. Systematic always wins.