systematic-debugging
Systematic Debugging
Random fixes waste time and create new bugs. Quick patches mask underlying issues.
Core principle: ALWAYS find root cause before attempting fixes. Symptom fixes are failure.
The Iron Law
NO FIXES WITHOUT ROOT CAUSE INVESTIGATION FIRST
If you haven't completed Phase 1, you cannot propose fixes.
The Four Phases
Phase 1: Root Cause Investigation
BEFORE attempting ANY fix:
- Read Error Messages Carefully
  - Don't skip past errors or warnings
  - Read stack traces completely
  - Note line numbers, file paths, error codes
- Reproduce Consistently
  - Can you trigger it reliably?
  - What are the exact steps?
  - If not reproducible, gather more data; don't guess
- Check Recent Changes
  - Git diff, recent commits
  - New dependencies, config changes
  - Environmental differences
- Gather Evidence in Multi-Component Systems
  When the system has multiple components (CI -> build -> signing, API -> service -> database), for EACH component boundary:
  - Log what data enters the component
  - Log what data exits the component
  - Verify environment/config propagation

  Run once to gather evidence showing WHERE it breaks, THEN analyze to identify the failing component.
- Trace Data Flow
  See references/root-cause-tracing.md for the backward tracing technique. Quick version: Where does the bad value originate? Keep tracing up until you find the source. Fix at the source, not the symptom.
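Boundary logging for a multi-component system can be sketched as follows. This is a minimal illustration, not a real pipeline: the `log_boundary` decorator and the `build`, `sign`, and `upload` stages are hypothetical names, and `sign` contains a deliberate bug so there is something to find.

```python
import functools
import logging

logging.basicConfig(level=logging.DEBUG, format="%(message)s")
log = logging.getLogger("boundary")

def log_boundary(func):
    """Log what enters and what exits a component (hypothetical helper)."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        log.debug("ENTER %s args=%r kwargs=%r", func.__name__, args, kwargs)
        result = func(*args, **kwargs)
        log.debug("EXIT  %s result=%r", func.__name__, result)
        return result
    return wrapper

# Hypothetical pipeline stages. `sign` deliberately drops the key id.
@log_boundary
def build(source):
    return {"artifact": source + ".bin", "key_id": "K1"}

@log_boundary
def sign(bundle):
    return {"artifact": bundle["artifact"], "key_id": None}  # bug: key lost here

@log_boundary
def upload(bundle):
    if bundle["key_id"] is None:
        raise ValueError("unsigned artifact")
    return "ok"

def run_pipeline(source):
    return upload(sign(build(source)))
```

Calling `run_pipeline("app")` raises inside `upload`, which is only the symptom. The single run's log shows `key_id` entering `sign` intact and leaving it as `None`, so `sign` is the component to investigate.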
Phase 2: Pattern Analysis
- Find Working Examples - Locate similar working code in same codebase
- Compare Against References - Read reference implementations COMPLETELY, don't skim
- Identify Differences - List every difference between working and broken
- Understand Dependencies - What settings, config, environment assumptions?
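Listing every difference between a working and a broken setup can be mechanical. The sketch below assumes both setups can be flattened into plain dictionaries; `diff_settings` and the sample keys are hypothetical and only illustrate the comparison step.

```python
def diff_settings(working, broken):
    """Return {key: (working_value, broken_value)} for every key that differs."""
    missing = object()  # sentinel: tells an absent key apart from a None value
    diffs = {}
    for key in sorted(set(working) | set(broken)):
        w = working.get(key, missing)
        b = broken.get(key, missing)
        if w != b:
            diffs[key] = (
                "<absent>" if w is missing else w,
                "<absent>" if b is missing else b,
            )
    return diffs

# Hypothetical settings captured from a working and a broken environment.
working = {"timeout": 30, "retries": 3, "region": "us-east-1"}
broken = {"timeout": 30, "retries": 0, "tls": True}

for key, (w, b) in diff_settings(working, broken).items():
    print(f"{key}: working={w!r} broken={b!r}")
```

Every reported key is a candidate for the Phase 3 hypothesis; keys that match can be set aside.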
Phase 3: Hypothesis and Testing
- Form Single Hypothesis - "I think X is the root cause because Y"
- Test Minimally - SMALLEST possible change, one variable at a time
- Verify Before Continuing - Worked? Phase 4. Didn't? NEW hypothesis, don't stack fixes
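Testing one variable at a time can be sketched as a loop that always starts from a fresh copy of the baseline, so changes never stack. `isolate_changes`, the configuration keys, and the `bug_reproduces` predicate are hypothetical; in practice the predicate would run your actual reproduction steps.

```python
def isolate_changes(baseline, candidates, bug_reproduces):
    """Apply each candidate change alone and report which ones remove the bug."""
    fixes = []
    for name, value in candidates.items():
        trial = dict(baseline)  # fresh copy: earlier trials never leak in
        trial[name] = value     # exactly one variable differs from baseline
        if not bug_reproduces(trial):
            fixes.append(name)
    return fixes

# Hypothetical scenario: the bug appears only while caching is enabled.
baseline = {"cache": True, "workers": 4, "debug": False}
candidates = {"cache": False, "workers": 1, "debug": True}

def bug_reproduces(config):
    return config["cache"]

print(isolate_changes(baseline, candidates, bug_reproduces))
```

A change that removes the bug supports the hypothesis but does not yet explain it; confirming why it works still belongs to Phase 1.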
Phase 4: Implementation
- Create Failing Test Case - Simplest reproduction, automated if possible
- Implement Single Fix - ONE change, no "while I'm here" improvements
- Verify Fix - Test passes? No regressions?
- If Fix Doesn't Work:
  - Count: How many fixes have you tried?
  - If < 3: Return to Phase 1, re-analyze
  - If >= 3: STOP and question the architecture
- If 3+ Fixes Failed: Question Architecture
  Patterns indicating an architectural problem:
  - Each fix reveals new shared state/coupling
  - Fixes require "massive refactoring"
  - Each fix creates new symptoms elsewhere

  STOP. Discuss with user before attempting more fixes.
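The first three steps of Phase 4 might look like this with the standard `unittest` module. `normalize_email` and its whitespace bug are hypothetical; the assumption is that the test was written first, seen to fail against the old code, and then made to pass by the single `strip()` change.

```python
import unittest

def normalize_email(raw):
    """Fixed version. The hypothetical original lacked the strip() call."""
    return raw.strip().lower()

class NormalizeEmailRegression(unittest.TestCase):
    def test_surrounding_whitespace_is_removed(self):
        # Simplest automated reproduction of the reported failure.
        self.assertEqual(
            normalize_email("  User@Example.com \n"),
            "user@example.com",
        )

def run_regression():
    """Run the regression test and return True if it passes."""
    loader = unittest.defaultTestLoader
    suite = loader.loadTestsFromTestCase(NormalizeEmailRegression)
    return unittest.TextTestRunner(verbosity=0).run(suite).wasSuccessful()
```

Keeping the test in the suite after the fix is what guards against the same bug returning.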
Red Flags - STOP and Follow Process
If you catch yourself thinking:
- "Quick fix for now, investigate later"
- "Just try changing X and see"
- "Add multiple changes, run tests"
- "I'm confident it's X, let me fix that"
- "One more fix attempt" (when already tried 2+)
- Proposing solutions before tracing data flow
ALL of these mean: STOP. Return to Phase 1.
Supporting Techniques
Defense-in-Depth
When you fix a bug, validate at EVERY layer:
| Layer | Purpose | Example |
|---|---|---|
| Entry Point | Reject invalid input at API boundary | |
| Business Logic | Ensure data makes sense for operation | Validate before processing |
| Environment Guards | Prevent dangerous ops in specific contexts | Refuse git init outside tmpdir in tests |
| Debug Instrumentation | Capture context for forensics | Log with stack trace before dangerous ops |
Single validation feels sufficient, but different code paths bypass it. Make bugs structurally impossible.
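The four layers in the table can be sketched in one call chain. Everything here is illustrative: the function names and the `APP_ENV` variable are assumptions, and the final step returns the `git init` command instead of executing it.

```python
import logging
import os
import tempfile
import traceback

log = logging.getLogger("defense")

def create_repo_request(path):
    """Layer 1, entry point: reject invalid input at the boundary."""
    if not isinstance(path, str) or not path.strip():
        raise ValueError("path must be a non-empty string")
    return init_repo(path)

def init_repo(path):
    """Layer 2, business logic: the data must make sense for the operation."""
    resolved = os.path.realpath(path)
    if not os.path.isdir(resolved):
        raise NotADirectoryError(resolved)
    return guarded_git_init(resolved)

def guarded_git_init(resolved):
    """Layers 3 and 4: environment guard plus debug instrumentation."""
    tmp_root = os.path.realpath(tempfile.gettempdir())
    in_tests = os.environ.get("APP_ENV") == "test"  # hypothetical variable
    inside_tmp = os.path.commonpath([resolved, tmp_root]) == tmp_root
    if in_tests and not inside_tmp:
        raise RuntimeError(f"refusing git init outside tmpdir: {resolved}")
    # Capture who asked for the dangerous operation, for later forensics.
    log.debug("git init in %s\n%s", resolved, "".join(traceback.format_stack()))
    return ["git", "init", resolved]  # returned, not executed, in this sketch
```

Each layer repeats a check that an earlier layer may already have made. That redundancy is intentional: a caller that reaches `init_repo` or `guarded_git_init` directly still hits a guard.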
Condition-Based Waiting
Flaky tests guess at timing. Wait for actual conditions instead:
```python
# Fragment: runs inside an async test; get_result() is the code under test.

# BAD: Guessing at timing
await asyncio.sleep(0.05)
result = get_result()

# GOOD: Wait for condition
await wait_for(lambda: get_result() is not None)
result = get_result()
```
Pattern:
```python
import asyncio
import time

async def wait_for(condition, timeout_ms=5000):
    """Poll `condition` until it is truthy or `timeout_ms` has elapsed."""
    start = time.monotonic()  # monotonic: unaffected by system clock changes
    while True:
        if condition():
            return
        if (time.monotonic() - start) * 1000 > timeout_ms:
            raise TimeoutError("Condition not met")
        await asyncio.sleep(0.01)  # Poll every 10ms
```
Common Rationalizations
| Excuse | Reality |
|---|---|
| "Issue is simple, don't need process" | Simple issues have root causes too. Process is fast for simple bugs. |
| "Emergency, no time for process" | Systematic debugging is FASTER than guess-and-check thrashing. |
| "Just try this first, then investigate" | First fix sets the pattern. Do it right from the start. |
| "I see the problem, let me fix it" | Seeing symptoms != understanding root cause. |
| "One more fix attempt" (after 2+ failures) | 3+ failures = architectural problem. Question pattern, don't fix again. |
Verification
Run:
python scripts/verify.py
References
- references/root-cause-tracing.md - Trace bugs backward through call stack