systematic-debugging
Systematic Debugging
Overview
Random fixes waste time and create new bugs. Quick patches mask underlying
issues.
Core principle: ALWAYS find root cause before attempting fixes. Symptom
fixes are failure.
Violating the letter of this process is violating the spirit of debugging.
The Iron Law
NO FIXES WITHOUT ROOT CAUSE INVESTIGATION FIRST
If you haven't completed Phase 1, you cannot propose fixes.
When to Use
Use for ANY technical issue:
- Test failures
- Bugs in production
- Unexpected behavior
- Performance problems
- Build failures
- Integration issues
Use this ESPECIALLY when:
- Under time pressure (emergencies make guessing tempting)
- "Just one quick fix" seems obvious
- You've already tried multiple fixes
- Previous fix didn't work
- You don't fully understand the issue
Don't skip when:
- Issue seems simple (simple bugs have root causes too)
- You're in a hurry (rushing guarantees rework)
- Manager wants it fixed NOW (systematic is faster than thrashing)
The Four Phases
You MUST complete each phase before proceeding to the next.
Phase 1: Root Cause Investigation
BEFORE attempting ANY fix:
1. Read Error Messages Carefully
   - Don't skip past errors or warnings
   - They often contain the exact solution
   - Read stack traces completely
   - Note line numbers, file paths, error codes

2. Reproduce Consistently
   - Can you trigger it reliably?
   - What are the exact steps?
   - Does it happen every time?
   - If not reproducible → gather more data, don't guess

3. Check Recent Changes
   - What changed that could cause this?
   - Git diff, recent commits
   - New dependencies, config changes
   - Environmental differences

4. Gather Evidence in Multi-Component Systems

   WHEN system has multiple components (CI → build → signing, API → service → database):

   BEFORE proposing fixes, add diagnostic instrumentation:

   For EACH component boundary:
   - Log what data enters component
   - Log what data exits component
   - Verify environment/config propagation
   - Check state at each layer

   Run once to gather evidence showing WHERE it breaks
   THEN analyze evidence to identify failing component
   THEN investigate that specific component

   Example (multi-layer system):

   ```bash
   # Layer 1: Workflow
   echo "=== Secrets available in workflow: ==="
   # Prints SET or UNSET without echoing the value itself
   if [ -n "${IDENTITY:-}" ]; then echo "IDENTITY: SET"; else echo "IDENTITY: UNSET"; fi

   # Layer 2: Build script
   echo "=== Env vars in build script: ==="
   env | grep IDENTITY || echo "IDENTITY not in environment"

   # Layer 3: Signing script
   echo "=== Keychain state: ==="
   security list-keychains
   security find-identity -v

   # Layer 4: Actual signing
   codesign --sign "$IDENTITY" --verbose=4 "$APP"
   ```

   This reveals: Which layer fails (secrets → workflow ✓, workflow → build ✗)

5. Trace Data Flow

   WHEN error is deep in call stack:

   Quick version:
   - Where does bad value originate?
   - What called this with bad value?
   - Keep tracing up until you find the source
   - Fix at source, not at symptom
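
The boundary logging described under "Gather Evidence in Multi-Component Systems" can also be done inside application code. The following is a minimal sketch under stated assumptions, not a prescribed implementation: `traceBoundary`, `BoundaryRecord`, and the three pipeline stages are hypothetical names invented for illustration. Each stage is wrapped so that what enters and exits it is recorded, and the first record carrying an error shows where the chain breaks.

```typescript
// Hypothetical helper: records what enters and exits each component boundary.
type BoundaryRecord = {
  component: string
  input: unknown
  output?: unknown
  error?: string
}

const records: BoundaryRecord[] = []

function traceBoundary<I, O>(component: string, fn: (input: I) => O): (input: I) => O {
  return (input: I): O => {
    const record: BoundaryRecord = { component, input }
    records.push(record)
    try {
      const output = fn(input)
      record.output = output
      return output
    } catch (err) {
      record.error = err instanceof Error ? err.message : String(err)
      throw err
    }
  }
}

// Illustrative three-layer pipeline (every stage is made up for this sketch).
type Env = Record<string, string | undefined>

const workflow = traceBoundary('workflow', (env: Env) => ({ identity: env.IDENTITY }))
const build = traceBoundary('build', (cfg: { identity: string | undefined }) => {
  if (!cfg.identity) throw new Error('IDENTITY not propagated to build')
  return { identity: cfg.identity, artifact: 'app.zip' }
})
const sign = traceBoundary(
  'sign',
  (b: { identity: string; artifact: string }) => `${b.artifact} signed by ${b.identity}`,
)

function run(env: Env): string | undefined {
  try {
    return sign(build(workflow(env)))
  } catch {
    return undefined
  }
}

run({}) // IDENTITY is deliberately missing
const failing = records.find((r) => r.error !== undefined)
console.log('first failing component:', failing?.component)
```

In real use the recorded inputs may contain secrets, so a redaction step before logging would be needed; that is left out here to keep the sketch short.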
Phase 2: Pattern Analysis
Find the pattern before fixing:

1. Find Working Examples
   - Locate similar working code in same codebase
   - What works that's similar to what's broken?

2. Compare Against References
   - If implementing pattern, read reference implementation COMPLETELY
   - Don't skim - read every line
   - Understand the pattern fully before applying

3. Identify Differences
   - What's different between working and broken?
   - List every difference, however small
   - Don't assume "that can't matter"

4. Understand Dependencies
   - What other components does this need?
   - What settings, config, environment?
   - What assumptions does it make?
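
"List every difference, however small" is easier to honor when the comparison is mechanical. The sketch below assumes the working and broken cases can each be summarized as a flat settings object; `diffSettings`, the `Settings` type, and the sample values are all hypothetical and exist only for illustration.

```typescript
// Hypothetical helper: list every key whose value differs between a working
// and a broken configuration, including keys present on only one side.
type Settings = Record<string, string | number | boolean | undefined>

function diffSettings(working: Settings, broken: Settings): string[] {
  const keys = new Set([...Object.keys(working), ...Object.keys(broken)])
  const differences: string[] = []
  for (const key of Array.from(keys).sort()) {
    if (working[key] !== broken[key]) {
      differences.push(`${key}: working=${String(working[key])} broken=${String(broken[key])}`)
    }
  }
  return differences
}

// Made-up example values, for illustration only.
const workingService: Settings = { timeoutMs: 5000, retries: 3, region: 'us-east-1' }
const brokenService: Settings = { timeoutMs: 5000, retries: 0, tls: true }

const differences = diffSettings(workingService, brokenService)
differences.forEach((line) => console.log(line))
```

The output lists all three differences (`region`, `retries`, `tls`) rather than only the one that looks suspicious, which is the point of the step: none of them is ruled out by assumption.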
Phase 3: Hypothesis and Testing
Scientific method:

1. Form Single Hypothesis
   - State clearly: "I think X is the root cause because Y"
   - Write it down
   - Be specific, not vague

2. Test Minimally
   - Make the SMALLEST possible change to test hypothesis
   - One variable at a time
   - Don't fix multiple things at once

3. Verify Before Continuing
   - Did it work? Yes → Phase 4
   - Didn't work? Form NEW hypothesis
   - DON'T add more fixes on top

4. When You Don't Know
   - Say "I don't understand X"
   - Don't pretend to know
   - Ask for help
   - Research more
Phase 4: Implementation
Fix the root cause, not the symptom:

1. Create Failing Test Case
   - Simplest possible reproduction
   - Automated test if possible
   - One-off test script if no framework
   - MUST have before fixing

2. Implement Single Fix
   - Address the root cause identified
   - ONE change at a time
   - No "while I'm here" improvements
   - No bundled refactoring

3. Verify Fix
   - Test passes now?
   - No other tests broken?
   - Issue actually resolved?

4. If Fix Doesn't Work
   - STOP
   - Count: How many fixes have you tried?
   - If < 3: Return to Phase 1, re-analyze with new information
   - If ≥ 3: STOP and question the architecture (step 5 below)
   - DON'T attempt Fix #4 without architectural discussion

5. If 3+ Fixes Failed: Question Architecture

   Pattern indicating architectural problem:
   - Each fix reveals new shared state/coupling/problem in different place
   - Fixes require "massive refactoring" to implement
   - Each fix creates new symptoms elsewhere

   STOP and question fundamentals:
   - Is this pattern fundamentally sound?
   - Are we "sticking with it through sheer inertia"?
   - Should we refactor architecture vs. continue fixing symptoms?

   Discuss with the user before attempting more fixes.

   This is NOT a failed hypothesis - this is a wrong architecture.
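
The "one-off test script if no framework" option from the first step of this phase might look like the sketch below. It is an illustration under assumptions, not the author's code: `normalizeDirBefore` and `normalizeDirAfter` are hypothetical stand-ins for the code under investigation before and after a fix, and `rejectsEmptyDir` is the reproduction. The reproduction should report failure against the unfixed code and success against the fixed code.

```typescript
// Hypothetical one-off reproduction script for when no test framework exists.

// Before the fix: an empty string silently passes through as a valid directory.
function normalizeDirBefore(dir: string): string {
  return dir.trim()
}

// After the fix: the root cause (empty input accepted) is rejected at the source.
function normalizeDirAfter(dir: string): string {
  const trimmed = dir.trim()
  if (trimmed === '') throw new Error('directory cannot be empty')
  return trimmed
}

// The reproduction: true when the implementation rejects empty input.
function rejectsEmptyDir(normalize: (dir: string) => string): boolean {
  try {
    normalize('')
    return false
  } catch {
    return true
  }
}

console.log('before fix passes:', rejectsEmptyDir(normalizeDirBefore)) // false
console.log('after fix passes:', rejectsEmptyDir(normalizeDirAfter)) // true
```

Seeing the check fail first is what shows it exercises the bug; a check that has only ever passed proves nothing about the fix.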
Red Flags - STOP and Follow Process
If you catch yourself thinking:
- "Quick fix for now, investigate later"
- "Just try changing X and see if it works"
- "Add multiple changes, run tests"
- "Skip the test, I'll manually verify"
- "It's probably X, let me fix that"
- "I don't fully understand but this might work"
- "Pattern says X but I'll adapt it differently"
- "Here are the main problems: [lists fixes without investigation]"
- Proposing solutions before tracing data flow
- "One more fix attempt" (when already tried 2+)
- Each fix reveals new problem in different place
ALL of these mean: STOP. Return to Phase 1.
If 3+ fixes failed: Question the architecture (see Phase 4, step 5)
Common Rationalizations
| Excuse | Reality |
|---|---|
| "Issue is simple, don't need process" | Simple issues have root causes too. Process is fast for simple bugs. |
| "Emergency, no time for process" | Systematic debugging is FASTER than guess-and-check thrashing. |
| "Just try this first, then investigate" | First fix sets the pattern. Do it right from the start. |
| "I'll write test after confirming fix works" | Untested fixes don't stick. Test first proves it. |
| "Multiple fixes at once saves time" | Can't isolate what worked. Causes new bugs. |
| "Reference too long, I'll adapt the pattern" | Partial understanding guarantees bugs. Read it completely. |
| "I see the problem, let me fix it" | Seeing symptoms ≠ understanding root cause. |
| "One more fix attempt" (after 2+ failures) | 3+ failures = architectural problem. Question pattern, don't fix again. |
Quick Reference
| Phase | Key Activities | Success Criteria |
|---|---|---|
| 1. Root Cause | Read errors, reproduce, check changes, gather evidence | Understand WHAT and WHY |
| 2. Pattern | Find working examples, compare | Identify differences |
| 3. Hypothesis | Form theory, test minimally | Confirmed or new hypothesis |
| 4. Implementation | Create test, fix, verify | Bug resolved, tests pass |
Technique: Root Cause Tracing
When bugs manifest deep in the call stack, trace backward to find the original
trigger.
The Tracing Process

1. Observe the Symptom

   ```
   Error: git init failed in /Users/jesse/project/packages/core
   ```

2. Find Immediate Cause - What code directly causes this?

   ```typescript
   await execFileAsync('git', ['init'], { cwd: projectDir })
   ```

3. Ask: What Called This?

   ```
   WorktreeManager.createSessionWorktree(projectDir, sessionId)
     → called by Session.initializeWorkspace()
     → called by Session.create()
     → called by test at Project.create()
   ```

4. Keep Tracing Up - What value was passed?
   - `projectDir = ''` (empty string!)
   - Empty string as `cwd` resolves to `process.cwd()`

5. Find Original Trigger - Where did empty string come from?

   ```typescript
   const context = setupCoreTest() // Returns { tempDir: '' }
   Project.create('name', context.tempDir) // Accessed before beforeEach!
   ```
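
The way an empty string quietly turns into the current working directory can be seen in isolation. The sketch below uses `path.resolve` from Node's standard library, which ignores zero-length segments and falls back to the current working directory. It demonstrates the analogous behavior only; it is not the `execFile` call from the trace, and the variable names are illustrative.

```typescript
import { resolve } from 'node:path'

// Zero-length path segments are ignored, so resolving an empty string
// yields the current working directory rather than an error.
const projectDir = ''
const effective = resolve(projectDir)

console.log(effective === process.cwd()) // true
```

This is why the symptom names a real, plausible-looking directory: nothing fails at the point where the empty value is introduced.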
Adding Stack Traces
When you can't trace manually, add instrumentation:

```typescript
import { execFile } from 'node:child_process'
import { promisify } from 'node:util'

const execFileAsync = promisify(execFile)

async function gitInit(directory: string) {
  // Capture the call chain before the risky operation runs
  const stack = new Error().stack
  console.error('DEBUG git init:', {
    directory,
    cwd: process.cwd(),
    nodeEnv: process.env.NODE_ENV,
    stack,
  })
  await execFileAsync('git', ['init'], { cwd: directory })
}
```

Tips:
- Use `console.error()` in tests (logger may be suppressed)
- Log before the dangerous operation, not after it fails
- Include context: directory, cwd, environment variables
- `new Error().stack` shows complete call chain
Finding Which Test Causes Pollution
If something appears during tests but you don't know which test causes it, run the test files one at a time and stop at the first polluter:

```bash
# Run tests one-by-one, stop at first polluter
shopt -s globstar  # bash 4+: lets ** match nested directories
for f in src/**/*.test.ts; do
  npm test "$f"
  # Check for the stray artifact whether or not the test itself passed
  [ -d .git ] && echo "POLLUTER: $f" && break
done
```

**NEVER fix just where the error appears.** Trace back to find the original trigger.

Technique: Defense-in-Depth Validation
After finding root cause, validate at EVERY layer data passes through. Make the
bug structurally impossible.
Why Multiple Layers
- Single validation: "We fixed the bug"
- Multiple layers: "We made the bug impossible"
Different layers catch different cases:
- Entry validation catches most bugs
- Business logic catches edge cases
- Environment guards prevent context-specific dangers
- Debug logging helps when other layers fail
The Four Layers
Layer 1: Entry Point Validation - Reject invalid input at API boundary

```typescript
import { existsSync } from 'node:fs'

function createProject(name: string, workingDirectory: string) {
  if (!workingDirectory || workingDirectory.trim() === '') {
    throw new Error('workingDirectory cannot be empty')
  }
  if (!existsSync(workingDirectory)) {
    throw new Error(`workingDirectory does not exist: ${workingDirectory}`)
  }
}
```

Layer 2: Business Logic Validation - Ensure data makes sense for operation

```typescript
function initializeWorkspace(projectDir: string, sessionId: string) {
  if (!projectDir) {
    throw new Error('projectDir required for workspace initialization')
  }
}
```

Layer 3: Environment Guards - Prevent dangerous operations in specific contexts

```typescript
import { tmpdir } from 'node:os'
import { isAbsolute, relative, resolve, sep } from 'node:path'

async function gitInit(directory: string) {
  if (process.env.NODE_ENV === 'test') {
    // Compare path segments, not string prefixes: '/tmpfoo' starts with '/tmp' but is not inside it
    const rel = relative(resolve(tmpdir()), resolve(directory))
    if (rel === '..' || rel.startsWith('..' + sep) || isAbsolute(rel)) {
      throw new Error(`Refusing git init outside temp dir during tests`)
    }
  }
}
```

Layer 4: Debug Instrumentation - Capture context for forensics

```typescript
async function gitInit(directory: string) {
  // `logger` is the project's own logger and is not defined in this snippet
  logger.debug('About to git init', {
    directory,
    cwd: process.cwd(),
    stack: new Error().stack,
  })
}
```
Applying Defense-in-Depth
When you find a bug:
- Trace the data flow - Where does bad value originate? Where used?
- Map all checkpoints - List every point data passes through
- Add validation at each layer - Entry, business, environment, debug
- Test each layer - Try to bypass layer 1, verify layer 2 catches it
Don't stop at one validation point. Add checks at every layer.
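
The "test each layer" step can be sketched as follows. This is an illustration under assumptions: the two functions are simplified, hypothetical versions of the Layer 1 and Layer 2 checks (they return a string instead of doing real work), and `errorOf` is a helper invented for this example. Calling the inner function directly bypasses the entry check, which shows whether the second layer holds on its own.

```typescript
// Simplified, hypothetical versions of the first two layers.
function createProject(name: string, workingDirectory: string): string {
  // Layer 1: entry point validation
  if (!workingDirectory || workingDirectory.trim() === '') {
    throw new Error('workingDirectory cannot be empty')
  }
  return initializeWorkspace(workingDirectory, `${name}-session`)
}

function initializeWorkspace(projectDir: string, sessionId: string): string {
  // Layer 2: business logic validation
  if (!projectDir) {
    throw new Error('projectDir required for workspace initialization')
  }
  return `${projectDir}/${sessionId}`
}

// Returns the message a call throws, or undefined if it succeeds.
function errorOf(fn: () => unknown): string | undefined {
  try {
    fn()
    return undefined
  } catch (err) {
    return err instanceof Error ? err.message : String(err)
  }
}

// Normal path: layer 1 rejects the bad value.
const viaEntry = errorOf(() => createProject('demo', ''))
// Bypass layer 1 by calling the inner function directly: layer 2 must still catch it.
const viaBypass = errorOf(() => initializeWorkspace('', 'demo-session'))

console.log(viaEntry) // workingDirectory cannot be empty
console.log(viaBypass) // projectDir required for workspace initialization
```

If `viaBypass` came back `undefined`, the second layer would be relying on the first, and the bug would reappear the moment a new caller skipped the entry point.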
Real-World Impact
From debugging sessions:
- Systematic approach: 15-30 minutes to fix
- Random fixes approach: 2-3 hours of thrashing
- First-time fix rate: 95% vs 40%
- New bugs introduced: Near zero vs common