root-cause-tracing

Root Cause Tracing

Overview

Bugs often manifest deep in the call stack (git init in wrong directory, file created in wrong location, database opened with wrong path). Your instinct is to fix where the error appears, but that's treating a symptom.
Core principle: Trace backward through the call chain until you find the original trigger, then fix at the source.

When to Use

dot
digraph when_to_use {
    "Bug appears deep in stack?" [shape=diamond];
    "Can trace backwards?" [shape=diamond];
    "Fix at symptom point" [shape=box];
    "Trace to original trigger" [shape=box];
    "BETTER: Also add defense-in-depth" [shape=box];

    "Bug appears deep in stack?" -> "Can trace backwards?" [label="yes"];
    "Can trace backwards?" -> "Trace to original trigger" [label="yes"];
    "Can trace backwards?" -> "Fix at symptom point" [label="no - dead end"];
    "Trace to original trigger" -> "BETTER: Also add defense-in-depth";
}
Use when:
  • Error happens deep in execution (not at entry point)
  • Stack trace shows long call chain
  • Unclear where invalid data originated
  • Need to find which test/code triggers the problem

The Tracing Process

1. Observe the Symptom

Error: git init failed in /Users/jesse/project/packages/core

2. Find Immediate Cause

What code directly causes this?
typescript
await execFileAsync('git', ['init'], { cwd: projectDir });

3. Ask: What Called This?

typescript
WorktreeManager.createSessionWorktree(projectDir, sessionId)
  → called by Session.initializeWorkspace()
  → called by Session.create()
  → called by test at Project.create()

4. Keep Tracing Up

What value was passed?
  • projectDir = '' (empty string!)
  • Empty string as cwd resolves to process.cwd()
  • That's the source code directory!

5. Find Original Trigger

Where did empty string come from?
typescript
const context = setupCoreTest(); // Returns { tempDir: '' }
Project.create('name', context.tempDir); // Accessed before beforeEach!
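The ordering problem can be reproduced without a test framework. The sketch below is a hypothetical stand-in for setupCoreTest(): it assumes the real helper fills in tempDir from a beforeEach hook, and the hook registry and the temp path are invented for illustration.

```typescript
// Hypothetical stand-in for the setupCoreTest() helper described above.
// Assumption: the real helper populates tempDir inside a beforeEach hook.
interface CoreTestContext {
  tempDir: string;
}

type Hook = () => void;
const beforeEachHooks: Hook[] = []; // minimal imitation of a test runner's hook list

function setupCoreTest(): CoreTestContext {
  const context: CoreTestContext = { tempDir: '' }; // empty until a hook runs
  beforeEachHooks.push(() => {
    context.tempDir = '/tmp/core-test-1234'; // illustrative path only
  });
  return context;
}

// Module-level code runs when the test file is loaded, before any hook fires.
const context = setupCoreTest();
const capturedTooEarly = context.tempDir; // still '' at this point (the bug)

// A test runner invokes the hooks later, just before each test body.
beforeEachHooks.forEach((hook) => hook());
const readInsideTest = context.tempDir; // populated once the hook has run
```

Reading the value inside the test body, after the hook has run, avoids capturing the empty placeholder.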

Adding Stack Traces

When you can't trace manually, add instrumentation:
typescript
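import { execFile } from 'node:child_process';
import { promisify } from 'node:util';

// Assumption: execFileAsync is the promisified form of Node's execFile
const execFileAsync = promisify(execFile);

// Log context and the call stack before the problematic operation
async function gitInit(directory: string) {
  const stack = new Error().stack;
  console.error('DEBUG git init:', {
    directory,
    cwd: process.cwd(),
    nodeEnv: process.env.NODE_ENV,
    stack,
  });

  await execFileAsync('git', ['init'], { cwd: directory });
}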
Critical: Use console.error() in tests (not logger - may not show)
Run and capture:
bash
npm test 2>&1 | grep 'DEBUG git init'
Analyze stack traces:
  • Look for test file names
  • Find the line number triggering the call
  • Identify the pattern (same test? same parameter?)
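When the captured output contains many traces, the test-file frames can be picked out mechanically. The helper below is a minimal sketch, not part of the original tooling: findTestFrames is a hypothetical name, and it assumes V8-style frames ("at fn (file:line:column)") and test files named *.test.ts or *.test.js with no spaces in their paths.

```typescript
// Extracts the frames of a V8-style stack trace that point into test files.
// Assumption: test files end in .test.ts or .test.js and their paths contain no spaces.
interface StackFrame {
  file: string;
  line: number;
  column: number;
}

function findTestFrames(stack: string): StackFrame[] {
  const frames: StackFrame[] = [];
  // Matches "path/to/file.test.ts:12:34" at the end of a frame, with or without parentheses.
  const pattern = /\(?([^\s()]+\.test\.[jt]s):(\d+):(\d+)\)?$/;
  for (const rawLine of stack.split('\n')) {
    const match = pattern.exec(rawLine.trim());
    if (match) {
      frames.push({
        file: match[1],
        line: Number(match[2]),
        column: Number(match[3]),
      });
    }
  }
  return frames;
}
```

Grouping the returned frames by file and line is one way to see whether the same test keeps triggering the call.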

Finding Which Test Causes Pollution

If an unwanted artifact (such as a stray .git directory) appears during a test run but you don't know which test creates it:
Use the bisection script: @find-polluter.sh
bash
./find-polluter.sh '.git' 'src/**/*.test.ts'
Runs tests one-by-one, stops at first polluter. See script for usage.
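The one-by-one search described above can be sketched in TypeScript. This is a hypothetical illustration, not the contents of find-polluter.sh: findPolluter, runTest, and pollutionExists are invented names, and the runner and the pollution check are injected so the search logic does not depend on any particular test command.

```typescript
// Hypothetical sketch of a one-by-one polluter search; not the actual script.
// runTest runs a single test file; pollutionExists reports whether the
// unwanted artifact (for example a stray .git directory) is present.
function findPolluter(
  testFiles: string[],
  runTest: (file: string) => void,
  pollutionExists: () => boolean,
): string | null {
  if (pollutionExists()) {
    // Starting from a dirty state would blame the first test unfairly.
    throw new Error('Pollution already present; clean up before searching');
  }
  for (const file of testFiles) {
    runTest(file);
    if (pollutionExists()) {
      return file; // first test file that leaves the artifact behind
    }
  }
  return null; // no test file produced the artifact
}
```

In a real project, runTest might shell out to the project's test command for a single file and pollutionExists might check the filesystem for the artifact; both are left to the caller here.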

Real Example: Empty projectDir

Symptom: .git created in packages/core/ (source code)
Trace chain:
  1. git init runs in process.cwd() ← empty cwd parameter
  2. WorktreeManager called with empty projectDir
  3. Session.create() passed empty string
  4. Test accessed context.tempDir before beforeEach
  5. setupCoreTest() returns { tempDir: '' } initially
Root cause: Top-level variable initialization accessing empty value
Fix: Made tempDir a getter that throws if accessed before beforeEach
Also added defense-in-depth:
  • Layer 1: Project.create() validates directory
  • Layer 2: WorkspaceManager validates not empty
  • Layer 3: NODE_ENV guard refuses git init outside tmpdir
  • Layer 4: Stack trace logging before git init
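The source fix and one validation layer might look roughly like the sketch below. It is a reconstruction under stated assumptions, not the project's actual code: setupCoreTestGuarded, runBeforeEach, and assertUsableDirectory are hypothetical names, and the temp path is illustrative.

```typescript
// Hypothetical reconstruction of the fix; only tempDir comes from the original text.
interface GuardedContext {
  readonly tempDir: string;
}

function setupCoreTestGuarded(): { context: GuardedContext; runBeforeEach: () => void } {
  let tempDir: string | undefined; // undefined until the per-test setup runs
  const context: GuardedContext = {
    get tempDir(): string {
      if (tempDir === undefined) {
        // Fail loudly at the source instead of handing out an empty path.
        throw new Error('tempDir accessed before beforeEach ran');
      }
      return tempDir;
    },
  };
  // Stand-in for the beforeEach hook that creates the temp directory.
  const runBeforeEach = (): void => {
    tempDir = '/tmp/core-test-1234'; // illustrative path only
  };
  return { context, runBeforeEach };
}

// A validation layer of the kind listed above: reject empty directory arguments.
function assertUsableDirectory(directory: string): string {
  if (directory.trim() === '') {
    throw new Error('directory must be a non-empty path');
  }
  return directory;
}
```

With the getter in place, the early access throws at the line that performs it, so the stack trace points at the real trigger rather than at git init several calls later.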

Key Principle

dot
digraph principle {
    "Found immediate cause" [shape=ellipse];
    "Can trace one level up?" [shape=diamond];
    "Trace backwards" [shape=box];
    "Is this the source?" [shape=diamond];
    "Fix at source" [shape=box];
    "Add validation at each layer" [shape=box];
    "Bug impossible" [shape=doublecircle];
    "NEVER fix just the symptom" [shape=octagon, style=filled, fillcolor=red, fontcolor=white];

    "Found immediate cause" -> "Can trace one level up?";
    "Can trace one level up?" -> "Trace backwards" [label="yes"];
    "Can trace one level up?" -> "NEVER fix just the symptom" [label="no"];
    "Trace backwards" -> "Is this the source?";
    "Is this the source?" -> "Trace backwards" [label="no - keeps going"];
    "Is this the source?" -> "Fix at source" [label="yes"];
    "Fix at source" -> "Add validation at each layer";
    "Add validation at each layer" -> "Bug impossible";
}
NEVER fix just where the error appears. Trace back to find the original trigger.

Stack Trace Tips

  • In tests: Use console.error(), not logger - logger may be suppressed
  • Before operation: Log before the dangerous operation, not after it fails
  • Include context: Directory, cwd, environment variables, timestamps
  • Capture stack: new Error().stack shows the call chain (V8 keeps 10 frames by default; raise Error.stackTraceLimit for deeper chains)

Real-World Impact

From debugging session (2025-10-03):
  • Found root cause through 5-level trace
  • Fixed at source (getter validation)
  • Added 4 layers of defense
  • 1847 tests passed, zero pollution