Systematic Debugging


Overview


Random fixes waste time and create new bugs. Quick patches mask underlying issues.
Core principle: ALWAYS find root cause before attempting fixes. Symptom fixes are failure.
Violating the letter of this process is violating the spirit of debugging.

The Iron Law


NO FIXES WITHOUT ROOT CAUSE INVESTIGATION FIRST
If you haven't completed Phase 1, you cannot propose fixes.

When to Use


Use for ANY technical issue:
  • Test failures
  • Bugs in production
  • Unexpected behavior
  • Performance problems
  • Build failures
  • Integration issues
Use this ESPECIALLY when:
  • Under time pressure (emergencies make guessing tempting)
  • "Just one quick fix" seems obvious
  • You've already tried multiple fixes
  • Previous fix didn't work
  • You don't fully understand the issue
Don't skip when:
  • Issue seems simple (simple bugs have root causes too)
  • You're in a hurry (rushing guarantees rework)
  • Manager wants it fixed NOW (systematic is faster than thrashing)

The Four Phases


You MUST complete each phase before proceeding to the next.

Phase 1: Root Cause Investigation


BEFORE attempting ANY fix:
  1. Read Error Messages Carefully
    • Don't skip past errors or warnings
    • They often contain the exact solution
    • Read stack traces completely
    • Note line numbers, file paths, error codes
  2. Reproduce Consistently
    • Can you trigger it reliably?
    • What are the exact steps?
    • Does it happen every time?
    • If not reproducible → gather more data, don't guess
  3. Check Recent Changes
    • What changed that could cause this?
    • Git diff, recent commits
    • New dependencies, config changes
    • Environmental differences
  4. Gather Evidence in Multi-Component Systems
    WHEN system has multiple components (CI → build → signing, API → service → database):
    BEFORE proposing fixes, add diagnostic instrumentation:
    For EACH component boundary:
      - Log what data enters component
      - Log what data exits component
      - Verify environment/config propagation
      - Check state at each layer
    
    Run once to gather evidence showing WHERE it breaks
    THEN analyze evidence to identify failing component
    THEN investigate that specific component
    Example (multi-layer system):
    ```bash
    # Layer 1: Workflow
    echo "=== Secrets available in workflow: ==="
    # Print only SET/UNSET so the secret value never reaches the log
    echo "IDENTITY: $([ -n "${IDENTITY:-}" ] && echo SET || echo UNSET)"

    # Layer 2: Build script
    echo "=== Env vars in build script: ==="
    # List matching variable names only, not their values
    env | cut -d= -f1 | grep IDENTITY || echo "IDENTITY not in environment"

    # Layer 3: Signing script
    echo "=== Keychain state: ==="
    security list-keychains
    security find-identity -v

    # Layer 4: Actual signing
    codesign --sign "$IDENTITY" --verbose=4 "$APP"
    ```
    This reveals: Which layer fails (secrets → workflow ✓, workflow → build ✗)
  5. Trace Data Flow
    WHEN error is deep in call stack:
    Trace backward, as described under Built-In Deep-Stack Tracing:
    • Where does bad value originate?
    • What called this with bad value?
    • Keep tracing up until you find the source
    • Fix at source, not at symptom
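Step 2 above asks whether the failure triggers reliably. One way to replace a gut feeling with a number is to run the suspect operation repeatedly and count the failures. The sketch below is illustrative only: `measureRepro`, `ReproReport`, and `flakyOperation` are hypothetical names, and the stand-in operation fails on every third call purely for demonstration.

```typescript
// Hypothetical helper: run a suspect operation repeatedly and summarize
// how often it fails and with which distinct error messages.
type ReproReport = {
  runs: number;
  failures: number;
  messages: string[];
};

function measureRepro(operation: () => void, runs: number): ReproReport {
  const messages = new Set<string>();
  let failures = 0;
  for (let i = 0; i < runs; i++) {
    try {
      operation();
    } catch (err) {
      failures++;
      messages.add(err instanceof Error ? err.message : String(err));
    }
  }
  return { runs, failures, messages: Array.from(messages) };
}

// Stand-in for the real failing code (assumption): fails on every third call.
let calls = 0;
function flakyOperation(): void {
  calls++;
  if (calls % 3 === 0) throw new Error("connection reset");
}

const report = measureRepro(flakyOperation, 9);
console.log(report); // { runs: 9, failures: 3, messages: [ 'connection reset' ] }
```

A failure count well below the run count suggests a timing or state dependency, which is a cue to gather more data rather than guess.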

Built-In Deep-Stack Tracing


Use this when the bug appears far away from the real trigger.
Typical signals:
  • Error happens deep in execution, not at the entry point
  • Stack trace is long
  • You can see the failure site but not where the bad value came from
  • The tempting fix is at the symptom point
Tracing process:
  1. Observe the symptom.
  2. Find the immediate cause.
  3. Ask what called it.
  4. Keep tracing up until you find the first bad input, invalid assumption, or wrong trigger.
  5. Fix there, then add defense-in-depth at lower layers if useful.
Mini example:
```typescript
await execFileAsync("git", ["init"], { cwd: projectDir });
```
  • Immediate cause: `git init` runs in the wrong directory
  • Next question: who passed `projectDir`?
  • Next question: where did that bad `projectDir` come from?
  • Root cause might be much higher up than the failing line
When manual tracing stalls, add instrumentation before the dangerous operation:
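```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

async function gitInit(directory: string) {
  // Capture the call stack so the log shows who requested this git init
  const stack = new Error().stack;
  console.error("DEBUG git init:", {
    directory,
    cwd: process.cwd(),
    nodeEnv: process.env.NODE_ENV,
    stack,
  });

  await execFileAsync("git", ["init"], { cwd: directory });
}
```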
Tracing rule: never stop at "this line crashed." Keep going until you can say which caller, input, or state transition created the bad value.
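Step 5 of the tracing process mentions defense-in-depth at lower layers. A minimal sketch of such a guard follows, assuming for illustration that the bad value turned out to be an empty or missing directory. `requireDirectory` is a hypothetical helper, not part of the example above, and the guard complements the fix at the source rather than replacing it.

```typescript
// Hypothetical guard: reject an empty or missing directory before any
// command runs, instead of letting it execute somewhere unintended.
function requireDirectory(value: string | undefined, label: string): string {
  if (value === undefined || value.trim() === "") {
    throw new Error(`${label} is empty; refusing to run in an unintended directory`);
  }
  return value;
}

// A valid path passes through unchanged.
const accepted = requireDirectory("/tmp/demo-project", "projectDir");

// An empty value fails loudly and names the offending parameter.
let rejection = "";
try {
  requireDirectory("", "projectDir");
} catch (err) {
  rejection = err instanceof Error ? err.message : String(err);
}

console.log(accepted);
console.log(rejection);
```

Calling the guard both where the value is first produced and again just before the dangerous operation means a regression in one layer still gets caught by the other.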

Phase 2: Pattern Analysis


Find the pattern before fixing:
  1. Find Working Examples
    • Locate similar working code in same codebase
    • What works that's similar to what's broken?
  2. Compare Against References
    • If implementing pattern, read reference implementation COMPLETELY
    • Don't skim - read every line
    • Understand the pattern fully before applying
  3. Identify Differences
    • What's different between working and broken?
    • List every difference, however small
    • Don't assume "that can't matter"
  4. Understand Dependencies
    • What other components does this need?
    • What settings, config, environment?
    • What assumptions does it make?
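Step 3 above asks for every difference between the working and the broken case, however small. When both cases can be described as flat settings, the comparison can be done mechanically. The sketch below is one possible approach. `listDifferences` and the sample settings are hypothetical and only illustrate the idea.

```typescript
type Settings = Record<string, string | number | boolean>;
type Difference = { key: string; working: unknown; broken: unknown };

// List every key whose value differs, including keys present on one side only.
function listDifferences(working: Settings, broken: Settings): Difference[] {
  const allKeys = Object.keys(working).concat(Object.keys(broken));
  const uniqueKeys = Array.from(new Set(allKeys)).sort();
  return uniqueKeys
    .filter((key) => working[key] !== broken[key])
    .map((key) => ({ key, working: working[key], broken: broken[key] }));
}

// Illustrative values only (assumptions, not taken from a real system).
const workingSettings: Settings = { nodeVersion: "20", region: "us-east-1", retries: 3, cache: true };
const brokenSettings: Settings = { nodeVersion: "20", region: "us-east-1", retries: 3, cache: false, timeoutMs: 500 };

const differences = listDifferences(workingSettings, brokenSettings);
console.log(differences);
```

Here the result contains two entries, `cache` and `timeoutMs`. Both go on the list, even if one of them looks like it cannot matter.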

Phase 3: Hypothesis and Testing


Scientific method:
  1. Form Single Hypothesis
    • State clearly: "I think X is the root cause because Y"
    • Write it down
    • Be specific, not vague
  2. Test Minimally
    • Make the SMALLEST possible change to test hypothesis
    • One variable at a time
    • Don't fix multiple things at once
  3. Verify Before Continuing
    • Did it work? Yes → Phase 4
    • Didn't work? Form NEW hypothesis
    • DON'T add more fixes on top
  4. When You Don't Know
    • Say "I don't understand X"
    • Don't pretend to know
    • Ask for help
    • Research more
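The "one variable at a time" rule in step 2 can be sketched as a small harness: start from the broken baseline, apply exactly one candidate change per trial, and record which single change flips the outcome. Everything here is hypothetical, including `testOneVariableAtATime` and the stand-in `isHealthy` check. In practice the check would be the real reproduction from Phase 1.

```typescript
type Value = string | number | boolean;
type Config = Record<string, Value>;
type Trial = { key: string; value: Value; passed: boolean };

// Apply each candidate change on its own to a copy of the baseline.
function testOneVariableAtATime(
  baseline: Config,
  candidates: Config,
  check: (config: Config) => boolean,
): Trial[] {
  return Object.keys(candidates).map((key) => {
    const trialConfig: Config = { ...baseline, [key]: candidates[key] };
    return { key, value: candidates[key], passed: check(trialConfig) };
  });
}

// Stand-in check (assumption): the system only works when the cache is enabled.
const isHealthy = (config: Config): boolean => config.cache === true;

const brokenBaseline: Config = { cache: false, timeoutMs: 500 };
const trials = testOneVariableAtATime(brokenBaseline, { cache: true, timeoutMs: 1000 }, isHealthy);
console.log(trials);
```

Because each trial changes a single key, a passing trial points at one variable. Bundling both changes into one trial would pass as well, but would not say which change mattered.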

Phase 4: Implementation


Fix the root cause, not the symptom:
  1. Decide on Testing Strategy
    Auto-decide based on complexity:
    • Write test for: Complex algorithms, business logic, data transformations where bugs are likely
    • Skip test for: UI components, React hooks, simple CRUD, straightforward mappings, anything you're 100% certain is correct
    • Test type: Only deterministic unit tests - no integration tests, no complex mocking, no async complexity
    If writing test:
    • Simplest possible reproduction
    • Automated test that fails before fix
    • Verify logic, not implementation details
    If skipping test:
    • Verify fix with typecheck/lint
    • Manual verification for UI changes
    • Code review confidence that fix is correct
  2. Implement Single Fix
    • Address the root cause identified
    • ONE change at a time
    • No "while I'm here" improvements
    • No bundled refactoring
  3. Verify Fix
    If test was written:
    • Test passes now?
    • No other tests broken?
    If no test:
    • Typecheck passes?
    • Lint clean?
    • Manual verification confirms fix?
    Always check:
    • Issue actually resolved?
    • No regressions in related functionality?
  4. If Fix Doesn't Work
    • STOP
    • Count: How many fixes have you tried?
    • If < 3: Return to Phase 1, re-analyze with new information
    • If ≥ 3: STOP and question the architecture (step 5 below)
    • DON'T attempt Fix #4 without architectural discussion
  5. If 3+ Fixes Failed: Question Architecture
    Pattern indicating architectural problem:
    • Each fix reveals new shared state/coupling/problem in different place
    • Fixes require "massive refactoring" to implement
    • Each fix creates new symptoms elsewhere
    STOP and question fundamentals:
    • Is this pattern fundamentally sound?
    • Are we "sticking with it through sheer inertia"?
    • Should we refactor architecture vs. continue fixing symptoms?
    Discuss with your human partner before attempting more fixes
    This is NOT a failed hypothesis - this is a wrong architecture.
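For the "if writing test" path in step 1, the sketch below shows what a simplest-possible, deterministic test that fails before the fix and passes after it might look like. The scenario is invented for illustration: `averageBeforeFix`, `average`, and the choice to return 0 for an empty list are all assumptions, not part of the process above.

```typescript
// Hypothetical buggy transformation: an empty list yields 0 / 0, which is NaN.
function averageBeforeFix(values: number[]): number {
  const sum = values.reduce((total, v) => total + v, 0);
  return sum / values.length;
}

// Fix at the root cause: handle the empty input explicitly.
// Returning 0 is an assumption; the real requirement may differ.
function average(values: number[]): number {
  if (values.length === 0) return 0;
  const sum = values.reduce((total, v) => total + v, 0);
  return sum / values.length;
}

// Deterministic check of observable behavior: no mocks, I/O, or timers.
function emptyInputTestPasses(impl: (values: number[]) => number): boolean {
  return impl([]) === 0 && impl([2, 4]) === 3;
}

const passedBeforeFix = emptyInputTestPasses(averageBeforeFix);
const passedAfterFix = emptyInputTestPasses(average);
console.log("before fix:", passedBeforeFix); // before fix: false
console.log("after fix:", passedAfterFix); // after fix: true
```

Running the same check against both versions confirms that the test actually detects the bug. A test that passes before the fix proves nothing about the fix.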

Red Flags - STOP and Follow Process


If you catch yourself thinking:
  • "Quick fix for now, investigate later"
  • "Just try changing X and see if it works"
  • "Add multiple changes, run tests"
  • "It's probably X, let me fix that"
  • "I don't fully understand but this might work"
  • "Pattern says X but I'll adapt it differently"
  • "Here are the main problems: [lists fixes without investigation]"
  • Proposing solutions before tracing data flow
  • "One more fix attempt" (when already tried 2+)
  • Each fix reveals new problem in different place
  • Writing tests for UI components when you're certain the fix is correct
ALL of these mean: STOP. Return to Phase 1.
If 3+ fixes failed: Question the architecture (see Phase 4, step 5).

Your Human Partner's Signals You're Doing It Wrong


Watch for these redirections:
  • "Is that not happening?" - You assumed without verifying
  • "Will it show us...?" - You should have added evidence gathering
  • "Stop guessing" - You're proposing fixes without understanding
  • "Ultrathink this" - Question fundamentals, not just symptoms
  • "We're stuck?" (frustrated) - Your approach isn't working
When you see these: STOP. Return to Phase 1.

Common Rationalizations


| Excuse | Reality |
| --- | --- |
| "Issue is simple, don't need process" | Simple issues have root causes too. Process is fast for simple bugs. |
| "Emergency, no time for process" | Systematic debugging is FASTER than guess-and-check thrashing. |
| "Just try this first, then investigate" | First fix sets the pattern. Do it right from the start. |
| "Multiple fixes at once saves time" | Can't isolate what worked. Causes new bugs. |
| "Reference too long, I'll adapt the pattern" | Partial understanding guarantees bugs. Read it completely. |
| "I see the problem, let me fix it" | Seeing symptoms ≠ understanding root cause. |
| "One more fix attempt" (after 2+ failures) | 3+ failures = architectural problem. Question pattern, don't fix again. |
| "UI fix doesn't need tests" | Correct! UI components verified via typecheck/manual testing, not unit tests. |

Quick Reference


| Phase | Key Activities | Success Criteria |
| --- | --- | --- |
| 1. Root Cause | Read errors, reproduce, check changes, gather evidence | Understand WHAT and WHY |
| 2. Pattern | Find working examples, compare | Identify differences |
| 3. Hypothesis | Form theory, test minimally | Confirmed or new hypothesis |
| 4. Implementation | Create test, fix, verify | Bug resolved, tests pass |

When Process Reveals "No Root Cause"


If systematic investigation reveals issue is truly environmental, timing-dependent, or external:
  1. You've completed the process
  2. Document what you investigated
  3. Implement appropriate handling (retry, timeout, error message)
  4. Add monitoring/logging for future investigation
But: 95% of "no root cause" cases are incomplete investigation.
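For the genuinely external case, steps 3 and 4 (handling plus logging) might be combined along the lines below. This is a sketch under stated assumptions, not a recommended library: `withTimeout` and `retry` are hypothetical helpers, the attempt count and timeout are placeholder values, and the timeout only stops waiting; it does not cancel the underlying work.

```typescript
// Reject if the work does not settle within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Retry a bounded number of times and log every failure for later investigation.
async function retry<T>(
  operation: () => Promise<T>,
  attempts: number,
  timeoutMs: number,
  log: (message: string) => void,
): Promise<T> {
  let lastMessage = "no attempts made";
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await withTimeout(operation(), timeoutMs);
    } catch (err) {
      lastMessage = err instanceof Error ? err.message : String(err);
      log(`attempt ${attempt}/${attempts} failed: ${lastMessage}`);
    }
  }
  throw new Error(`operation failed after ${attempts} attempts; last error: ${lastMessage}`);
}

// Stand-in for an external dependency (assumption): fails twice, then succeeds.
let tries = 0;
const failureLog: string[] = [];
const outcome = retry(
  async () => {
    tries++;
    if (tries < 3) throw new Error("upstream unavailable");
    return "ok";
  },
  5,
  1000,
  (message) => { failureLog.push(message); },
);

outcome.then((value) => console.log(value, "after", tries, "tries", failureLog));
```

The log entries are the part that serves step 4: if the failures later turn out to have a root cause after all, the recorded messages and attempt counts are the starting evidence.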

Integration with Other Skills


Testing skills (when needed):
  • tdd (if available) - Use when fixing complex business logic that needs test coverage
  • Skip for UI components, simple CRUD, or anything verifiable via typecheck/manual testing

Real-World Impact


From debugging sessions:
  • Systematic approach: 15-30 minutes to fix
  • Random fixes approach: 2-3 hours of thrashing
  • First-time fix rate: 95% vs 40%
  • New bugs introduced: Near zero vs common