Systematic Debugging
Overview
Random fixes waste time and create new bugs. Quick patches mask underlying issues.
Core principle: ALWAYS find root cause before attempting fixes. Symptom fixes are failure.
Violating the letter of this process is violating the spirit of debugging.
The Iron Law
NO FIXES WITHOUT ROOT CAUSE INVESTIGATION FIRST
If you haven't completed Phase 1, you cannot propose fixes.
When to Use
Use for ANY technical issue:
- Test failures
- Bugs in production
- Unexpected behavior
- Performance problems
- Build failures
- Integration issues
Use this ESPECIALLY when:
- Under time pressure (emergencies make guessing tempting)
- "Just one quick fix" seems obvious
- You've already tried multiple fixes
- Previous fix didn't work
- You don't fully understand the issue
Don't skip when:
- Issue seems simple (simple bugs have root causes too)
- You're in a hurry (rushing guarantees rework)
- Manager wants it fixed NOW (systematic is faster than thrashing)
The Four Phases
You MUST complete each phase before proceeding to the next.
Phase 1: Root Cause Investigation
BEFORE attempting ANY fix:
1. Read Error Messages Carefully
   - Don't skip past errors or warnings
   - They often contain the exact solution
   - Read stack traces completely
   - Note line numbers, file paths, error codes
2. Reproduce Consistently
   - Can you trigger it reliably?
   - What are the exact steps?
   - Does it happen every time?
   - If not reproducible → gather more data, don't guess
3. Check Recent Changes
   - What changed that could cause this?
   - Git diff, recent commits
   - New dependencies, config changes
   - Environmental differences
4. Gather Evidence in Multi-Component Systems

   WHEN the system has multiple components (CI → build → signing, API → service → database):

   BEFORE proposing fixes, add diagnostic instrumentation:

   For EACH component boundary:
   - Log what data enters the component
   - Log what data exits the component
   - Verify environment/config propagation
   - Check state at each layer

   Run once to gather evidence showing WHERE it breaks,
   THEN analyze the evidence to identify the failing component,
   THEN investigate that specific component.

   Example (multi-layer system):

   ```bash
   # Layer 1: Workflow
   echo "=== Secrets available in workflow: ==="
   # Print only SET/UNSET so the secret value itself never reaches the log
   echo "IDENTITY: $([ -n "${IDENTITY:-}" ] && echo SET || echo UNSET)"

   # Layer 2: Build script
   echo "=== Env vars in build script: ==="
   # -q reports presence without echoing the variable's value
   env | grep -q '^IDENTITY=' && echo "IDENTITY in environment" || echo "IDENTITY not in environment"

   # Layer 3: Signing script
   echo "=== Keychain state: ==="
   security list-keychains
   security find-identity -v

   # Layer 4: Actual signing
   codesign --sign "$IDENTITY" --verbose=4 "$APP"
   ```

   This reveals which layer fails (secrets → workflow ✓, workflow → build ✗).
5. Trace Data Flow

   WHEN the error is deep in the call stack, use built-in backward tracing:
   - Where does the bad value originate?
   - What called this with the bad value?
   - Keep tracing up until you find the source
   - Fix at the source, not at the symptom
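The boundary logging in step 4 can also be done in application code. The following is a minimal sketch, assuming a hypothetical three-stage pipeline (API → service → database); `instrument`, `BoundaryRecord`, and the three stage functions are illustrative names, not part of any real library. Each wrapped stage records what entered and what left it, so a single run shows where the data first goes wrong.

```typescript
// All names here are hypothetical; the pipeline stands in for API -> service -> database.
type Json = Record<string, unknown>;

interface BoundaryRecord {
  stage: string;
  input: unknown;
  output?: unknown;
  error?: string;
}

// Wrap a stage so every call records what entered it and what left it.
function instrument<I, O>(
  stage: string,
  fn: (input: I) => O,
  log: BoundaryRecord[],
): (input: I) => O {
  return (input: I): O => {
    const record: BoundaryRecord = { stage, input };
    log.push(record);
    try {
      const output = fn(input);
      record.output = output;
      return output;
    } catch (err) {
      record.error = err instanceof Error ? err.message : String(err);
      throw err;
    }
  };
}

const parseRequest = (body: string): Json => JSON.parse(body) as Json;
// Bug under investigation: reads `user_id`, but the API layer sends `userId`.
const buildQuery = (req: Json): Json => ({ id: req["user_id"] });
const runQuery = (query: Json): string => {
  if (typeof query["id"] !== "string") throw new Error("id must be a string");
  return `row for ${query["id"]}`;
};

const log: BoundaryRecord[] = [];
const api = instrument("api", parseRequest, log);
const service = instrument("service", buildQuery, log);
const database = instrument("database", runQuery, log);

let failure = "";
try {
  database(service(api('{"userId":"42"}')));
} catch (err) {
  failure = err instanceof Error ? err.message : String(err);
}

// One run, one line per boundary: compare each stage's input with its output.
for (const r of log) {
  console.log(
    r.stage,
    "in:", JSON.stringify(r.input),
    "out:", JSON.stringify(r.output),
    "error:", r.error ?? "none",
  );
}
```

In this sketch the error is thrown by the database stage, but the log shows the service stage already emitting an object without an `id`. The service is therefore the component to investigate, not the one that crashed.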
Built-In Deep-Stack Tracing
Use this when the bug appears far away from the real trigger.
Typical signals:
- Error happens deep in execution, not at the entry point
- Stack trace is long
- You can see the failure site but not where the bad value came from
- The tempting fix is at the symptom point
Tracing process:
- Observe the symptom.
- Find the immediate cause.
- Ask what called it.
- Keep tracing up until you find the first bad input, invalid assumption, or wrong trigger.
- Fix there, then add defense-in-depth at lower layers if useful.
Mini example:
```typescript
await execFileAsync("git", ["init"], { cwd: projectDir });
```
- Immediate cause: `git init` runs in the wrong directory
- Next question: who passed `projectDir`?
- Next question: where did that bad `projectDir` come from?
- Root cause might be much higher up than the failing line
When manual tracing stalls, add instrumentation before the dangerous operation:
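```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

async function gitInit(directory: string) {
  // Capture the call stack before the dangerous operation so the log shows who called us
  const stack = new Error().stack;
  console.error("DEBUG git init:", {
    directory,
    cwd: process.cwd(),
    nodeEnv: process.env.NODE_ENV,
    stack,
  });
  await execFileAsync("git", ["init"], { cwd: directory });
}
```
Tracing rule: never stop at "this line crashed." Keep going until you can say which caller, input, or state transition created the bad value.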
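Once tracing reaches the origin of the bad value, the fix belongs there, with an optional guard at the lower layer. The sketch below assumes a hypothetical scenario in which an unset `tempDir` silently becomes an empty string and later reaches `git init`. `TestContext`, `projectDirBuggy`, `projectDirFixed`, and `assertUsableDirectory` are made-up names for illustration.

```typescript
// Hypothetical names throughout; the point is where each check lives.
interface TestContext {
  tempDir?: string; // assumed to be populated by a setup hook
}

// Source of the bad value: an unset tempDir silently turns into "".
function projectDirBuggy(ctx: TestContext): string {
  return ctx.tempDir ?? "";
}

// Fix at the source: refuse to produce a directory that was never set up.
function projectDirFixed(ctx: TestContext): string {
  if (!ctx.tempDir) {
    throw new Error("tempDir is not set; was the setup hook run?");
  }
  return ctx.tempDir;
}

// Defense-in-depth at the lower layer: validate right before the dangerous operation.
function assertUsableDirectory(directory: string): void {
  if (directory.trim() === "") {
    throw new Error("refusing to run git init with an empty directory");
  }
}

console.log(JSON.stringify(projectDirBuggy({}))); // the bad value that flows downstream
try {
  projectDirFixed({});
} catch (err) {
  console.log(err instanceof Error ? err.message : String(err));
}
```

The source-level check fails close to the real trigger with a message that names it. The lower-level guard only catches whatever slips past the first check.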
Phase 2: Pattern Analysis
Find the pattern before fixing:
1. Find Working Examples
   - Locate similar working code in same codebase
   - What works that's similar to what's broken?
2. Compare Against References
   - If implementing pattern, read reference implementation COMPLETELY
   - Don't skim - read every line
   - Understand the pattern fully before applying
3. Identify Differences
   - What's different between working and broken?
   - List every difference, however small
   - Don't assume "that can't matter"
4. Understand Dependencies
   - What other components does this need?
   - What settings, config, environment?
   - What assumptions does it make?
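"List every difference" is easier to do mechanically than by eye. As a rough sketch, assuming the working and broken cases can each be described as a flat settings object, the hypothetical `listDifferences` helper below reports every key whose value differs, including keys present on only one side. The two job configurations are invented for the example.

```typescript
type Settings = Record<string, string | number | boolean | undefined>;

interface Difference {
  key: string;
  working: Settings[string];
  broken: Settings[string];
}

// List every key whose value differs, including keys that exist on only one side.
function listDifferences(working: Settings, broken: Settings): Difference[] {
  const keys = Object.keys(working)
    .concat(Object.keys(broken).filter((key) => !(key in working)))
    .sort();
  return keys
    .filter((key) => working[key] !== broken[key])
    .map((key) => ({ key, working: working[key], broken: broken[key] }));
}

// Hypothetical configurations of a job that works and one that fails.
const workingJob: Settings = { node: "20", cache: true, timeoutSec: 30, region: "us-east-1" };
const brokenJob: Settings = { node: "20", cache: false, timeoutSec: 30 };

for (const d of listDifferences(workingJob, brokenJob)) {
  console.log(`${d.key}: working=${String(d.working)} broken=${String(d.broken)}`);
}
```

Here the output lists both `cache` and the missing `region`. Neither is dismissed in advance, and each becomes a candidate for a single-variable hypothesis in Phase 3.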
Phase 3: Hypothesis and Testing
Scientific method:
1. Form Single Hypothesis
   - State clearly: "I think X is the root cause because Y"
   - Write it down
   - Be specific, not vague
2. Test Minimally
   - Make the SMALLEST possible change to test hypothesis
   - One variable at a time
   - Don't fix multiple things at once
3. Verify Before Continuing
   - Did it work? Yes → Phase 4
   - Didn't work? Form NEW hypothesis
   - DON'T add more fixes on top
4. When You Don't Know
   - Say "I don't understand X"
   - Don't pretend to know
   - Ask for help
   - Research more
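One way to keep experiments honest is to make "one variable at a time" structural. The sketch below is illustrative only: `Hypothesis`, `testHypothesis`, and the `reproPasses` check are hypothetical stand-ins for whatever reproduction exists. Each hypothesis is applied to a fresh copy of the baseline, so a failed attempt never stays stacked under the next one.

```typescript
type Config = Record<string, string | number | boolean>;

interface Hypothesis {
  claim: string; // "I think X is the root cause because Y"
  key: string; // the single variable this hypothesis changes
  value: Config[string];
}

// Apply exactly one change to a copy of the baseline and report whether the check passes.
function testHypothesis(
  baseline: Config,
  hypothesis: Hypothesis,
  check: (config: Config) => boolean,
): boolean {
  const candidate: Config = { ...baseline, [hypothesis.key]: hypothesis.value };
  return check(candidate);
}

// Hypothetical baseline and a stand-in for "does the reproduction pass now?".
const baseline: Config = { retries: 0, timeoutMs: 100, gzip: true };
const reproPasses = (config: Config): boolean => {
  const timeout = config["timeoutMs"];
  return typeof timeout === "number" && timeout >= 500;
};

const hypotheses: Hypothesis[] = [
  { claim: "Retries are the root cause because the first request sometimes fails", key: "retries", value: 3 },
  { claim: "The timeout is the root cause because slow responses take about 400 ms", key: "timeoutMs", value: 500 },
];

const confirmed = hypotheses.filter((h) => testHypothesis(baseline, h, reproPasses));
for (const h of confirmed) {
  console.log("confirmed:", h.claim);
}
```

Because the baseline is never mutated, the rejected `retries` change is not carried into the second test. Only the confirmed hypothesis moves on to Phase 4.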
Phase 4: Implementation
Fix the root cause, not the symptom:
1. Decide on Testing Strategy

   Auto-decide based on complexity:
   - Write test for: Complex algorithms, business logic, data transformations where bugs are likely
   - Skip test for: UI components, React hooks, simple CRUD, straightforward mappings, anything you're 100% certain is correct
   - Test type: Only deterministic unit tests - no integration tests, no complex mocking, no async complexity

   If writing test:
   - Simplest possible reproduction
   - Automated test that fails before fix
   - Verify logic, not implementation details

   If skipping test:
   - Verify fix with typecheck/lint
   - Manual verification for UI changes
   - Code review confidence that fix is correct
2. Implement Single Fix
   - Address the root cause identified
   - ONE change at a time
   - No "while I'm here" improvements
   - No bundled refactoring
3. Verify Fix

   If test was written:
   - Test passes now?
   - No other tests broken?

   If no test:
   - Typecheck passes?
   - Lint clean?
   - Manual verification confirms fix?

   Always check:
   - Issue actually resolved?
   - No regressions in related functionality?
4. If Fix Doesn't Work
   - STOP
   - Count: How many fixes have you tried?
   - If < 3: Return to Phase 1, re-analyze with new information
   - If ≥ 3: STOP and question the architecture (step 5 below)
   - DON'T attempt Fix #4 without architectural discussion
5. If 3+ Fixes Failed: Question Architecture

   Pattern indicating architectural problem:
   - Each fix reveals new shared state/coupling/problem in different place
   - Fixes require "massive refactoring" to implement
   - Each fix creates new symptoms elsewhere

   STOP and question fundamentals:
   - Is this pattern fundamentally sound?
   - Are we "sticking with it through sheer inertia"?
   - Should we refactor architecture vs. continue fixing symptoms?

   Discuss with your human partner before attempting more fixes.

   This is NOT a failed hypothesis - this is a wrong architecture.
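For the "automated test that fails before fix" item, a deterministic unit test on a pure function is usually enough. The example below is a made-up illustration, assuming the root cause was a rounding error in a page-count calculation. `pageCountBefore`, `pageCountAfter`, and `regressionFailures` are hypothetical names, and the check is written as a plain function rather than with any particular test framework.

```typescript
type PageCount = (itemCount: number, pageSize: number) => number;

// Before the fix: rounding down drops the final partial page.
const pageCountBefore: PageCount = (itemCount, pageSize) => Math.floor(itemCount / pageSize);

// After the fix: a single change, addressing only the identified root cause.
const pageCountAfter: PageCount = (itemCount, pageSize) => Math.ceil(itemCount / pageSize);

// Deterministic regression test: returns a description of every failing case.
function regressionFailures(pageCount: PageCount): string[] {
  const cases = [
    { itemCount: 10, pageSize: 3, expected: 4 }, // simplest reproduction of the reported bug
    { itemCount: 9, pageSize: 3, expected: 3 }, // exact multiple must keep working
    { itemCount: 0, pageSize: 3, expected: 0 }, // empty list must keep working
  ];
  return cases
    .filter((c) => pageCount(c.itemCount, c.pageSize) !== c.expected)
    .map((c) => `pageCount(${c.itemCount}, ${c.pageSize}) should be ${c.expected}`);
}

console.log("before fix:", regressionFailures(pageCountBefore));
console.log("after fix:", regressionFailures(pageCountAfter));
```

Running the same cases against both versions shows the test failing before the change and passing after it. The two extra cases check that related behavior has not regressed.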
Red Flags - STOP and Follow Process
If you catch yourself thinking:
- "Quick fix for now, investigate later"
- "Just try changing X and see if it works"
- "Add multiple changes, run tests"
- "It's probably X, let me fix that"
- "I don't fully understand but this might work"
- "Pattern says X but I'll adapt it differently"
- "Here are the main problems: [lists fixes without investigation]"
- Proposing solutions before tracing data flow
- "One more fix attempt" (when already tried 2+)
- Each fix reveals new problem in different place
- Writing tests for UI components when you're certain the fix is correct
ALL of these mean: STOP. Return to Phase 1.
If 3+ fixes failed: Question the architecture (see Phase 4.5)
Your Human Partner's Signals You're Doing It Wrong
Watch for these redirections:
- "Is that not happening?" - You assumed without verifying
- "Will it show us...?" - You should have added evidence gathering
- "Stop guessing" - You're proposing fixes without understanding
- "Ultrathink this" - Question fundamentals, not just symptoms
- "We're stuck?" (frustrated) - Your approach isn't working
When you see these: STOP. Return to Phase 1.
Common Rationalizations
| Excuse | Reality |
|---|---|
| "Issue is simple, don't need process" | Simple issues have root causes too. Process is fast for simple bugs. |
| "Emergency, no time for process" | Systematic debugging is FASTER than guess-and-check thrashing. |
| "Just try this first, then investigate" | First fix sets the pattern. Do it right from the start. |
| "Multiple fixes at once saves time" | Can't isolate what worked. Causes new bugs. |
| "Reference too long, I'll adapt the pattern" | Partial understanding guarantees bugs. Read it completely. |
| "I see the problem, let me fix it" | Seeing symptoms ≠ understanding root cause. |
| "One more fix attempt" (after 2+ failures) | 3+ failures = architectural problem. Question pattern, don't fix again. |
| "UI fix doesn't need tests" | Correct! UI components verified via typecheck/manual testing, not unit tests. |
Quick Reference
| Phase | Key Activities | Success Criteria |
|---|---|---|
| 1. Root Cause | Read errors, reproduce, check changes, gather evidence | Understand WHAT and WHY |
| 2. Pattern | Find working examples, compare | Identify differences |
| 3. Hypothesis | Form theory, test minimally | Confirmed or new hypothesis |
| 4. Implementation | Create test, fix, verify | Bug resolved, tests pass |
When Process Reveals "No Root Cause"
If systematic investigation reveals issue is truly environmental, timing-dependent, or external:
- You've completed the process
- Document what you investigated
- Implement appropriate handling (retry, timeout, error message)
- Add monitoring/logging for future investigation
But: 95% of "no root cause" cases are incomplete investigation.
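When the investigation really does point to a transient external cause, the handling can still leave evidence for later. The sketch below is deliberately simplified: `withRetry` and `AttemptLog` are hypothetical names, the helper is synchronous, and it has no delay between attempts. A real version would likely be asynchronous with a backoff. It retries a bounded number of times and records every failure instead of swallowing it.

```typescript
interface AttemptLog {
  attempt: number;
  error: string;
}

// Retry a bounded number of times, recording each failure for future investigation.
function withRetry<T>(operation: () => T, maxAttempts: number, log: AttemptLog[]): T {
  let lastError: unknown = new Error("maxAttempts must be at least 1");
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return operation();
    } catch (err) {
      lastError = err;
      log.push({ attempt, error: err instanceof Error ? err.message : String(err) });
    }
  }
  // Every attempt failed: surface the last error rather than hiding it.
  throw lastError;
}

// Hypothetical flaky operation: fails twice, then succeeds.
let calls = 0;
const flaky = (): string => {
  calls++;
  if (calls < 3) throw new Error(`transient failure ${calls}`);
  return "ok";
};

const attempts: AttemptLog[] = [];
const result = withRetry(flaky, 5, attempts);
console.log(result, JSON.stringify(attempts));
```

The recorded attempts are what keep this from becoming a quick patch that masks the issue. If the failure rate changes, the log is the starting point for returning to Phase 1.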
Integration with Other Skills
Testing skills (when needed):
- tdd (if available) - Use when fixing complex business logic that needs test coverage
- Skip for UI components, simple CRUD, or anything verifiable via typecheck/manual testing
Real-World Impact
From debugging sessions:
- Systematic approach: 15-30 minutes to fix
- Random fixes approach: 2-3 hours of thrashing
- First-time fix rate: 95% vs 40%
- New bugs introduced: Near zero vs common