systematic-debugging

Systematic Debugging

Overview

Random fixes waste time and introduce new bugs. Hasty patches only mask underlying issues.
Core Principle: Always find the root cause before attempting a fix. Fixing only the symptom is a failure.
Going through the motions defeats the purpose of debugging.

Non-Negotiable Rule

No root cause investigation, no fix proposal
You cannot propose a fix until you have completed Phase 1 (Root Cause Investigation).

When to Use

Use for any technical issue:
  • Test failures
  • Production bugs
  • Abnormal behavior
  • Performance issues
  • Build failures
  • Integration issues
It is mandatory in the following situations:
  • You are under time pressure (emergencies are when people are most likely to guess at fixes)
  • You think "a small change will fix it"
  • You have already tried multiple fixes
  • The last fix didn't work
  • You don't fully understand the problem
Do NOT skip in these cases either:
  • The problem seems simple (even simple bugs have root causes)
  • You're in a hurry (rushing leads to rework)
  • Leadership demands an immediate fix (systematic debugging is faster than trial and error)

Four Phases

You must complete each phase before moving to the next.

Phase 1: Root Cause Investigation

Before attempting any fix:
  1. Read error messages carefully
    • Don't skip errors or warnings
    • They often contain the solution directly
    • Read the full stack trace
    • Note line numbers, file paths, error codes
  2. Reproduce reliably
    • Can you reliably trigger it?
    • What are the exact reproduction steps?
    • Does it reproduce every time?
    • If you can't reproduce → collect more data, don't guess
  3. Check Recent Changes
    • What changes could have caused this issue?
    • git diff, recent commits
    • New dependencies, configuration changes
    • Environment differences
  4. Collect Evidence in Multi-Component Systems
    When the system has multiple components (CI → Build → Signing, API → Service → Database):
    Before proposing a fix, add diagnostic instrumentation:
    For each component boundary:
      - Record data entering the component
      - Record data leaving the component
      - Verify environment/configuration propagation
      - Check state at each layer
    
    Execute once to collect evidence, identify where the break occurs
    Then analyze evidence to locate the faulty component
    Then investigate that component in depth
    Example (multi-layer system):
    bash
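    # Layer 1: Workflow
    # Report only whether the secret is set; never print its value.
    echo "=== Secrets available in workflow: ==="
    if [ -n "${IDENTITY:-}" ]; then echo "IDENTITY: SET"; else echo "IDENTITY: UNSET"; fi
    
    # Layer 2: Build script
    # Match the variable name only, so the value stays out of the logs.
    echo "=== Env vars in build script: ==="
    env | grep -q '^IDENTITY=' && echo "IDENTITY in environment" || echo "IDENTITY not in environment"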
    
    # Layer 3: Signing script
    echo "=== Keychain state: ==="
    security list-keychains
    security find-identity -v
    
    # Layer 4: Actual signing
    codesign --sign "$IDENTITY" --verbose=4 "$APP"
    This shows which layer is failing (for example: secrets → workflow ✓, workflow → build ✗)
  5. Trace Data Flow
    When errors occur deep in the call stack:
    See root-cause-tracing.md in this directory for the full backward-tracing technique.
    Short version:
    • Where did the incorrect value originate?
    • Who called this with the incorrect value?
    • Keep tracing upward until you find the source
    • Fix at the source, not at the symptom
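The "Reproduce reliably" step asks whether the failure happens every time. A minimal sketch of one way to measure that is shown here. The helper name `count_failures` is an assumption made for illustration, and `false` stands in for whatever command actually fails.

```bash
# Run a command several times and report how often it fails.
# Usage: count_failures <runs> <command> [args...]
count_failures() {
  local runs="$1"; shift
  local failures=0 attempt=1
  while [ "$attempt" -le "$runs" ]; do
    # Only the exit status matters here, so output is discarded.
    if ! "$@" >/dev/null 2>&1; then
      failures=$((failures + 1))
    fi
    attempt=$((attempt + 1))
  done
  echo "$failures/$runs runs failed"
}

# "false" is a stand-in for the failing command; replace it with your own.
count_failures 5 false   # prints: 5/5 runs failed
```

A result such as 5/5 means the reproduction is stable. A result such as 2/5 suggests timing or environment effects, so collect more data before forming a hypothesis.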

Phase 2: Pattern Analysis

Find patterns before fixing:
  1. Find working examples
    • Look for similar working code in the same codebase
    • What working code resembles the problematic code?
  2. Compare with reference implementations
    • If implementing a pattern, read the reference implementation fully
    • Don't skim — read line by line
    • Fully understand the pattern before applying it
  3. Identify differences
    • What's different between the working code and the problematic code?
    • List every difference, no matter how small
    • Don't assume "that can't affect anything"
  4. Understand dependencies
    • What other components does this feature require?
    • What settings, configurations, environments are needed?
    • What implicit assumptions does it have?
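For configuration-style files, "list every difference" can be done mechanically rather than by eye. The sketch below assumes two line-oriented files, one from a working setup and one from the broken one. The function name `list_differences` is hypothetical.

```bash
# List every line that differs between a known-good file and a failing one.
# Lines are sorted first so that ordering alone does not show up as a difference.
# Usage: list_differences <working_file> <broken_file>
list_differences() {
  local good_sorted bad_sorted status=0
  good_sorted=$(mktemp) || return 2
  bad_sorted=$(mktemp) || { rm -f "$good_sorted"; return 2; }
  sort "$1" >"$good_sorted"
  sort "$2" >"$bad_sorted"
  # diff exits 0 when the inputs match and 1 when they differ; both are valid answers.
  diff "$good_sorted" "$bad_sorted" || status=$?
  rm -f "$good_sorted" "$bad_sorted"
  return "$status"
}
```

Lines prefixed with `<` come from the working file and lines prefixed with `>` from the broken one. Treat each reported line as a candidate cause until it has been ruled out.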

Phase 3: Hypothesis and Verification

Scientific method:
  1. Form a single hypothesis
    • State clearly: "I believe X is the root cause because Y"
    • Write it down
    • Be specific, not vague
  2. Minimal testing
    • Make the smallest change to verify the hypothesis
    • Change only one variable at a time
    • Don't fix multiple issues at once
  3. Verify before proceeding
    • Did it work? Yes → move to Phase 4
    • Didn't work? Form a new hypothesis
    • Don't stack more fixes on top
  4. When you're unsure
    • Say "I don't understand X"
    • Don't pretend you know
    • Ask for help
    • Do more research
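When the hypothesis concerns an environment variable, "change only one variable at a time" might look like the sketch below. It runs the same command twice, once unchanged and once with a single override, so any change in outcome can be attributed to that variable. The function name `test_one_variable` and the variable `MODE` in the usage comment are hypothetical.

```bash
# Compare a baseline run with a run that overrides exactly one environment variable.
# The command must be an external program, because env cannot call shell functions.
# Usage: test_one_variable NAME=value <command> [args...]
test_one_variable() {
  local assignment="$1"; shift
  local baseline=fail changed=fail
  if "$@" >/dev/null 2>&1; then baseline=pass; fi
  # env applies the override to this single invocation only.
  if env "$assignment" "$@" >/dev/null 2>&1; then changed=pass; fi
  echo "baseline: $baseline, with $assignment: $changed"
}

# Example call (hypothetical variable and check):
#   test_one_variable MODE=safe sh -c '[ "$MODE" = safe ]'
```

If both runs give the same result, the hypothesis about that variable is not supported. Form a new one instead of adding a second override on top.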

Phase 4: Implementation

Fix the root cause, not the symptom:
  1. Create a failing test case
    • Minimal reproduction
    • Use automated tests whenever possible
    • Write a one-time test script if no test framework is available
    • You must have a failing test before you fix anything
    • Use the superpowers:test-driven-development skill to write a proper failing test
  2. Implement a single fix
    • Fix the identified root cause
    • Change only one thing at a time
    • Don't make "while I'm at it" optimizations
    • Don't bundle refactoring
  3. Verify the fix
    • Does the test pass now?
    • Are all other tests still passing?
    • Is the problem truly resolved?
  4. If the fix doesn't work
    • Stop
    • Count: How many fixes have you tried?
    • Fewer than 3: Go back to Phase 1, re-analyze with new information
    • 3 or more: Stop and question the architecture (see Step 5 below)
    • Don't attempt a 4th fix without architectural discussion
  5. If 3+ fixes fail: Question the architecture
    These patterns indicate architectural issues:
    • Each fix reveals new shared state, coupling, or problems elsewhere
    • Fixes require "massive refactoring" to implement
    • Each fix creates new symptoms elsewhere
    Stop and question fundamental issues:
    • Is this pattern fundamentally sound?
    • Are we sticking to a bad solution out of inertia?
    • Should we refactor the architecture or keep patching symptoms?
    Discuss with your partner before attempting more fixes
    This isn't hypothesis failure — this is architectural failure.
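Where no test framework is available, the one-time test script from Step 1 can be very small. The sketch below is one possible shape. Both `assert_eq` and `strip_trailing_slash` are hypothetical names, and the function under test is a stand-in for the code being debugged.

```bash
# Minimal assertion helper for a one-off regression script.
# Prints PASS or FAIL and returns non-zero on a mismatch.
# Usage: assert_eq <description> <expected> <actual>
assert_eq() {
  if [ "$2" = "$3" ]; then
    echo "PASS: $1"
  else
    echo "FAIL: $1 (expected '$2', got '$3')"
    return 1
  fi
}

# Hypothetical function under test: remove one trailing slash from a path.
strip_trailing_slash() {
  printf '%s\n' "${1%/}"
}

# The regression check; the script exits non-zero if it fails.
assert_eq "strips trailing slash" "/tmp/build" "$(strip_trailing_slash "/tmp/build/")" || exit 1
```

Run the script once before the fix to confirm that it prints FAIL for the right reason, then again afterwards to confirm that it prints PASS.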

Red Lines — Stop and Follow the Process

If you catch yourself thinking:
  • "I'll fix it temporarily and investigate later"
  • "Let me try changing X to see if it works"
  • "Change multiple things at once and run tests"
  • "Skip tests, I'll verify manually"
  • "It's probably X, let me fix it"
  • "I don't fully understand, but this should work"
  • "The pattern says X, but I'll use it differently"
  • "The main issues are: [listing fixes without investigation]"
  • Proposing solutions without tracing data flow
  • "Just try one more fix" (after 2+ failed attempts)
  • Each fix reveals new issues in different places
All of these mean: Stop. Go back to Phase 1.
If 3+ fixes fail: Question the architecture (see Phase 4, Step 5)

Partner Signals — Your Approach Is Wrong

Watch for these reminders:
  • "Isn't it...?" — You're making assumptions without verification
  • "Can it tell us...?" — You should collect evidence first
  • "Don't guess" — You're proposing fixes without understanding
  • "Think deeper" — You need to question fundamental issues, not just symptoms
  • "Are we stuck?" (frustrated tone) — Your approach isn't working
When you see these signals: Stop. Go back to Phase 1.

Common Excuses

| Excuse | Reality |
| --- | --- |
| "The problem is simple, no need to follow the process" | Even simple problems have root causes. For simple bugs, the process is quick to complete. |
| "It's an emergency, no time for the process" | Systematic debugging is faster than trial and error. |
| "Try it first, then investigate" | The first fix sets the tone. Do it right from the start. |
| "Write tests after confirming the fix works" | Fixes without tests don't last. Writing tests first proves the fix works. |
| "Fixing multiple issues at once saves time" | You can't isolate what worked. It also introduces new bugs. |
| "The reference implementation is too long, I'll modify it myself" | Partial understanding inevitably leads to bugs. Read it fully. |
| "I see the problem, let me fix it" | Seeing the symptom ≠ understanding the root cause. |
| "Just try one more time" (after 2+ failures) | 3+ failures = architectural issue. Question the pattern, don't keep fixing. |

Cheat Sheet

| Phase | Key Activities | Pass Criteria |
| --- | --- | --- |
| 1. Root Cause | Read errors, reproduce, check changes, collect evidence | Understand what went wrong and why |
| 2. Pattern | Find working examples, compare | Identify differences |
| 3. Hypothesis | Form theory, minimal verification | Hypothesis is verified or new hypothesis formed |
| 4. Implementation | Create test, fix, verify | Bug is fixed, tests pass |

When the Process Says "No Root Cause Found"

If systematic investigation reveals the issue is indeed environment-related, timing-related, or caused by external factors:
  1. You've completed the process
  2. Document what you investigated
  3. Implement appropriate handling (retries, timeouts, error messages)
  4. Add monitoring/logging for future investigation
However: 95% of "no root cause found" cases are due to insufficient investigation.
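If the cause really is external, the handling in step 3 should stay bounded and visible. The sketch below shows one possible retry wrapper that caps the number of attempts and logs every failure to stderr, which also covers part of step 4. The name `retry` and its argument order are assumptions made for illustration.

```bash
# Retry a command a limited number of times, logging each failure.
# Usage: retry <max_attempts> <delay_seconds> <command> [args...]
retry() {
  local max_attempts="$1" delay_seconds="$2"; shift 2
  local attempt=1
  while :; do
    if "$@"; then
      return 0
    fi
    # Log every failure so the flakiness stays visible instead of hidden.
    echo "attempt $attempt/$max_attempts failed: $*" >&2
    if [ "$attempt" -ge "$max_attempts" ]; then
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay_seconds"
  done
}
```

The cap matters because an unbounded retry loop hides the problem. The logged attempt counts are the evidence that a later investigation can start from.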

Supporting Techniques

The following techniques are part of systematic debugging and can be found in this directory:
  • root-cause-tracing.md - Trace bugs backward along the call stack to find the initial trigger
  • defense-in-depth.md - After finding the root cause, add validation at multiple levels
  • condition-based-waiting.md - Replace hardcoded wait times with conditional polling
Related Skills:
  • superpowers:test-driven-development - Used to create failing test cases (Phase 4, Step 1)
  • superpowers:verification-before-completion - Verify the fix actually works before declaring success
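To illustrate the general idea behind condition-based waiting, the sketch below polls a condition until it holds or a timeout expires, instead of sleeping for a fixed time. It is not taken from condition-based-waiting.md. The helper name `wait_for` and the one-second polling interval are assumptions.

```bash
# Poll a condition once per second until it succeeds or the timeout elapses.
# Usage: wait_for <timeout_seconds> <command> [args...]
wait_for() {
  local timeout_seconds="$1"; shift
  local elapsed=0
  until "$@"; do
    if [ "$elapsed" -ge "$timeout_seconds" ]; then
      echo "timed out after ${timeout_seconds}s waiting for: $*" >&2
      return 1
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
}

# Example call (hypothetical path): wait up to 30 seconds for a file to appear.
#   wait_for 30 test -e /tmp/server.ready
```

Compared with a fixed `sleep`, this returns as soon as the condition is true and fails with a clear message when it never becomes true.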

Practical Results

Data from debugging practices:
  • Systematic approach: 15-30 minutes to fix
  • Random fix approach: 2-3 hours of trial and error
  • First-fix success rate: 95% (systematic) vs 40% (random fixes)
  • New bugs introduced: almost none (systematic) vs frequent (random fixes)