systematic-debugging
Systematic Debugging
Overview
Random fixes waste time and introduce new bugs. Hasty patches only mask underlying issues.
Core Principle: Always find the root cause before attempting a fix. Fixing only the symptom is a failure.
Going through the motions defeats the purpose of debugging.
Non-Negotiable Rule
No fix proposals without root cause investigation first.
If you have not completed Phase 1, you cannot propose a fix.
When to Use
Use for any technical issue:
- Test failures
- Production bugs
- Abnormal behavior
- Performance issues
- Build failures
- Integration issues
Use it especially when:
- You are under time pressure (emergencies are when guessing at fixes is most tempting)
- You think "a small change will fix it"
- You have already tried multiple fixes
- The last fix didn't work
- You don't fully understand the problem
Do NOT skip in these cases either:
- The problem seems simple (even simple bugs have root causes)
- You're in a hurry (rushing leads to rework)
- Leadership demands an immediate fix (systematic debugging is faster than trial and error)
Four Phases
You must complete each phase before moving to the next.
Phase 1: Root Cause Investigation
Before attempting any fix:

1. Read error messages carefully
   - Don't skip errors or warnings
   - They often contain the solution directly
   - Read the full stack trace
   - Note line numbers, file paths, error codes

2. Reproduce consistently
   - Can you reliably trigger it?
   - What are the exact reproduction steps?
   - Does it reproduce every time?
   - If you can't reproduce → collect more data, don't guess

3. Check recent changes
   - What changes could have caused this issue?
   - git diff, recent commits
   - New dependencies, configuration changes
   - Environment differences

4. Collect evidence in multi-component systems

   When the system has multiple components (CI → Build → Signing, API → Service → Database), add diagnostic instrumentation before proposing a fix.

   For each component boundary:
   - Record data entering the component
   - Record data leaving the component
   - Verify environment/configuration propagation
   - Check state at each layer

   Run once to collect evidence and identify where the break occurs. Then analyze the evidence to locate the faulty component. Then investigate that component in depth.

   Example (multi-layer system):

   ```bash
   # Layer 1: Workflow
   echo "=== Secrets available in workflow: ==="
   # Report only whether IDENTITY is set; never print the secret itself
   echo "IDENTITY: $([ -n "${IDENTITY:-}" ] && echo SET || echo UNSET)"

   # Layer 2: Build script
   echo "=== Env vars in build script: ==="
   # grep -q checks for the variable without echoing its value
   env | grep -q '^IDENTITY=' && echo "IDENTITY in environment" || echo "IDENTITY not in environment"

   # Layer 3: Signing script
   echo "=== Keychain state: ==="
   security list-keychains
   security find-identity -v

   # Layer 4: Actual signing
   codesign --sign "$IDENTITY" --verbose=4 "$APP"
   ```

   This reveals which layer has the issue (secrets → workflow ✓, workflow → build ✗).

5. Trace data flow

   When errors occur deep in the call stack, see root-cause-tracing.md in this directory for the full backward-tracing technique.

   Short version:
   - Where did the incorrect value originate?
   - Who called this with the incorrect value?
   - Keep tracing upward until you find the source
   - Fix at the source, not at the symptom
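The reproduction questions in step 2 (can you trigger it reliably, does it happen every time) can be answered with a small loop. The following is a minimal sketch and not part of the original skill: `repro_rate` is a hypothetical helper name, and the command passed to it stands in for whatever triggers your bug.

```bash
# repro_rate RUNS CMD [ARGS...]
# Hypothetical helper: runs CMD the given number of times and reports
# how many of those runs exited with a non-zero status.
repro_rate() {
  local runs="$1" failures=0 i=0
  shift
  while [ "$i" -lt "$runs" ]; do
    # Output is discarded; only the exit status matters here
    if ! "$@" >/dev/null 2>&1; then
      failures=$((failures + 1))
    fi
    i=$((i + 1))
  done
  echo "$failures/$runs runs failed"
}

# Usage (the script name is a placeholder):
#   repro_rate 20 ./run_flaky_test.sh
```

If every run fails, the bug reproduces reliably. An intermittent ratio suggests collecting more data about timing or environment before forming a hypothesis.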
Phase 2: Pattern Analysis
Find patterns before fixing:

1. Find working examples
   - Look for similar working code in the same codebase
   - What working code resembles the problematic code?

2. Compare with reference implementations
   - If implementing a pattern, read the reference implementation fully
   - Don't skim; read line by line
   - Fully understand the pattern before applying it

3. Identify differences
   - What's different between the working code and the problematic code?
   - List every difference, no matter how small
   - Don't assume "that can't affect anything"

4. Understand dependencies
   - What other components does this feature require?
   - What settings, configurations, environments are needed?
   - What implicit assumptions does it have?
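Step 3 asks for every difference, however small. A mechanical diff is one way to avoid filtering the list by intuition. The sketch below is an illustration under assumptions: `list_differences` is a hypothetical helper, and the two files are whatever working and broken artifacts (configs, scripts, manifests) you are comparing.

```bash
# list_differences WORKING BROKEN
# Hypothetical helper: prints every differing line in unified format.
# diff exits 1 when the files differ, which is expected here, so only
# exit status 2 or higher (a real error such as a missing file) is
# reported as failure.
list_differences() {
  local status=0
  diff -u "$1" "$2" || status=$?
  [ "$status" -le 1 ]
}

# Usage (paths are placeholders):
#   list_differences config/working.conf config/broken.conf
```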
Phase 3: Hypothesis and Verification
Scientific method:

1. Form a single hypothesis
   - State clearly: "I believe X is the root cause because Y"
   - Write it down
   - Be specific, not vague

2. Minimal testing
   - Make the smallest change to verify the hypothesis
   - Change only one variable at a time
   - Don't fix multiple issues at once

3. Verify before proceeding
   - Did it work? Yes → move to Phase 4
   - Didn't work? Form a new hypothesis
   - Don't stack more fixes on top

4. When you're unsure
   - Say "I don't understand X"
   - Don't pretend you know
   - Ask for help
   - Do more research
Phase 4: Implementation
Fix the root cause, not the symptom:

1. Create a failing test case
   - Minimal reproduction
   - Use automated tests whenever possible
   - Write a one-time test script if no test framework is available
   - You must have a test before fixing
   - Use the superpowers:test-driven-development skill to write a proper failing test

2. Implement a single fix
   - Fix the identified root cause
   - Change only one thing at a time
   - Don't make "while I'm at it" optimizations
   - Don't bundle refactoring

3. Verify the fix
   - Does the test pass now?
   - Are all other tests still passing?
   - Is the problem truly resolved?

4. If the fix doesn't work
   - Stop
   - Count: how many fixes have you tried?
   - Fewer than 3: go back to Phase 1 and re-analyze with the new information
   - 3 or more: stop and question the architecture (see Step 5 below)
   - Don't attempt a 4th fix without an architectural discussion

5. If 3+ fixes fail: question the architecture

   These patterns indicate an architectural problem:
   - Each fix reveals new shared state, coupling, or issues elsewhere
   - The fix requires "massive refactoring" to implement
   - Each fix creates new symptoms elsewhere

   Stop and question the fundamentals:
   - Is this pattern fundamentally sound?
   - Are we sticking to a bad solution out of inertia?
   - Should we refactor the architecture or keep patching symptoms?

   Discuss with your partner before attempting more fixes.

   This isn't a failed hypothesis; it is a flawed architecture.
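Step 1 above allows a one-time test script when no test framework is available. A minimal sketch of such a script follows. `assert_eq` is a hypothetical helper written for this illustration, and the command in the usage comment is a placeholder for whatever reproduces the bug.

```bash
# assert_eq EXPECTED ACTUAL LABEL
# Hypothetical helper for a throwaway test script: prints PASS or FAIL
# and returns non-zero on mismatch, so the script can gate a fix.
assert_eq() {
  if [ "$1" = "$2" ]; then
    echo "PASS: $3"
  else
    echo "FAIL: $3 (expected '$1', got '$2')"
    return 1
  fi
}

# Usage: capture the minimal reproduction as a test BEFORE fixing.
# `./mytool --check` is a placeholder command, not a real tool.
#   actual=$(./mytool --check 2>&1)
#   assert_eq "ok" "$actual" "check reports ok"
```

The script should fail before the fix and pass after it. If it already passes before the fix, it does not reproduce the bug.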
Red Lines — Stop and Follow the Process
If you catch yourself thinking:
- "I'll fix it temporarily and investigate later"
- "Let me try changing X to see if it works"
- "Change multiple things at once and run tests"
- "Skip tests, I'll verify manually"
- "It's probably X, let me fix it"
- "I don't fully understand, but this should work"
- "The pattern says X, but I'll use it differently"
- "The main issues are: [listing fixes without investigation]"
- Proposing solutions without tracing data flow
- "Just try one more fix" (after 2+ failed attempts)
- Each fix reveals new issues in different places
All of these mean: Stop. Go back to Phase 1.
If 3+ fixes fail: Question the architecture (see Phase 4, Step 5)
Partner Signals — Your Approach Is Wrong
Watch for these reminders:
- "Isn't it...?" — You're making assumptions without verification
- "Can it tell us...?" — You should collect evidence first
- "Don't guess" — You're proposing fixes without understanding
- "Think deeper" — You need to question fundamental issues, not just symptoms
- "Are we stuck?" (frustrated tone) — Your approach isn't working
When you see these signals: Stop. Go back to Phase 1.
Common Excuses
| Excuse | Reality |
|---|---|
| "The problem is simple, no need to follow the process" | Even simple problems have root causes. For simple bugs, the process is quick to complete. |
| "It's an emergency, no time for the process" | Systematic debugging is faster than trial and error. |
| "Try it first, then investigate" | The first fix sets the tone. Do it right from the start. |
| "Write tests after confirming the fix works" | Fixes without tests don't last. Writing tests first proves the fix works. |
| "Fixing multiple issues at once saves time" | You can't isolate what worked. It also introduces new bugs. |
| "The reference implementation is too long, I'll modify it myself" | Partial understanding inevitably leads to bugs. Read it fully. |
| "I see the problem, let me fix it" | Seeing the symptom ≠ understanding the root cause. |
| "Just try one more time" (after 2+ failures) | 3+ failures = architectural issue. Question the pattern, don't keep fixing. |
Cheat Sheet
| Phase | Key Activities | Pass Criteria |
|---|---|---|
| 1. Root Cause | Read errors, reproduce, check changes, collect evidence | Understand what went wrong and why |
| 2. Pattern | Find working examples, compare | Identify differences |
| 3. Hypothesis | Form theory, minimal verification | Hypothesis is verified or new hypothesis formed |
| 4. Implementation | Create test, fix, verify | Bug is fixed, tests pass |
When the Process Says "No Root Cause Found"
If systematic investigation reveals the issue is indeed environment-related, timing-related, or caused by external factors:
- You've completed the process
- Document what you investigated
- Implement appropriate handling (retries, timeouts, error messages)
- Add monitoring/logging for future investigation
However: 95% of "no root cause found" cases are due to insufficient investigation.
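When the cause really is external, the handling listed above (retries plus logging) might look like the following sketch. `retry_with_log` is a hypothetical helper, and the log format is an assumption chosen for illustration.

```bash
# retry_with_log MAX_ATTEMPTS DELAY_SECONDS CMD [ARGS...]
# Hypothetical helper: retries CMD up to MAX_ATTEMPTS times and logs each
# failure to stderr with a UTC timestamp, leaving data for later
# investigation.
retry_with_log() {
  local max="$1" delay="$2" attempt=1
  shift 2
  while true; do
    if "$@"; then
      return 0
    fi
    echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) attempt $attempt/$max failed: $*" >&2
    if [ "$attempt" -ge "$max" ]; then
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Usage (the URL is a placeholder):
#   retry_with_log 3 5 curl --fail --silent https://example.com/health
```

The logged failures matter as much as the retry itself. Without them, a recurring external failure stays invisible and cannot be investigated later.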
Supporting Techniques
The following techniques are part of systematic debugging and can be found in this directory:
- root-cause-tracing.md - Trace bugs backward along the call stack to find the initial trigger
- defense-in-depth.md - After finding the root cause, add validation at multiple levels
- condition-based-waiting.md - Replace hardcoded wait times with conditional polling
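The contents of condition-based-waiting.md are not reproduced here, so the following is only a sketch of what replacing a hardcoded wait with conditional polling can look like. `wait_for` is a hypothetical helper, and the one-second poll interval is an arbitrary assumption.

```bash
# wait_for TIMEOUT_SECONDS CMD [ARGS...]
# Hypothetical helper: polls CMD about once per second until it succeeds
# or roughly TIMEOUT_SECONDS have passed. Returns 0 on success and 1 on
# timeout.
wait_for() {
  local timeout="$1" waited=0
  shift
  until "$@"; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out after ${timeout}s waiting for: $*" >&2
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}

# Usage, replacing a fixed `sleep 30` (the file path is a placeholder):
#   wait_for 30 test -f /tmp/server.ready
```

Compared with a fixed sleep, this returns as soon as the condition holds and fails loudly when it never does.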
Related Skills:
- superpowers:test-driven-development - Used to create failing test cases (Phase 4, Step 1)
- superpowers:verification-before-completion - Verify the fix actually works before declaring success
Practical Results
Data from debugging practices:
- Systematic approach: 15-30 minutes to fix
- Random fix approach: 2-3 hours of trial and error
- First-fix success rate: 95% (systematic) vs 40% (random fixes)
- New bugs introduced: almost none (systematic) vs frequent (random fixes)