zeno-systematic-debugging
Systematic Debugging
Overview
Random fixes waste time and create new bugs. Quick patches mask underlying issues.
Core principle: ALWAYS find root cause before attempting fixes. Symptom fixes are failure.
Violating the letter of this process is violating the spirit of debugging.
The Iron Law
NO FIXES WITHOUT ROOT CAUSE INVESTIGATION FIRST
If you haven't completed Phase 1, you cannot propose fixes.
When to Use
Use for ANY technical issue:
- Test failures
- Bugs in production
- Unexpected behavior
- Performance problems
- Build failures
- Integration issues
Use this ESPECIALLY when:
- Under time pressure (emergencies make guessing tempting)
- "Just one quick fix" seems obvious
- You've already tried multiple fixes
- Previous fix didn't work
- You don't fully understand the issue
Don't skip when:
- Issue seems simple (simple bugs have root causes too)
- You're in a hurry (rushing guarantees rework)
- Manager wants it fixed NOW (systematic is faster than thrashing)
The Four Phases
You MUST complete each phase before proceeding to the next.
Phase 1: Root Cause Investigation
BEFORE attempting ANY fix:
1. Read Error Messages Carefully
   - Don't skip past errors or warnings
   - They often contain the exact solution
   - Read stack traces completely
   - Note line numbers, file paths, error codes
2. Reproduce Consistently
   - Can you trigger it reliably?
   - What are the exact steps?
   - Does it happen every time?
   - If not reproducible → gather more data, don't guess
3. Check Recent Changes
   - What changed that could cause this?
   - Git diff, recent commits
   - New dependencies, config changes
   - Environmental differences
4. Gather Evidence in Multi-Component Systems

   WHEN system has multiple components (CI → build → signing, API → service → database):

   BEFORE proposing fixes, add diagnostic instrumentation:

   ```
   For EACH component boundary:
     - Log what data enters component
     - Log what data exits component
     - Verify environment/config propagation
     - Check state at each layer

   Run once to gather evidence showing WHERE it breaks
   THEN analyze evidence to identify failing component
   THEN investigate that specific component
   ```

   Example (multi-layer system):

   ```bash
   # Layer 1: Workflow
   echo "=== Secrets available in workflow: ==="
   # Report only whether the secret is set; never print its value into logs
   if [ -n "${IDENTITY:-}" ]; then echo "IDENTITY: SET"; else echo "IDENTITY: UNSET"; fi

   # Layer 2: Build script
   echo "=== Env vars in build script: ==="
   # -q keeps the secret's value out of the log while still testing for presence
   env | grep -q '^IDENTITY=' && echo "IDENTITY in environment" || echo "IDENTITY not in environment"

   # Layer 3: Signing script
   echo "=== Keychain state: ==="
   security list-keychains
   security find-identity -v

   # Layer 4: Actual signing
   codesign --sign "$IDENTITY" --verbose=4 "$APP"
   ```

   This reveals: Which layer fails (secrets → workflow ✓, workflow → build ✗)
5. Trace Data Flow

   WHEN error is deep in call stack:

   See `root-cause-tracing.md` in this directory for the complete backward tracing technique.

   Quick version:
   - Where does bad value originate?
   - What called this with bad value?
   - Keep tracing up until you find the source
   - Fix at source, not at symptom
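To make "Reproduce Consistently" concrete, here is a minimal sketch that runs a command several times and reports how often it fails. The names `repro_rate` and `flaky_step` are hypothetical, invented for this illustration; `flaky_step` is only a stand-in for whatever command triggers the bug.

```bash
# Sketch: measure how reliably a failure reproduces.
# repro_rate and flaky_step are hypothetical names for illustration only.

repro_rate() {
  local runs="$1"
  shift
  local failures=0
  local i=0
  while [ "$i" -lt "$runs" ]; do
    i=$((i + 1))
    # Run the command under test; discard its output, keep only the status
    if ! "$@" >/dev/null 2>&1; then
      failures=$((failures + 1))
    fi
  done
  echo "$failures/$runs runs failed"
}

# Stand-in for the real failing step: fails on every third call
calls=0
flaky_step() {
  calls=$((calls + 1))
  [ $((calls % 3)) -ne 0 ]
}

repro_rate 9 flaky_step   # prints "3/9 runs failed"
```

A failure rate below 100% hints at a timing or state dependency. That is a reason to gather more data, not to start guessing.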
Phase 2: Pattern Analysis
Find the pattern before fixing:
1. Find Working Examples
   - Locate similar working code in same codebase
   - What works that's similar to what's broken?
2. Compare Against References
   - If implementing pattern, read reference implementation COMPLETELY
   - Don't skim - read every line
   - Understand the pattern fully before applying
3. Identify Differences
   - What's different between working and broken?
   - List every difference, however small
   - Don't assume "that can't matter"
4. Understand Dependencies
   - What other components does this need?
   - What settings, config, environment?
   - What assumptions does it make?
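One way to "list every difference, however small" is to let a tool do the listing. The sketch below assumes the working and broken setups can each be dumped to a `KEY=value` file; `config_diff` and the sample file contents are hypothetical.

```bash
# Sketch: mechanically list every difference between a working and a broken
# configuration. config_diff and the sample settings are hypothetical.

config_diff() {
  local tmp status
  tmp=$(mktemp -d)
  # Sort both sides so ordering differences do not show up as noise
  sort "$1" > "$tmp/working"
  sort "$2" > "$tmp/broken"
  # Lines starting with "<" exist only in the working file, ">" only in the broken one
  diff "$tmp/working" "$tmp/broken"
  status=$?
  rm -rf "$tmp"
  return "$status"
}

workdir=$(mktemp -d)
printf 'TIMEOUT=30\nREGION=us-east-1\nRETRIES=3\n' > "$workdir/working.env"
printf 'REGION=us-east-1\nTIMEOUT=30\nRETRIES=0\nDEBUG=1\n' > "$workdir/broken.env"

# diff exits non-zero when the files differ, so ignore the status here
config_diff "$workdir/working.env" "$workdir/broken.env" || true
rm -rf "$workdir"
```

In this made-up case the output flags both `RETRIES` and `DEBUG`. Neither can be dismissed as "that can't matter" until it has been tested.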
Phase 3: Hypothesis and Testing
Scientific method:
1. Form Single Hypothesis
   - State clearly: "I think X is the root cause because Y"
   - Write it down
   - Be specific, not vague
2. Test Minimally
   - Make the SMALLEST possible change to test hypothesis
   - One variable at a time
   - Don't fix multiple things at once
3. Verify Before Continuing
   - Did it work? Yes → Phase 4
   - Didn't work? Form NEW hypothesis
   - DON'T add more fixes on top
4. When You Don't Know
   - Say "I don't understand X"
   - Don't pretend to know
   - Ask for help
   - Research more
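"One variable at a time" can be enforced mechanically. This sketch overrides exactly one environment variable per run and leaves everything else untouched. `try_override`, `run_case`, and the `CACHE_MODE` and `LOG_LEVEL` variables are all hypothetical; `run_case` stands in for the real test command.

```bash
# Sketch: test a hypothesis by changing exactly one variable per run.
# try_override, run_case, CACHE_MODE and LOG_LEVEL are hypothetical.

unset CACHE_MODE LOG_LEVEL

# Stand-in for the system under test: fails only in "aggressive" cache mode
run_case() {
  [ "${CACHE_MODE:-off}" != "aggressive" ]
}

try_override() {
  local name="$1"
  local value="$2"
  shift 2
  # The subshell keeps the override from leaking into later runs
  if ( export "$name=$value"; "$@" ); then
    echo "$name=$value: pass"
  else
    echo "$name=$value: FAIL"
  fi
}

echo "Hypothesis: CACHE_MODE=aggressive is the root cause"
try_override CACHE_MODE off        run_case   # baseline
try_override CACHE_MODE aggressive run_case   # the single change under test
try_override LOG_LEVEL  debug      run_case   # unrelated variable as a control
```

If only the run with the suspected value fails, the hypothesis is confirmed. If the outcome does not change, form a new hypothesis instead of stacking another override on top.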
Phase 4: Implementation
Fix the root cause, not the symptom:
1. Create Failing Test Case
   - Simplest possible reproduction
   - Automated test if possible
   - One-off test script if no framework
   - MUST have before fixing
2. Implement Single Fix
   - Address the root cause identified
   - ONE change at a time
   - No "while I'm here" improvements
   - No bundled refactoring
3. Verify Fix
   - Test passes now?
   - No other tests broken?
   - Issue actually resolved?
4. If Fix Doesn't Work
   - STOP
   - Count: How many fixes have you tried?
   - If < 3: Return to Phase 1, re-analyze with new information
   - If ≥ 3: STOP and question the architecture (step 5 below)
   - DON'T attempt Fix #4 without architectural discussion
5. If 3+ Fixes Failed: Question Architecture

   Pattern indicating architectural problem:
   - Each fix reveals new shared state/coupling/problem in different place
   - Fixes require "massive refactoring" to implement
   - Each fix creates new symptoms elsewhere

   STOP and question fundamentals:
   - Is this pattern fundamentally sound?
   - Are we "sticking with it through sheer inertia"?
   - Should we refactor architecture vs. continue fixing symptoms?

   Discuss with your human partner before attempting more fixes.

   This is NOT a failed hypothesis - this is a wrong architecture.
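When no test framework is available, the "one-off test script" from the first step of this phase can be as small as the sketch below. The `trim` function and its whitespace bug are invented for illustration, and `check` is a hypothetical helper, not part of any library.

```bash
# Sketch: a one-off test script written before the fix.
# trim, its whitespace bug, and check are hypothetical.

# Function under test, shown here AFTER the fix. The assumed bug: trailing
# whitespace was not stripped, so the first check below used to fail.
trim() {
  printf '%s' "$1" | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//'
}

failed=0
check() {
  local desc="$1"
  local expected="$2"
  local actual="$3"
  if [ "$actual" = "$expected" ]; then
    echo "ok   - $desc"
  else
    echo "FAIL - $desc (expected '$expected', got '$actual')"
    failed=1
  fi
}

# Simplest possible reproduction of the bug
check "trims trailing whitespace" "value" "$(trim 'value  ')"
# Behavior that already worked and must keep working
check "leaves clean input alone" "value" "$(trim 'value')"

[ "$failed" -eq 0 ] && echo "all checks passed"
```

Run the script once against the unfixed code to confirm it fails for the expected reason, then apply the single fix and run it again.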
Red Flags - STOP and Follow Process
If you catch yourself thinking:
- "Quick fix for now, investigate later"
- "Just try changing X and see if it works"
- "Add multiple changes, run tests"
- "Skip the test, I'll manually verify"
- "It's probably X, let me fix that"
- "I don't fully understand but this might work"
- "Pattern says X but I'll adapt it differently"
- "Here are the main problems: [lists fixes without investigation]"
- Proposing solutions before tracing data flow
- "One more fix attempt" (when already tried 2+)
- Each fix reveals new problem in different place
ALL of these mean: STOP. Return to Phase 1.
If 3+ fixes failed: Question the architecture (see Phase 4, step 5)
Your Human Partner's Signals You're Doing It Wrong
Watch for these redirections:
- "Is that not happening?" - You assumed without verifying
- "Will it show us...?" - You should have added evidence gathering
- "Stop guessing" - You're proposing fixes without understanding
- "Ultrathink this" - Question fundamentals, not just symptoms
- "We're stuck?" (frustrated) - Your approach isn't working
When you see these: STOP. Return to Phase 1.
Common Rationalizations
| Excuse | Reality |
|---|---|
| "Issue is simple, don't need process" | Simple issues have root causes too. Process is fast for simple bugs. |
| "Emergency, no time for process" | Systematic debugging is FASTER than guess-and-check thrashing. |
| "Just try this first, then investigate" | First fix sets the pattern. Do it right from the start. |
| "I'll write test after confirming fix works" | Untested fixes don't stick. Test first proves it. |
| "Multiple fixes at once saves time" | Can't isolate what worked. Causes new bugs. |
| "Reference too long, I'll adapt the pattern" | Partial understanding guarantees bugs. Read it completely. |
| "I see the problem, let me fix it" | Seeing symptoms ≠ understanding root cause. |
| "One more fix attempt" (after 2+ failures) | 3+ failures = architectural problem. Question pattern, don't fix again. |
Quick Reference
| Phase | Key Activities | Success Criteria |
|---|---|---|
| 1. Root Cause | Read errors, reproduce, check changes, gather evidence | Understand WHAT and WHY |
| 2. Pattern | Find working examples, compare | Identify differences |
| 3. Hypothesis | Form theory, test minimally | Confirmed or new hypothesis |
| 4. Implementation | Create test, fix, verify | Bug resolved, tests pass |
When Process Reveals "No Root Cause"
If systematic investigation reveals issue is truly environmental, timing-dependent, or external:
- You've completed the process
- Document what you investigated
- Implement appropriate handling (retry, timeout, error message)
- Add monitoring/logging for future investigation
But: 95% of "no root cause" cases are incomplete investigation.
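For a genuinely external or timing-dependent failure, "appropriate handling" could look like the bounded retry below, which also logs every failed attempt for later investigation. This is a sketch under stated assumptions: `retry` and `flaky_fetch` are hypothetical names, and `flaky_fetch` merely simulates a dependency that succeeds on its third call.

```bash
# Sketch: bounded retry that leaves evidence behind.
# retry and flaky_fetch are hypothetical names for illustration.

retry() {
  local attempts="$1"
  local delay="$2"
  shift 2
  local n=1
  while true; do
    if "$@"; then
      return 0
    fi
    # Log every failure so intermittent problems remain investigable
    echo "attempt $n/$attempts failed: $*" >&2
    if [ "$n" -ge "$attempts" ]; then
      echo "giving up after $attempts attempts: $*" >&2
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}

# Stand-in for an external dependency that succeeds on the third call
tries=0
flaky_fetch() {
  tries=$((tries + 1))
  [ "$tries" -ge 3 ]
}

retry 5 0 flaky_fetch && echo "succeeded after $tries tries"   # prints "succeeded after 3 tries"
```

The retry limit and the log lines matter as much as the retry itself: an unbounded, silent retry would hide exactly the evidence a future investigation needs.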
Supporting Techniques
These techniques are part of systematic debugging and available in this directory:
- `root-cause-tracing.md` - Trace bugs backward through call stack to find original trigger
- `defense-in-depth.md` - Add validation at multiple layers after finding root cause
- `condition-based-waiting.md` - Replace arbitrary timeouts with condition polling
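As a rough illustration of replacing arbitrary timeouts with condition polling, the sketch below waits for a condition to become true instead of sleeping for a fixed time. It is not taken from `condition-based-waiting.md`; `wait_for` is a hypothetical helper, and the marker file simulates a service becoming ready.

```bash
# Sketch: poll for a condition instead of sleeping for a guessed duration.
# wait_for is a hypothetical helper; timeout and interval are whole seconds.

wait_for() {
  local timeout="$1"
  local interval="$2"
  shift 2
  local waited=0
  until "$@"; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out after ${timeout}s waiting for: $*" >&2
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  return 0
}

# Simulated slow startup: the marker file appears about a second from now
dir=$(mktemp -d)
( sleep 1; touch "$dir/ready" ) &

# Proceed as soon as the file exists, or fail loudly after 10 seconds
wait_for 10 1 test -e "$dir/ready" && echo "ready"
wait
rm -rf "$dir"
```

Compared with a bare `sleep 5`, this returns as soon as the condition holds and fails with a clear message when it never does.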
Related skills:
- Use the `zeno-validating-work` skill to verify the fix before claiming success
Real-World Impact
From debugging sessions:
- Systematic approach: 15-30 minutes to fix
- Random fixes approach: 2-3 hours of thrashing
- First-time fix rate: 95% vs 40%
- New bugs introduced: Near zero vs common