systematic-debugging

Systematic Debugging

Core principle: Find root cause before attempting fixes. Symptom fixes are failure.

NO FIXES WITHOUT ROOT CAUSE INVESTIGATION FIRST

Phase 1: Root Cause Investigation

BEFORE attempting ANY fix:

Read Error Messages Carefully
- Read stack traces completely
- Note line numbers, file paths, error codes
- Don't skip warnings
Reproduce Consistently
- What are the exact steps?
- If not reproducible → gather more data, don't guess
Check Recent Changes
- Git diff, recent commits
- New dependencies, config changes
- Environmental differences
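
One way to check "Reproduce Consistently" is to run the reproduction command several times and count failures. A minimal sketch, assuming a POSIX shell; `count_failures` is a hypothetical helper, and `false` stands in for the real reproduction command:

```shell
# count_failures (hypothetical helper): run a command N times and
# print how many of those runs exited non-zero.
count_failures() {
  runs=$1
  shift
  failures=0
  i=0
  while [ "$i" -lt "$runs" ]; do
    # Discard the command's output; only the exit status matters here
    if ! "$@" >/dev/null 2>&1; then
      failures=$((failures + 1))
    fi
    i=$((i + 1))
  done
  echo "$failures"
}

# Usage: replace `false` with the real reproduction command.
echo "failed $(count_failures 5 false) of 5 runs"
```

A count between zero and the number of runs suggests an intermittent failure: gather more data before forming a hypothesis.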

Gather Evidence in Multi-Component Systems

WHEN system has multiple components (CI → build → signing, API → service → database):

Add diagnostic instrumentation before proposing fixes:

For EACH component boundary:
  - Log what data enters/exits component
  - Verify environment/config propagation
  - Check state at each layer

Run once to gather evidence → analyze → identify failing component

Example:

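```bash
# Layer 1: Workflow
echo "=== Secrets available: ==="
# Print SET or UNSET only; never echo the secret value itself
echo "IDENTITY: $([ -n "${IDENTITY:-}" ] && echo SET || echo UNSET)"

# Layer 2: Build script (-q keeps the value out of the logs)
env | grep -q '^IDENTITY=' && echo "IDENTITY in environment" || echo "IDENTITY not in environment"

# Layer 3: Signing
security find-identity -v
```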

Trace Data Flow
See `references/root-cause-tracing.md` for backward tracing technique.
Quick version: Where does bad value originate? Trace up call chain until you find the source. Fix at source.

Phase 2: Pattern Analysis

Find Working Examples - Similar working code in codebase
Compare Against References - Read reference implementations COMPLETELY, don't skim
Identify Differences - List every difference, don't assume "that can't matter"
Understand Dependencies - Components, config, environment, assumptions
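
For "Identify Differences", a mechanical listing helps avoid dismissing a difference too early. A minimal sketch, assuming the working and broken setups can each be dumped to a line-oriented config file; `list_differences` and the demo file contents are hypothetical:

```shell
# list_differences (hypothetical helper): print every line that appears in
# only one of two files, prefixed with the side it came from.
list_differences() {
  working=$1
  broken=$2
  sorted_working=$(mktemp)
  sorted_broken=$(mktemp)
  # comm expects sorted input
  sort "$working" > "$sorted_working"
  sort "$broken" > "$sorted_broken"
  comm -23 "$sorted_working" "$sorted_broken" | sed 's/^/working only: /'
  comm -13 "$sorted_working" "$sorted_broken" | sed 's/^/broken only:  /'
  rm -f "$sorted_working" "$sorted_broken"
}

# Usage with made-up config files.
demo_dir=$(mktemp -d)
printf 'TIMEOUT=30\nREGION=us-east\nRETRIES=3\n' > "$demo_dir/working.env"
printf 'TIMEOUT=30\nREGION=us-west\n' > "$demo_dir/broken.env"
list_differences "$demo_dir/working.env" "$demo_dir/broken.env"
rm -rf "$demo_dir"
```

Every line printed is a candidate cause, including the ones that look irrelevant.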

Phase 3: Hypothesis and Testing

Form Single Hypothesis - "I think X is root cause because Y" - be specific
Test Minimally - SMALLEST possible change, one variable at a time
Verify - Worked → Phase 4. Didn't work → form NEW hypothesis, don't stack fixes
When You Don't Know - Say so. Don't pretend.
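
"Test Minimally" can be sketched as a check run with exactly one variable changed. This assumes the hypothesis is about a single environment variable; `try_one_change` is a hypothetical helper, and the `sh -c` check stands in for the real failing command:

```shell
# try_one_change (hypothetical helper): run a check command with exactly one
# environment variable overridden. The override applies only to that command,
# so the current shell's environment is left untouched.
try_one_change() {
  name=$1
  value=$2
  shift 2
  if env "$name=$value" "$@"; then
    echo "hypothesis supported: $name=$value makes the check pass"
  else
    echo "hypothesis rejected: $name=$value does not make the check pass"
    return 1
  fi
}

# Usage: substitute the real failing command for the stand-in check.
try_one_change LANG C sh -c '[ "$LANG" = "C" ]'
```

If the hypothesis is rejected, nothing needs reverting before the next one is formed.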

Phase 4: Implementation

Create Failing Test Case
- Use the `test-driven-development` skill
- MUST have before fixing
Implement Single Fix
- ONE change at a time
- No "while I'm here" improvements
Verify Fix
- Test passes? Other tests still pass? Issue resolved?
If Fix Doesn't Work
- Count attempts
- If < 3: Return to Phase 1 with new information
- If ≥ 3: Escalate (below)
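
The attempt-count rule above can be written down so it is not renegotiated mid-debugging. A minimal sketch; `next_step` is a hypothetical helper and the messages are illustrative wording:

```shell
# next_step (hypothetical helper): fewer than three failed fixes means
# return to Phase 1; three or more means escalate.
next_step() {
  failed_attempts=$1
  if [ "$failed_attempts" -lt 3 ]; then
    echo "return to Phase 1 with new information"
  else
    echo "escalate: stop fixing and question the architecture"
  fi
}

next_step 2
next_step 3
```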

Escalation: 3+ Failed Fixes

Pattern indicating architectural problem:

Each fix reveals new problems elsewhere
Fixes require massive refactoring
Shared state/coupling keeps surfacing

Action: STOP. Question fundamentals:

Is this pattern fundamentally sound?
Are we continuing through inertia?
Refactor architecture vs. continue fixing symptoms?

Discuss with human partner before more fix attempts. This is wrong architecture, not failed hypothesis.

Red Flags → STOP and Return to Phase 1

If you catch yourself thinking:

"Quick fix for now, investigate later"
"Just try changing X"
"I'll skip the test"
"It's probably X"
"Pattern says X but I'll adapt it differently"
Proposing solutions before tracing data flow
"One more fix" after 2+ failures

Human Signals You're Off Track

"Is that not happening?" → You assumed without verifying
"Will it show us...?" → You should have added evidence gathering
"Stop guessing" → You're proposing fixes without understanding
"Ultrathink this" → Question fundamentals
Frustrated "We're stuck?" → Your approach isn't working

Response: Return to Phase 1.

Supporting Techniques

Reference files in `references/`:

- `root-cause-tracing.md` - Trace bugs backward through call stack
- `defense-in-depth.md` - Add validation at multiple layers after finding root cause
- `condition-based-waiting.md` - Replace arbitrary timeouts with condition polling
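
As a rough illustration of condition polling (not taken from the reference file), the sketch below waits for a condition instead of sleeping for a fixed time. `wait_for` is a hypothetical helper; it assumes `date +%s` is available and uses a one-second poll interval:

```shell
# wait_for (hypothetical helper): poll a condition command until it succeeds
# or a deadline (in seconds) passes. The deadline is only a safety bound.
wait_for() {
  timeout_seconds=$1
  shift
  deadline=$(( $(date +%s) + timeout_seconds ))
  until "$@"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "timed out after ${timeout_seconds}s waiting for: $*" >&2
      return 1
    fi
    sleep 1
  done
}

# Usage: wait for a marker file instead of a fixed `sleep 10`.
marker=$(mktemp)
wait_for 5 test -e "$marker" && echo "condition met"
rm -f "$marker"
```

The wait ends as soon as the condition holds, and a timeout produces an explicit error instead of a silent race.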

Related skills:

- `test-driven-development` - Creating failing test case (Phase 4)
- `verification-before-completion` - Verify fix before claiming success
