systematic-debugging


Systematic Debugging

Core principle: Find root cause before attempting fixes. Symptom fixes are failure.
NO FIXES WITHOUT ROOT CAUSE INVESTIGATION FIRST

Phase 1: Root Cause Investigation

BEFORE attempting ANY fix:
  1. Read Error Messages Carefully
    • Read stack traces completely
    • Note line numbers, file paths, error codes
    • Don't skip warnings
  2. Reproduce Consistently
    • What are the exact steps?
    • If not reproducible → gather more data, don't guess
  3. Check Recent Changes
    • Git diff, recent commits
    • New dependencies, config changes
    • Environmental differences
  4. Gather Evidence in Multi-Component Systems
    WHEN system has multiple components (CI → build → signing, API → service → database):
    Add diagnostic instrumentation before proposing fixes:
    For EACH component boundary:
      - Log what data enters/exits component
      - Verify environment/config propagation
      - Check state at each layer
    
    Run once to gather evidence → analyze → identify failing component
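    Example (bash):
    # Layer 1: Workflow
    echo "=== Secrets available: ==="
    # Print SET or UNSET only; never echo the secret's value into logs.
    if [ -n "${IDENTITY:-}" ]; then echo "IDENTITY: SET"; else echo "IDENTITY: UNSET"; fi
    
    # Layer 2: Build script
    # Check for presence without printing the value.
    env | grep -q '^IDENTITY=' && echo "IDENTITY in environment" || echo "IDENTITY not in environment"
    
    # Layer 3: Signing (macOS keychain)
    security find-identity -v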
  5. Trace Data Flow
    See references/root-cause-tracing.md for backward tracing technique.
    Quick version: Where does bad value originate? Trace up call chain until you find the source. Fix at source.
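
Step 2 (Reproduce Consistently) is easier to act on with a number: how often does the failure actually occur? The following is a minimal sketch, not part of the original skill. The helper name `repro_rate` is hypothetical; it runs a command a fixed number of times and reports the failure count.

```bash
#!/usr/bin/env bash
# repro_rate (hypothetical helper): run a command N times and report
# how many runs exited non-zero.
repro_rate() {
  local runs="$1"; shift
  local failures=0
  local i=1
  while [ "$i" -le "$runs" ]; do
    # Discard output; only the exit status matters here.
    if ! "$@" >/dev/null 2>&1; then
      failures=$((failures + 1))
    fi
    i=$((i + 1))
  done
  echo "${failures}/${runs} runs failed"
}
```

A call such as `repro_rate 20 ./run-tests.sh` (the script name is a placeholder) shows whether the failure is deterministic or intermittent. An intermittent result suggests gathering more data rather than guessing.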

Phase 2: Pattern Analysis

  1. Find Working Examples - Similar working code in codebase
  2. Compare Against References - Read reference implementations COMPLETELY, don't skim
  3. Identify Differences - List every difference, don't assume "that can't matter"
  4. Understand Dependencies - Components, config, environment, assumptions
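
Listing every difference between a working setup and a broken one can be mechanized when both are captured as text. The sketch below is an illustration under assumptions: `config_diff` is a hypothetical helper, and the inputs are assumed to be simple `key=value` files, such as saved `env` output.

```bash
#!/usr/bin/env bash
# config_diff (hypothetical helper): list every line that differs between
# a working and a broken key=value file, ignoring line order.
config_diff() {
  local working_sorted broken_sorted
  working_sorted="$(mktemp)"
  broken_sorted="$(mktemp)"
  sort "$1" > "$working_sorted"
  sort "$2" > "$broken_sorted"
  # diff exits 1 when the files differ; that is the expected case here.
  diff "$working_sorted" "$broken_sorted" || true
  rm -f "$working_sorted" "$broken_sorted"
}
```

Lines prefixed with `<` exist only in the working file and lines prefixed with `>` only in the broken one. Every line in the output is a candidate, including the ones that look like they can't matter.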

Phase 3: Hypothesis and Testing

  1. Form Single Hypothesis - "I think X is root cause because Y" - be specific
  2. Test Minimally - SMALLEST possible change, one variable at a time
  3. Verify - Worked → Phase 4. Didn't work → form NEW hypothesis, don't stack fixes
  4. When You Don't Know - Say so. Don't pretend.
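
One way to keep a test to a single variable is to apply exactly one override per run and leave everything else at the baseline. This is a rough sketch rather than a prescribed method. `try_one` is a hypothetical helper, and it assumes the variable under test can be set through the environment.

```bash
#!/usr/bin/env bash
# try_one (hypothetical helper): run a check command with exactly one
# environment override, given as NAME=value, and report the outcome.
# The first argument must contain "=", otherwise env treats it as the command.
try_one() {
  local assignment="$1"; shift
  if env "$assignment" "$@" >/dev/null 2>&1; then
    echo "PASS with $assignment"
  else
    echo "FAIL with $assignment"
  fi
}
```

If the run passes, the hypothesis holds and work moves on to Phase 4. If it fails, the override is dropped before the next hypothesis is tried, so changes never stack.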

Phase 4: Implementation

  1. Create Failing Test Case
    • Use the test-driven-development skill
    • MUST have before fixing
  2. Implement Single Fix
    • ONE change at a time
    • No "while I'm here" improvements
  3. Verify Fix
    • Test passes? Other tests still pass? Issue resolved?
  4. If Fix Doesn't Work
    • Count attempts
    • If < 3: Return to Phase 1 with new information
    • If ≥ 3: Escalate (below)
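
The attempt-count rule in step 4 is a simple threshold, sketched below. The function name `next_step` and its output strings are illustrative only.

```bash
#!/usr/bin/env bash
# next_step (hypothetical helper): map the number of failed fix attempts
# to the action described in step 4.
next_step() {
  local attempts="$1"
  if [ "$attempts" -ge 3 ]; then
    echo "escalate"
  else
    echo "return to Phase 1"
  fi
}
```

For example, `next_step 2` prints `return to Phase 1` and `next_step 3` prints `escalate`.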

Escalation: 3+ Failed Fixes

Pattern indicating architectural problem:
  • Each fix reveals new problems elsewhere
  • Fixes require massive refactoring
  • Shared state/coupling keeps surfacing
Action: STOP. Question fundamentals:
  • Is this pattern fundamentally sound?
  • Are we continuing through inertia?
  • Refactor architecture vs. continue fixing symptoms?
Discuss with human partner before more fix attempts. This is wrong architecture, not failed hypothesis.

Red Flags → STOP and Return to Phase 1

If you catch yourself thinking:
  • "Quick fix for now, investigate later"
  • "Just try changing X"
  • "I'll skip the test"
  • "It's probably X"
  • "Pattern says X but I'll adapt it differently"
  • Proposing solutions before tracing data flow
  • "One more fix" after 2+ failures

Human Signals You're Off Track

  • "Is that not happening?" → You assumed without verifying
  • "Will it show us...?" → You should have added evidence gathering
  • "Stop guessing" → You're proposing fixes without understanding
  • "Ultrathink this" → Question fundamentals
  • Frustrated "We're stuck?" → Your approach isn't working
Response: Return to Phase 1.

Supporting Techniques

Reference files in references/:
  • root-cause-tracing.md - Trace bugs backward through call stack
  • defense-in-depth.md - Add validation at multiple layers after finding root cause
  • condition-based-waiting.md - Replace arbitrary timeouts with condition polling
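
To illustrate what replacing an arbitrary timeout with condition polling can look like, here is a minimal sketch. It is not taken from condition-based-waiting.md. The helper name `wait_for` and the one-second poll interval are assumptions.

```bash
#!/usr/bin/env bash
# wait_for (hypothetical helper): poll a condition command until it succeeds
# or the deadline passes, instead of sleeping for a fixed, guessed duration.
wait_for() {
  local timeout_s="$1"; shift
  local deadline=$(( $(date +%s) + timeout_s ))
  until "$@"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "timed out after ${timeout_s}s waiting for: $*" >&2
      return 1
    fi
    sleep 1  # poll interval; an assumption, tune per use case
  done
}
```

A fixed `sleep 10` before checking for a file could become `wait_for 10 test -e "$path"`. That returns as soon as the file exists and fails with a message if it never appears.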
Related skills:
  • test-driven-development - Creating failing test case (Phase 4)
  • verification-before-completion - Verify fix before claiming success