Debugging

The Iron Law

Never propose a fix without first identifying the root cause. "Quick fix now, investigate later" is forbidden — it creates harder bugs.

Root Cause Analysis

Root cause identification is the core deliverable of debugging — not the fix itself. A fix without a confirmed root cause is guesswork.
  • Trace backward: Start at the symptom, walk the call chain in reverse to find where behavior diverges from expectation
  • Differential analysis: Compare working vs broken state across dimensions (code version, data, environment, timing, configuration)
  • Regression hunting: Use git bisect to pinpoint the exact commit that introduced the issue
  • Evidence-based: Document root cause with file:line references, log output, and concrete reproduction proof
  • Competing hypotheses: When the cause is ambiguous, generate multiple hypotheses and rank by evidence strength (see Escalation section below)
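
The regression-hunting bullet can be automated with `git bisect run`, which reads a script's exit code: 0 marks the commit good, 125 marks it untestable (skip), and other codes from 1 to 127 mark it bad. A minimal sketch of that mapping follows; `run_probe` and the probe functions are hypothetical names, and treating any exception as "untestable" is an assumption to adjust when a crash is itself the symptom.

```python
from typing import Callable

# Exit codes understood by `git bisect run`.
GOOD, BAD, SKIP = 0, 1, 125


def classify(testable: bool, bug_reproduces: bool) -> int:
    """Map one commit's check result to a `git bisect run` exit code."""
    if not testable:
        return SKIP  # e.g. the commit does not build; bisect tries another one
    return BAD if bug_reproduces else GOOD


def run_probe(probe: Callable[[], bool]) -> int:
    """Run a reproduction probe that returns True when the bug reproduces.

    Assumption: any exception means this commit cannot be tested at all.
    """
    try:
        return classify(True, probe())
    except Exception:
        return classify(False, False)
```

In a script (say, a hypothetical `bisect_check.py`), the last line would be `sys.exit(run_probe(my_probe))`, and the hunt would be started with `git bisect start <bad> <good>` followed by `git bisect run python bisect_check.py`.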

Process

1. Reproduce: make the bug consistent. If it is intermittent, run it N times under stress or simulate poor conditions (slow network, low memory) until it triggers reliably.
2. Investigate: trace backward through the call chain from the symptom. Add diagnostic logging at each component boundary. Compare working vs broken state using a differential table (environment, version, data, timing: what changed?).
3. Hypothesize and test: one change at a time. If a hypothesis is wrong, fully revert before testing the next. Use git bisect to find regressions efficiently.
4. Fix and verify: create a failing test FIRST, then fix. Run the test. Confirm the original reproduction case passes. No completion claims without fresh verification evidence.
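
The differential table in step 2 can be produced mechanically once both states are captured as key/value snapshots. The sketch below assumes the snapshots are plain dictionaries; the keys and values shown are invented for illustration only.

```python
from typing import Any, Dict, List, Tuple

ABSENT = "<absent>"  # marker for a dimension present in only one snapshot


def diff_states(working: Dict[str, Any], broken: Dict[str, Any]) -> List[Tuple[str, Any, Any]]:
    """Return (dimension, working_value, broken_value) rows that differ."""
    rows = []
    for key in sorted(set(working) | set(broken)):
        w = working.get(key, ABSENT)
        b = broken.get(key, ABSENT)
        if w != b:
            rows.append((key, w, b))
    return rows


# Hypothetical snapshots of a working and a broken environment.
working = {"code_version": "v1", "runtime": "3.11", "db_pool_size": 20, "tz": "UTC"}
broken = {"code_version": "v1", "runtime": "3.11", "db_pool_size": 5, "tz": "UTC",
          "feature_flag": "on"}

for dimension, w, b in diff_states(working, broken):
    print(f"{dimension:<14} working={w!r:<12} broken={b!r}")
```

Only the rows that differ are printed, which keeps the "what changed?" question focused on candidates rather than on everything that was recorded.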

Three-Fix Threshold

After 3 failed fix attempts, STOP. The problem is likely architectural, not a surface bug. Step back and question assumptions about how the system works. Read the actual code path end-to-end instead of spot-checking.

Escalation: Competing Hypotheses

When the cause is unclear across multiple components, use Analysis of Competing Hypotheses:
  • Generate hypotheses across failure modes: logic error, data issue, state problem, integration failure, resource exhaustion, environment
  • Investigate each with evidence: Direct (strong), Correlational (medium), Testimonial (weak)
  • Cite evidence with file:line references
  • Rank by confidence. If multiple hypotheses are equally supported, suspect compound causes.
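
One way to make the ranking explicit is to score each hypothesis by the strength of its evidence. The sketch below is illustrative only: the numeric weights (3/2/1) are an assumption layered on the strong/medium/weak ordering above, and the hypothesis names and file:line citations are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumed weights for the strong / medium / weak evidence classes.
WEIGHTS = {"direct": 3, "correlational": 2, "testimonial": 1}


@dataclass
class Hypothesis:
    name: str
    # Each entry is (evidence kind, citation such as a file:line reference).
    evidence: List[Tuple[str, str]] = field(default_factory=list)

    def score(self) -> int:
        return sum(WEIGHTS[kind] for kind, _citation in self.evidence)


def rank(hypotheses: List[Hypothesis]) -> List[Hypothesis]:
    """Order hypotheses from best to worst supported."""
    return sorted(hypotheses, key=lambda h: h.score(), reverse=True)


def tied_leaders(hypotheses: List[Hypothesis]) -> List[Hypothesis]:
    """Return every hypothesis sharing the top score; more than one hints at a compound cause."""
    ranked = rank(hypotheses)
    if not ranked:
        return []
    top = ranked[0].score()
    return [h for h in ranked if h.score() == top]


candidates = [
    Hypothesis("stale cache", [("direct", "cache/store.py:41"), ("testimonial", "on-call report")]),
    Hypothesis("pool exhaustion", [("correlational", "db/pool.py:17"), ("correlational", "pool metrics")]),
    Hypothesis("bad input", [("testimonial", "support ticket")]),
]
print([h.name for h in tied_leaders(candidates)])
```

Here the first two candidates tie, which is the signal to investigate them as a possible compound cause rather than picking one arbitrarily.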

Intermittent Issues

  • Track with correlation IDs across distributed components
  • Race conditions: look for shared mutable state, check-then-act patterns, missing locks. In async code (Node.js, Python asyncio): interleaved .then() chains, unguarded shared state between concurrent tasks, missing transaction isolation in DB operations
  • Deadlocks: check for circular lock acquisition (DB row locks held across multiple queries), circular await dependencies in async code, connection pool exhaustion blocking queries that would release other connections
  • Resource exhaustion: monitor memory growth, connection pool depletion, file descriptor leaks. Under load: check pool size vs concurrent request count, verify connections are returned on error paths (finally/dispose)
  • Timing-dependent: replace arbitrary sleep() with condition-based polling; wait for the actual state, not a duration
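
For the timing-dependent case, condition-based polling can be as small as the helper below. This is a minimal sketch: `wait_until` and its default timeout and interval are hypothetical choices, not part of any particular test framework.

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool], timeout: float = 5.0, interval: float = 0.05) -> bool:
    """Poll `condition` until it returns True or `timeout` seconds elapse.

    Returns True as soon as the condition holds, False on timeout.
    """
    deadline = time.monotonic() + timeout  # monotonic clock is immune to wall-clock jumps
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A call such as `wait_until(lambda: job.status == "done", timeout=10)` (with `job` standing in for whatever object is being observed) returns as soon as the state is reached, and a False result is an explicit failure instead of a silently too-short sleep.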

Defense-in-Depth Validation

After fixing, validate at every layer — not just where the bug appeared:
  • Entry: does invalid input get caught?
  • Business logic: does the fix handle edge cases?
  • Environment: does it work across configurations?
  • Instrumentation: add logging to detect recurrence
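
As an illustration of layered checks, suppose the fixed bug was a division by zero caused by a non-positive quantity. The scenario, the function names, and the logger name below are all hypothetical; the point is only that the entry layer and the business-logic layer each enforce the invariant, and the inner layer logs if it is ever violated again.

```python
import logging

logger = logging.getLogger("orders")  # hypothetical logger name


def parse_quantity(raw: str) -> int:
    """Entry layer: reject invalid input before it reaches business logic."""
    try:
        quantity = int(raw)
    except (TypeError, ValueError):
        raise ValueError(f"quantity must be an integer, got {raw!r}") from None
    if quantity <= 0:
        raise ValueError(f"quantity must be positive, got {quantity}")
    return quantity


def unit_price_cents(total_cents: int, quantity: int) -> int:
    """Business-logic layer: re-check the invariant even though entry validates it."""
    if quantity <= 0:
        # Instrumentation: a log line here means some caller bypassed entry validation.
        logger.error("unit_price_cents called with quantity=%d", quantity)
        raise ValueError("quantity must be positive")
    return total_cents // quantity
```

The duplicated check is deliberate: if a new caller skips `parse_quantity`, the inner guard fails loudly and leaves a log entry instead of reintroducing the original crash.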

Bug Triage

When multiple bugs exist, prioritize by:
  • Severity (data loss > crash > wrong output > cosmetic), assessed separately from priority (blocking release > customer-facing > internal)
  • Reproducibility: always > sometimes > once. "Sometimes" bugs need instrumentation before fixing.
  • Quick wins: if a fix is < 5 minutes and unblocks others, do it first
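
These criteria can be turned into a sort key. The sketch below assumes one particular way of combining them (quick wins first, then priority, then severity, then reproducibility); the text above keeps severity and priority as separate axes and does not prescribe this ordering, and the `Bug` fields and sample bugs are hypothetical.

```python
from dataclasses import dataclass

SEVERITY = ["data loss", "crash", "wrong output", "cosmetic"]
PRIORITY = ["blocking release", "customer-facing", "internal"]
REPRODUCIBILITY = ["always", "sometimes", "once"]


@dataclass
class Bug:
    title: str
    severity: str
    priority: str
    reproducibility: str
    est_minutes: int
    unblocks_others: bool = False


def is_quick_win(bug: Bug) -> bool:
    return bug.est_minutes < 5 and bug.unblocks_others


def needs_instrumentation(bug: Bug) -> bool:
    """'Sometimes' bugs get instrumentation before any fix is attempted."""
    return bug.reproducibility == "sometimes"


def triage_key(bug: Bug):
    # Assumed combination order; lower tuples sort first.
    return (
        not is_quick_win(bug),
        PRIORITY.index(bug.priority),
        SEVERITY.index(bug.severity),
        REPRODUCIBILITY.index(bug.reproducibility),
    )


bugs = [
    Bug("crash on export", "crash", "customer-facing", "always", 60),
    Bug("orders lost on retry", "data loss", "blocking release", "sometimes", 240),
    Bug("typo in CI config", "cosmetic", "internal", "always", 2, unblocks_others=True),
]
for bug in sorted(bugs, key=triage_key):
    note = " (instrument first)" if needs_instrumentation(bug) else ""
    print(bug.title + note)
```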

Common Patterns

  • Null/undefined access: trace where the value was expected to be set, check all code paths
  • Off-by-one: check < vs <=, array length vs last index, loop boundaries
  • Async ordering: missing await, unhandled promise rejection, callback firing before setup completes
  • Type coercion: == vs ===, string-to-number conversion, truthy/falsy edge cases
  • Timezone: always store UTC, convert at display. Check DST transitions.
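
The off-by-one pattern is easiest to see at a loop boundary. In the hypothetical `windows` helper below, the last valid start index is `len(items) - size`, so the exclusive upper bound must be one more than that.

```python
def windows(items, size):
    """Return every contiguous window of `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    # Last valid start index is len(items) - size, so the exclusive bound
    # is len(items) - size + 1. Dropping the "+ 1" silently loses the final window.
    return [items[start:start + size] for start in range(len(items) - size + 1)]
```

Boundary inputs are the useful test cases here: a list exactly `size` long should yield one window, and a shorter list should yield none.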

Anti-Patterns

  • Shotgun debugging (random changes without hypothesis) — revert and think instead
  • Multiple simultaneous changes — isolate each change or you can't learn what worked
  • Fixing the symptom not the cause — the same bug will resurface differently
  • Ignoring intermittent failures ("works on my machine") — instrument and reproduce under load instead

Postmortem

After resolving non-trivial bugs, document a lightweight postmortem:
  1. Timeline: when introduced, when detected, when resolved (include commit SHAs)
  2. Root cause: one sentence — the actual cause, not the symptom
  3. Impact: what broke, for how long, who was affected
  4. Fix: what changed and why this fix addresses the root cause
  5. Prevention: what test, monitor, or process change prevents recurrence
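
If postmortems are kept alongside the code, the five fields can be captured in a small record so that none is skipped. This is only a sketch of one possible shape; the class and field names are hypothetical, and the values are left as placeholders to be filled from the real incident.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Postmortem:
    introduced: str   # when the bug was introduced, with commit SHA
    detected: str     # when it was detected
    resolved: str     # when it was resolved, with commit SHA
    root_cause: str   # one sentence: the cause, not the symptom
    impact: str
    fix: str
    prevention: str

    def missing_fields(self) -> List[str]:
        """Names of fields that were left blank."""
        return [name for name, value in vars(self).items() if not value.strip()]

    def render(self) -> str:
        return "\n".join([
            f"1. Timeline: introduced {self.introduced}; detected {self.detected}; "
            f"resolved {self.resolved}",
            f"2. Root cause: {self.root_cause}",
            f"3. Impact: {self.impact}",
            f"4. Fix: {self.fix}",
            f"5. Prevention: {self.prevention}",
        ])
```

Checking `missing_fields()` before filing the write-up is a cheap guard against a postmortem that records the fix but omits the prevention step.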