debugging-systematic

Systematic Debugging Methodology

Core Principle

Debugging is not guessing. It is a scientific process: observe the symptom, form a hypothesis, predict the outcome of a test, and run the test. Repeat until the root cause is found.

The Scientific Debugging Method

Step 1: OBSERVE — Gather Evidence

Before changing anything, collect all available information.
Checklist:
  • Read the FULL error message, not just the first line.
  • Read the FULL stack trace. Identify the frame in YOUR code (skip framework internals).
  • Note the exact input / request / action that triggered the bug.
  • Note the expected behavior vs. the actual behavior.
  • Check logs (application logs, server logs, browser console).
  • Note the environment (OS, runtime version, dependency versions, config).
  • Check if the bug is reproducible. If not, note the frequency and conditions.
Key question: Can I reproduce this bug reliably? If not, gathering more observations is the priority.

Step 2: HYPOTHESIZE — Form a Theory

Based on the evidence, propose a specific, falsifiable explanation.
Good hypothesis format:
"The crash occurs because `user.profile` is null when the user has not completed onboarding, and line 47 accesses `user.profile.name` without a null check."
Bad hypothesis:
"Something is wrong with the user object."
Rules:
  • The hypothesis must explain ALL observed symptoms.
  • If it only explains some symptoms, it is incomplete — keep refining.
  • Generate multiple competing hypotheses and rank by likelihood.

Step 3: PREDICT — Design an Experiment

Before touching the code, predict what you will see IF your hypothesis is correct.
Examples:
  • "If the null profile hypothesis is correct, then logging `user.profile` on line 45 will print `null` for users created after January 1."
  • "If the race condition hypothesis is correct, then adding a 500ms delay before the second call will make the bug disappear."

Step 4: TEST — Run the Experiment

Execute one change at a time. Observe whether the prediction matches.
Rules:
  • Change ONE thing at a time. Multiple changes make it impossible to attribute cause.
  • Revert experiments that did not help. Do not accumulate random changes.
  • Use version control: commit or stash working state before experimenting.

Step 5: CONCLUDE

  • If the prediction matched, you have likely found the cause. Write a fix and a regression test.
  • If the prediction did not match, your hypothesis was wrong. Return to Step 2 with new information.
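
A minimal sketch of the "fix plus regression test" step, assuming the null-profile hypothesis from Step 2 was confirmed. The document's example is JavaScript; this is a Python analogue, and `get_user_name`, the dict-shaped user, and the placeholder string are all hypothetical.

```python
def get_user_name(user):
    """Return the profile name, or a placeholder when the profile is missing."""
    profile = user.get("profile")  # may be None before onboarding completes
    if profile is None:
        return "(no profile)"
    return profile["name"]


def test_user_without_profile_does_not_crash():
    # Regression test: this is the exact input that used to crash.
    assert get_user_name({"id": 1, "profile": None}) == "(no profile)"


def test_user_with_profile_returns_name():
    assert get_user_name({"id": 2, "profile": {"name": "Ada"}}) == "Ada"
```

The regression test pins the failing input from the bug report, so the same defect cannot return unnoticed.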

Binary Search Isolation

When you have no idea where the bug lives, use binary search to narrow it down.

In Code

  1. Identify the entry point and the crash / wrong output point.
  2. Add a log or assertion at the MIDPOINT of the code path.
  3. Is the state correct at the midpoint?
    • YES -> The bug is in the second half. Move your probe to the 75% point.
    • NO -> The bug is in the first half. Move your probe to the 25% point.
  4. Repeat until you have isolated the exact line or function.
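
The probing steps above can be sketched in code when the code path is a sequence of stages. This is an illustration under assumptions: the pipeline, the `first_bad_stage` helper, and the validity check are hypothetical, and the search only works if the state stays bad once it has gone bad.

```python
def first_bad_stage(stages, initial, is_valid):
    """Return the index of the first stage whose output fails is_valid.

    Assumes the input is valid, the final output is invalid, and once the
    state goes bad it stays bad (the usual bisection precondition).
    """
    def state_after(n):
        state = initial
        for stage in stages[:n]:
            state = stage(state)
        return state

    lo, hi = 0, len(stages)  # state after lo stages is good, after hi is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2  # the midpoint probe
        if is_valid(state_after(mid)):
            lo = mid  # good at the midpoint: the bug is in the second half
        else:
            hi = mid  # bad at the midpoint: the bug is in the first half
    return hi - 1


# Hypothetical four-stage pipeline; the third stage introduces negative values.
stages = [
    lambda xs: [x + 1 for x in xs],
    lambda xs: [x * 2 for x in xs],
    lambda xs: [x - 100 for x in xs],  # the bug
    lambda xs: sorted(xs),
]
bad_index = first_bad_stage(stages, [1, 2, 3], lambda xs: all(x >= 0 for x in xs))
print(bad_index)  # → 2
```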

In Time (git bisect)

bash
git bisect start
git bisect bad              # Current commit is broken
git bisect good <commit>    # This commit was known to work

# Git checks out the midpoint. Test it.
git bisect good             # or git bisect bad

# Repeat until git identifies the first bad commit.
git bisect reset
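
The manual good/bad marking can also be automated with `git bisect run <command>`, which reads the command's exit code: 0 marks the commit good, 125 skips a commit that cannot be tested, and any other code from 1 to 127 marks it bad. Below is a sketch of such a check script. The build and test commands and the `src` and `tests` paths are assumptions to be replaced with your project's own.

```python
import subprocess
import sys

SKIP = 125  # git bisect run treats exit code 125 as "cannot test this commit"


def exit_code_for(build_ok, tests_ok):
    """Map a build/test outcome to the exit code git bisect run expects."""
    if not build_ok:
        return SKIP              # untestable commit: let git pick a neighbor
    return 0 if tests_ok else 1  # 0 marks the commit good, 1 marks it bad


def run(cmd):
    """Run a command and report whether it exited with status 0."""
    return subprocess.run(cmd).returncode == 0


def main():
    # Hypothetical commands; substitute your project's own build and test steps.
    build_ok = run([sys.executable, "-m", "compileall", "-q", "src"])
    tests_ok = build_ok and run([sys.executable, "-m", "pytest", "-q", "tests"])
    return exit_code_for(build_ok, tests_ok)
```

Saved under a hypothetical name such as `bisect_check.py`, with `sys.exit(main())` as its last line, the script could be passed to `git bisect run python bisect_check.py` once the initial good and bad commits are marked.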

In Data

  1. Take the failing input dataset.
  2. Remove half the data.
  3. Does the bug still occur?
    • YES -> The bug is in the remaining half, or is independent of the removed data.
    • NO -> The trigger is in the removed half. Restore it and remove the other half.
  4. Repeat until you find the minimum failing input.
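
A rough sketch of the halving procedure above. It assumes the failure can be checked by a function `fails(items)`; that function and the sample records are hypothetical. The sketch stops early when the bug needs items from both halves, a case that calls for finer-grained removal.

```python
def shrink_failing_input(items, fails):
    """Halve a failing list repeatedly while one half still fails on its own.

    Stops when neither half fails alone, which happens when the input is a
    single item or when the bug needs items from both halves.
    """
    assert fails(items), "start from an input that reproduces the bug"
    while len(items) > 1:
        mid = len(items) // 2
        first, second = items[:mid], items[mid:]
        if fails(first):
            items = first
        elif fails(second):
            items = second
        else:
            break  # trigger spans both halves; finer-grained removal is needed
    return items


# Hypothetical failure: the parser chokes on any record containing a tab.
records = ["alice", "bob", "car\tol", "dave", "erin"]
print(shrink_failing_input(records, lambda rs: any("\t" in r for r in rs)))
# → ['car\tol']
```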

Reading Stack Traces

Anatomy of a Stack Trace

Error: Cannot read properties of null (reading 'name')     <-- Error type + message
    at getUserName (src/services/user.js:47:22)             <-- YOUR CODE (start here)
    at handleRequest (src/routes/profile.js:12:15)          <-- Caller
    at Layer.handle (node_modules/express/lib/router.js:5)  <-- Framework (skip)
    at next (node_modules/express/lib/router.js:13)         <-- Framework (skip)
Reading order:
  1. Read the error message at the top.
  2. Scan DOWN the trace to find the FIRST frame in YOUR code.
  3. Open that file at that line. This is your starting point.
  4. Read the frames above and below for context on the call chain.
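
The "find the first frame in your code" step can be approximated programmatically. The sketch below is a Python analogue of the JavaScript trace above, using the standard `traceback` module. Note that Python prints the most recent call last, so the list is walked in reverse. The `innermost_app_frame` helper and the path markers treated as third-party are assumptions.

```python
import traceback


def innermost_app_frame(exc, skip_markers=("site-packages", "dist-packages")):
    """Return the deepest frame of exc that is not inside a third-party path."""
    frames = traceback.extract_tb(exc.__traceback__)  # oldest call first
    for frame in reversed(frames):  # walk from the crash site outward
        if not any(marker in frame.filename for marker in skip_markers):
            return frame
    return None


def get_user_name(user):
    return user["profile"]["name"]  # raises TypeError when profile is None


try:
    get_user_name({"profile": None})
except TypeError as exc:
    frame = innermost_app_frame(exc)
    print(f"{frame.name} at line {frame.lineno}: {frame.line}")
```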

Common Error Patterns

| Error Message | Likely Cause |
| --- | --- |
| Cannot read properties of null/undefined | Missing null check; variable not initialized |
| TypeError: X is not a function | Wrong import; typo in method name; variable shadowing |
| RangeError: Maximum call stack size exceeded | Infinite recursion; circular reference |
| ECONNREFUSED / ETIMEDOUT | External service is down or misconfigured |
| ENOENT: no such file or directory | Wrong path; missing file; CWD is not what you expect |
| SyntaxError: Unexpected token | Malformed JSON; wrong file format; encoding issue |

Strategic Logging

What to Log

[ENTRY]  Function name, input parameters (sanitized)
[STATE]  Key variable values at decision points
[BRANCH] Which branch of a conditional was taken
[EXIT]   Return value or thrown exception
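
One way the four tags might look in practice, using Python's standard `logging` module. The `shipping_cost` function and its pricing rules are hypothetical and exist only to give the tags something to mark.

```python
import logging

logger = logging.getLogger("debug_example")


def shipping_cost(weight_kg, express=False):
    """Hypothetical function instrumented with the four tag types."""
    logger.debug("[ENTRY] shipping_cost weight_kg=%r express=%r", weight_kg, express)
    if weight_kg <= 0:
        logger.debug("[EXIT] shipping_cost raising ValueError")
        raise ValueError("weight must be positive")
    rate = 5.0 if weight_kg < 10 else 8.0
    logger.debug("[STATE] rate=%r", rate)
    if express:
        logger.debug("[BRANCH] express surcharge applied")
        cost = rate * 2
    else:
        logger.debug("[BRANCH] standard shipping")
        cost = rate
    logger.debug("[EXIT] shipping_cost -> %r", cost)
    return cost


logging.basicConfig(level=logging.DEBUG, format="%(levelname)s %(message)s")
shipping_cost(12, express=True)
```

Passing values as arguments (`%r`) rather than pre-formatting the string means the formatting work is skipped when DEBUG output is disabled.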

Logging Levels for Debugging

DEBUG  -> Detailed variable dumps (temporary, remove after fixing)
INFO   -> Flow markers ("entered function X", "querying database")
WARN   -> Unexpected but handled conditions ("retrying after timeout")
ERROR  -> Unhandled exceptions and failures

Temporary Debug Logging Pattern

python
# Add a unique tag so you can find and remove all debug logs later
print("[DEBUG-ISSUE-123] user_id:", user_id, "profile:", profile)

When done, search for `[DEBUG-ISSUE-123]` and remove all instances.

---

Common Bug Categories

1. Off-by-One Errors

  • Symptoms: Missing first/last element; array index out of bounds; fence-post errors.
  • Check: Loop bounds (`<` vs `<=`), array indices (0-based vs 1-based), string slicing (inclusive vs exclusive end).
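
A few small Python illustrations of the checks above. The list contents are arbitrary sample data.

```python
items = ["a", "b", "c", "d"]

# range(n) yields 0..n-1, so this visits every index exactly once.
visited = [items[i] for i in range(len(items))]

# Buggy variant: starting at 1 silently skips the first element.
skipped_first = [items[i] for i in range(1, len(items))]

# Slice ends are exclusive: items[0:3] stops before index 3.
first_three = items[0:3]

# Fence-post: n items have n - 1 gaps between them.
gaps = len(items) - 1

print(visited, skipped_first, first_three, gaps)
```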

2. Null / Undefined Reference

  • Symptoms: `NullPointerException`, `Cannot read properties of undefined`.
  • Check: Optional chaining, uninitialized variables, missing return statements, database queries returning no rows.

3. Race Conditions

  • Symptoms: Works sometimes but not always; fails under load; test flakiness.
  • Check: Shared mutable state, async operations without await, missing locks/mutexes, event ordering assumptions.
  • Diagnostic: Add delays (`sleep(500)`) to amplify timing windows. If the bug becomes noticeably more or less frequent, it is very likely a race.
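
The delay diagnostic can be demonstrated with a deliberately unsafe counter. This is a sketch, not a general detection tool: the `run_counter` helper is hypothetical, and Python's `time.sleep` takes seconds rather than milliseconds.

```python
import threading
import time


def run_counter(use_lock, workers=8, delay=0.01):
    """Increment a shared counter from several threads and return the total."""
    state = {"count": 0}
    lock = threading.Lock()

    def increment():
        if use_lock:
            with lock:
                current = state["count"]
                time.sleep(delay)
                state["count"] = current + 1
        else:
            current = state["count"]      # read
            time.sleep(delay)             # widened window between read and write
            state["count"] = current + 1  # write may clobber another thread's update

    threads = [threading.Thread(target=increment) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["count"]


print("without lock:", run_counter(use_lock=False))  # usually well below 8
print("with lock:   ", run_counter(use_lock=True))   # always 8
```

Without the delay the unsafe version would often appear correct, which is why widening the window is a useful diagnostic.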

4. State Corruption

  • Symptoms: Correct behavior initially, then gradually wrong; restart fixes it.
  • Check: Global/shared state, caches, singletons, mutable default arguments, in-place mutations of objects passed by reference.
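
Mutable default arguments are a compact Python example of this category. The function names below are hypothetical.

```python
def add_tag_buggy(tag, tags=[]):
    # The default list is created once, at definition time, and then reused.
    tags.append(tag)
    return tags


def add_tag_fixed(tag, tags=None):
    # A fresh list is created on every call that omits the argument.
    if tags is None:
        tags = []
    tags.append(tag)
    return tags


print(add_tag_buggy("a"))  # → ['a']
print(add_tag_buggy("b"))  # → ['a', 'b']  (state leaked from the first call)
print(add_tag_fixed("a"))  # → ['a']
print(add_tag_fixed("b"))  # → ['b']
```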

5. Resource Leaks

  • Symptoms: Slow degradation over time; eventual crash; "too many open files."
  • Check: Unclosed file handles, database connections, HTTP connections, event listeners not removed, intervals not cleared.

6. Encoding / Serialization

  • Symptoms: Garbled text, wrong characters, parsing failures.
  • Check: UTF-8 vs Latin-1, JSON/XML escaping, URL encoding, base64 padding, line endings (CRLF vs LF).
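
A short illustration of two of the checks above: decoding UTF-8 bytes with the wrong codec, and mismatched line endings. The sample strings are arbitrary.

```python
text = "café"

utf8_bytes = text.encode("utf-8")        # b'caf\xc3\xa9'
mojibake = utf8_bytes.decode("latin-1")  # wrong codec: two bytes become two characters
print(mojibake)                          # → cafÃ©

# Reversing the mistake recovers the original, which confirms the diagnosis.
repaired = mojibake.encode("latin-1").decode("utf-8")
print(repaired)                          # → café

# CRLF vs LF: the same logical line compares unequal until normalized.
print("ok\r\n" == "ok\n")                        # → False
print("ok\r\n".replace("\r\n", "\n") == "ok\n")  # → True
```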

7. Configuration / Environment

  • Symptoms: Works on one machine but not another; works locally but not in CI/production.
  • Check: Environment variables, file paths, dependency versions, OS differences, timezone, locale.

Debugging Checklists by Scenario

Test Failure

  • Read the assertion message. What was expected vs. actual?
  • Is the test itself correct? (Tests can have bugs too.)
  • Is the test isolated? Run it alone. Does it pass?
  • Are test fixtures up to date?
  • Did a recent change to shared code break this test?

Runtime Exception

  • Read the full stack trace.
  • Reproduce it with the simplest possible input.
  • Check the exact line. What variable is null/wrong?
  • Trace the variable backward. Where was it set?
  • Is there a missing validation or null check?

Wrong Output (No Error)

  • Confirm the expected output is actually correct.
  • Add logging at the input and output of the function.
  • Binary search: where does the data go wrong?
  • Check type coercion (string "1" vs integer 1).
  • Check operator precedence and short-circuit evaluation.

Performance Problem

  • Profile first, do not guess. Use a profiler tool.
  • Identify the hottest function or query.
  • Check for N+1 queries, unnecessary loops, missing indexes.
  • Check for memory leaks (heap snapshots over time).
  • Check for blocking I/O on the main thread.
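
For the "profile first" item, a minimal sketch using Python's built-in `cProfile` and `pstats` modules. The `slow_sum` workload and the `profile_report` helper are hypothetical. In a real investigation you would profile the actual entry point.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Deliberately wasteful: rebuilds a list on every iteration."""
    total = 0
    for i in range(n):
        total += sum([i] * 100)
    return total


def profile_report(func, *args, top=5):
    """Run func under cProfile and return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    func(*args)
    profiler.disable()
    buffer = io.StringIO()
    pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


print(profile_report(slow_sum, 20_000))
```

Sorting by cumulative time surfaces the call path that dominates the run, which is usually a better starting point than the function with the most calls.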

Intermittent / Flaky Bug

  • Run the test 100 times in a loop. What is the failure rate?
  • Check for race conditions (add delays to amplify).
  • Check for dependency on external state (clock, network, filesystem).
  • Check for test pollution (shared state between tests).
  • Check for non-deterministic inputs (random, timestamps, UUIDs).
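
Measuring the failure rate, as the first item suggests, can be as simple as the loop below. The `failure_rate` helper and both sample tests are hypothetical. The second test shows how a fixed seed removes one source of non-determinism.

```python
import random


def failure_rate(test, runs=100):
    """Run test repeatedly and return the fraction of runs that raised AssertionError."""
    failures = 0
    for _ in range(runs):
        try:
            test()
        except AssertionError:
            failures += 1
    return failures / runs


def flaky_test():
    # Hypothetical flaky test: depends on an unseeded random value.
    assert random.random() > 0.2


def seeded_test():
    # Fixing the seed makes the input deterministic, so the outcome is stable.
    rng = random.Random(42)
    assert 0.0 <= rng.random() < 1.0


print("flaky: ", failure_rate(flaky_test))
print("seeded:", failure_rate(seeded_test))  # → seeded: 0.0
```

A measured baseline rate also gives you a way to judge a fix: if the rate was 20% before, 100 clean runs afterward is meaningful evidence.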

Post-Mortem Template

After fixing a non-trivial bug, document what happened:

Bug Post-Mortem: [ISSUE-ID] [Short title]

Summary

One-sentence description of the bug and its impact.

Timeline

  • [timestamp] Bug reported by [who/what]
  • [timestamp] Reproduced in [environment]
  • [timestamp] Root cause identified
  • [timestamp] Fix deployed

Root Cause

Detailed technical explanation of why the bug occurred.

Fix

What was changed and why. Link to the commit/PR.

Detection

How was the bug caught? Could we have caught it earlier?

Prevention

What process or technical change would prevent this class of bug?
  • Added regression test: [link]
  • Added monitoring/alerting: [details]
  • Updated documentation: [link]
  • Other: [details]

Lessons Learned

What did we learn that applies beyond this specific bug?

---

Tools and Techniques

Rubber Duck Debugging

Explain the problem out loud (or in writing) step by step. Articulating assumptions often reveals which one is wrong.

Minimal Reproduction

Strip away everything not related to the bug:
  1. Remove unrelated code paths.
  2. Hardcode inputs instead of reading from files/databases.
  3. Replace external services with stubs.
  4. Reduce data to the smallest set that triggers the bug.
The act of creating a minimal reproduction often reveals the cause.

Diff Debugging

If it used to work:
  1. Find the last known working state (commit, release, date).
  2. Diff the code between then and now.
  3. Read each change. Could it cause the symptom?
  4. Use `git bisect` for large ranges.

Divide and Conquer with Comments

  1. Comment out half the suspicious code.
  2. Does the bug persist?
    • YES -> It is in the uncommented half.
    • NO -> It is in the commented half.
  3. Repeat until isolated.

Key Principles

  1. Reproduce first. If you cannot reproduce it, you cannot verify your fix.
  2. One change at a time. Multiple changes destroy your ability to reason about cause and effect.
  3. Read before you write. Spend more time reading code and logs than writing fixes.
  4. Trust nothing. Verify your assumptions. Print the value. Check the type. Read the documentation.
  5. The bug is in your code. It is almost never the compiler, the OS, or the framework. Check your code first.
  6. Write the regression test. A bug without a test is a bug that will return.