Debugging

The Iron Law

Never propose a fix without first identifying the root cause. "Quick fix now, investigate later" is forbidden — it creates harder bugs.

Root Cause Analysis

Root cause identification is the core deliverable of debugging — not the fix itself. A fix without a confirmed root cause is guesswork.

Trace backward: Start at the symptom, walk the call chain in reverse to find where behavior diverges from expectation
Differential analysis: Compare working vs broken state across dimensions (code version, data, environment, timing, configuration)
Regression hunting: Use `git bisect` to pinpoint the exact commit that introduced the issue
Evidence-based: Document root cause with `file:line` references, log output, and concrete reproduction proof
Competing hypotheses: When the cause is ambiguous, generate multiple hypotheses and rank by evidence strength (see Escalation section below)
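
To make the regression-hunting idea concrete, here is a minimal Python sketch of the binary search that `git bisect` automates. The commit names and the `is_bad` predicate are hypothetical stand-ins; in practice `git bisect run` would invoke a real test script and use its exit status to mark each commit good or bad.

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest bad commit.

    Assumes commits are ordered oldest to newest, the first is known good,
    the last is known bad, and every commit after the regression is bad.
    """
    good, bad = 0, len(commits) - 1  # indices of a known-good and a known-bad commit
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(commits[mid]):
            bad = mid   # regression is at mid or earlier
        else:
            good = mid  # regression is after mid
    return commits[bad]


# Hypothetical history: the regression landed in "c5".
history = [f"c{i}" for i in range(1, 9)]
broken_since = history.index("c5")
culprit = first_bad_commit(history, lambda c: history.index(c) >= broken_since)
print(culprit)  # c5
```

Eight commits are resolved in three checks, which is why bisecting tends to beat reading the log commit by commit.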

Process

1. Reproduce: make the bug consistent. If intermittent, run N times under stress or simulate poor conditions (slow network, low memory) until it triggers reliably.

2. Investigate: trace backward through the call chain from the symptom. Add diagnostic logging at each component boundary. Compare working vs broken state using a differential table (environment, version, data, timing) and ask what changed.

3. Hypothesize and test: one change at a time. If a hypothesis is wrong, fully revert before testing the next. Use `git bisect` to find regressions efficiently.

4. Fix and verify: write a failing test FIRST, then fix. Run the test and confirm it now passes. Confirm the original reproduction case no longer triggers the bug. No completion claims without fresh verification evidence.
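
The differential table from step 2 can be sketched in a few lines of Python. This is only an illustration: the dimension names and values below are hypothetical, and a real comparison would pull them from deployment metadata, config files, or logs.

```python
from typing import Any, Dict, Tuple


def diff_states(working: Dict[str, Any], broken: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return only the dimensions whose values differ, as (working, broken) pairs."""
    missing = "<missing>"  # placeholder for a dimension present on one side only
    keys = sorted(set(working) | set(broken))
    return {
        key: (working.get(key, missing), broken.get(key, missing))
        for key in keys
        if working.get(key, missing) != broken.get(key, missing)
    }


# Hypothetical snapshots of a working and a broken deployment.
working = {"version": "1.4.2", "python": "3.11", "pool_size": 20, "region": "eu"}
broken = {"version": "1.4.3", "python": "3.11", "pool_size": 5, "region": "eu"}

for dimension, (ok, bad) in diff_states(working, broken).items():
    print(f"{dimension:<10} working={ok!r:<10} broken={bad!r}")
```

Filtering out the dimensions that match leaves a short list of suspects, each of which can then be tested as a separate hypothesis in step 3.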

Three-Fix Threshold

After 3 failed fix attempts, STOP. The problem is likely architectural, not a surface bug. Step back and question assumptions about how the system works. Read the actual code path end-to-end instead of spot-checking.

Escalation: Competing Hypotheses

When the cause is unclear across multiple components, use Analysis of Competing Hypotheses:

Generate hypotheses across failure modes: logic error, data issue, state problem, integration failure, resource exhaustion, environment
Investigate each with evidence: Direct (strong), Correlational (medium), Testimonial (weak)
Cite evidence with `file:line` references
Rank by confidence. If multiple hypotheses are equally supported, suspect compound causes.
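
One way to keep the ranking honest is to record hypotheses and their evidence as data. The sketch below is an assumption-laden illustration: the numeric weights are invented (the text only orders the evidence types), and the hypotheses and references such as `cache.py:88` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

# Assumed weights: the text only orders the evidence types, it assigns no numbers.
EVIDENCE_WEIGHT = {"direct": 3, "correlational": 2, "testimonial": 1}


@dataclass
class Evidence:
    kind: str       # "direct", "correlational", or "testimonial"
    reference: str  # file:line reference, log excerpt, or report


@dataclass
class Hypothesis:
    failure_mode: str
    claim: str
    evidence: List[Evidence] = field(default_factory=list)

    def score(self) -> int:
        return sum(EVIDENCE_WEIGHT[e.kind] for e in self.evidence)


def rank(hypotheses: List[Hypothesis]) -> List[Hypothesis]:
    """Return hypotheses with the strongest-supported one first."""
    return sorted(hypotheses, key=lambda h: h.score(), reverse=True)


# Hypothetical hypotheses and references, for illustration only.
candidates = [
    Hypothesis("resource exhaustion", "pool runs dry under load",
               [Evidence("correlational", "metrics: pool wait time spikes with errors")]),
    Hypothesis("state problem", "cache entry is mutated by two tasks",
               [Evidence("direct", "cache.py:88"), Evidence("testimonial", "user report")]),
]
ranked = rank(candidates)
for h in ranked:
    print(h.score(), h.failure_mode, "-", h.claim)

scores = [h.score() for h in ranked]
if len(scores) > 1 and scores[0] == scores[1]:
    print("Tie at the top: consider a compound cause.")
```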

Intermittent Issues

Track with correlation IDs across distributed components
Race conditions: look for shared mutable state, check-then-act patterns, missing locks. In async code (Node.js, Python asyncio): interleaved `.then()` chains, unguarded shared state between concurrent tasks, missing transaction isolation in DB operations
Deadlocks: check for circular lock acquisition (DB row locks held across multiple queries), circular `await` dependencies in async code, connection pool exhaustion blocking queries that would release other connections
Resource exhaustion: monitor memory growth, connection pool depletion, file descriptor leaks. Under load: check pool size vs concurrent request count, verify connections are returned on error paths (finally/dispose)
Timing-dependent: replace arbitrary `sleep()` with condition-based polling; wait for the actual state, not a duration
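
For the timing-dependent case, a condition-based wait might look like the following Python sketch. The helper name `wait_until` and its default timeout and interval are assumptions chosen for illustration, not part of any particular library.

```python
import threading
import time
from typing import Callable


def wait_until(condition: Callable[[], bool], timeout: float = 5.0, interval: float = 0.05) -> bool:
    """Poll `condition` until it is true or `timeout` seconds elapse.

    Returns True if the condition was met, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)  # short pause between checks, not a guess at the total duration


# Hypothetical usage: a background worker signals readiness at some unknown point.
ready = threading.Event()
threading.Timer(0.1, ready.set).start()

ok = wait_until(ready.is_set, timeout=2.0)
print("worker ready:", ok)
```

The call returns as soon as the state is reached, and the timeout turns a silent hang into an explicit failure that can be logged.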

Defense-in-Depth Validation

After fixing, validate at every layer — not just where the bug appeared:

Entry: does invalid input get caught?
Business logic: does the fix handle edge cases?
Environment: does it work across configurations?
Instrumentation: add logging to detect recurrence

Bug Triage

When multiple bugs exist, prioritize by:

Severity (data loss > crash > wrong output > cosmetic) separately from Priority (blocking release > customer-facing > internal)
Reproducibility: always > sometimes > once. "Sometimes" bugs need instrumentation before fixing.
Quick wins: if a fix takes under 5 minutes and unblocks others, do it first
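
The triage rules above can be expressed as a sort key. This Python sketch makes one assumption the text does not state: after quick wins, it orders by priority, then severity, then reproducibility. The bug entries are hypothetical.

```python
from dataclasses import dataclass
from typing import List

# Rank tables follow the orderings in the text; a lower number sorts first.
SEVERITY = {"data loss": 0, "crash": 1, "wrong output": 2, "cosmetic": 3}
PRIORITY = {"blocking release": 0, "customer-facing": 1, "internal": 2}
REPRODUCIBILITY = {"always": 0, "sometimes": 1, "once": 2}


@dataclass
class Bug:
    title: str
    severity: str
    priority: str
    reproducibility: str
    minutes_to_fix: int = 60
    unblocks_others: bool = False

    def is_quick_win(self) -> bool:
        return self.minutes_to_fix < 5 and self.unblocks_others


def triage(bugs: List[Bug]) -> List[Bug]:
    """Quick wins first, then priority, severity, reproducibility (assumed order)."""
    return sorted(
        bugs,
        key=lambda b: (
            not b.is_quick_win(),  # False sorts before True, so quick wins lead
            PRIORITY[b.priority],
            SEVERITY[b.severity],
            REPRODUCIBILITY[b.reproducibility],
        ),
    )


# Hypothetical backlog.
backlog = [
    Bug("typo in footer", "cosmetic", "internal", "always"),
    Bug("export drops rows", "data loss", "customer-facing", "sometimes"),
    Bug("CI config path wrong", "wrong output", "internal", "always",
        minutes_to_fix=3, unblocks_others=True),
    Bug("startup crash", "crash", "blocking release", "always"),
]
ordered = [b.title for b in triage(backlog)]
print(ordered)
```

Keeping severity and priority as separate fields preserves the distinction the text draws: a data-loss bug can still rank below a release blocker.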

Common Patterns

Null/undefined access: trace where the value was expected to be set, check all code paths
Off-by-one: check `<` vs `<=`, array length vs last index, loop boundaries
Async ordering: missing `await`, unhandled promise rejection, callback firing before setup completes
Type coercion: `==` vs `===`, string-to-number conversion, truthy/falsy edge cases
Timezone: always store UTC, convert at display. Check DST transitions.
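
As a small illustration of the off-by-one pattern, the Python sketch below shows a loop boundary that silently drops the last element group, next to the corrected version. The function names and the sliding-window task are hypothetical examples, not taken from any specific codebase.

```python
from typing import List


def window_sums_buggy(values: List[int], size: int) -> List[int]:
    # Bug: the range stops one start index early, so the final window is skipped.
    return [sum(values[i:i + size]) for i in range(len(values) - size)]


def window_sums(values: List[int], size: int) -> List[int]:
    # Valid start indices run from 0 through len(values) - size inclusive.
    return [sum(values[i:i + size]) for i in range(len(values) - size + 1)]


data = [1, 2, 3, 4]
print(window_sums_buggy(data, 2))  # [3, 5]
print(window_sums(data, 2))        # [3, 5, 7]
```

A test that checks the last element of the result, and the case where the window equals the full list, tends to catch this class of bug quickly.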

Anti-Patterns

Shotgun debugging (random changes without hypothesis) — revert and think instead
Multiple simultaneous changes — isolate each change or you can't learn what worked
Fixing the symptom not the cause — the same bug will resurface differently
Ignoring intermittent failures ("works on my machine") — instrument and reproduce under load instead

Postmortem

After resolving non-trivial bugs, document a lightweight postmortem:

Timeline: when introduced, when detected, when resolved (include commit SHAs)
Root cause: one sentence — the actual cause, not the symptom
Impact: what broke, for how long, who was affected
Fix: what changed and why this fix addresses the root cause
Prevention: what test, monitor, or process change prevents recurrence
