systematic-debugging
Systematic Debugging
When to Use
Invoke this methodology automatically when:
- A test fails and the cause isn't immediately obvious
- Unexpected behavior occurs in production or development
- An error message doesn't directly point to the fix
- Multiple potential causes exist
Core Principles
- Hypothesize before acting: form explicit hypotheses about the root cause before changing code
- Test hypotheses systematically: validate or eliminate each hypothesis with evidence
- Parallelize investigation: use subagents for concurrent read-only exploration
- Preserve test integrity: never weaken tests to make them pass
Debugging Scope Ladder
Always prefer the smallest, most reproducible scope that demonstrates the bug. Work up the ladder only when the smaller scope can't reproduce or doesn't apply:
| Priority | Scope | When to Use | Command |
|---|---|---|---|
| 1 | Unit test | Logic errors, algorithm bugs, single-function issues | |
| 2 | Mocked unit test | Transport/ring logic needing isolation | Unit test with |
| 3 | Simulation test | Multi-node behavior, state machines, race conditions | |
| 4 | SimNetwork + FaultConfig | Fault tolerance, message loss, network partitions | SimNetwork with configured fault injection |
| 5 | fdev single-process | Quick multi-peer CI validation | |
| 6 | freenet-test-network | 20+ peer large-scale behavior | Docker-based |
| 7 | Real network | Issues that only manifest with real UDP/NAT/latency | Manual multi-peer test across machines |
Why this order matters:
- Lower scopes are faster, deterministic, and reproducible by anyone
- Higher scopes require more infrastructure and time, and may not be accessible to all contributors
- Gateway logs, aggregate telemetry, and production metrics are not available to every developer — don't assume access to these when designing reproduction steps
Debugging Workflow
Phase 1: Reproduce and Isolate
- Reproduce the failure — Confirm the bug exists and is reproducible
- Use the scope ladder — Start at the smallest scope that can demonstrate the bug:
  - Can you write a unit test? Try that first
  - Needs multiple nodes? Use the simulation framework with a deterministic seed
  - Only happens under fault conditions? Use `SimNetwork` with `FaultConfig`
  - Can't reproduce in simulation? Then escalate to real network testing
- Record the seed — When using simulation tests, always record the seed value for reproducibility
- Gather initial evidence — Read error messages, logs, stack traces
Simulation-first approach for distributed bugs:
```bash
# Run simulation tests deterministically
cargo test -p freenet --features simulation_tests --test sim_network -- --test-threads=1

# With logging to observe event sequences
RUST_LOG=info cargo test -p freenet --features simulation_tests --test sim_network -- --nocapture --test-threads=1

# Reproduce with a specific seed
cargo run -p fdev -- test --seed 0xDEADBEEF single-process
```
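Why the seed is worth recording can be shown with a small, self-contained sketch. This is not Freenet's simulation code: `SplitMix64` and `simulate_delivery_order` are hypothetical stand-ins for a seeded simulation, chosen so the example needs no external crates. Given the same seed, the simulated delivery order is identical on every run, so a recorded seed is enough to replay a failure.

```rust
/// Tiny SplitMix64 generator, used only so this sketch has no dependencies.
struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    fn new(seed: u64) -> Self {
        Self { state: seed }
    }

    fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Hypothetical stand-in for a simulated network: returns the order in which
/// `message_count` messages are delivered, decided entirely by `seed`.
fn simulate_delivery_order(seed: u64, message_count: usize) -> Vec<usize> {
    let mut rng = SplitMix64::new(seed);
    let mut order: Vec<usize> = (0..message_count).collect();
    // Fisher-Yates shuffle driven by the seeded generator.
    for i in (1..order.len()).rev() {
        let j = (rng.next_u64() % (i as u64 + 1)) as usize;
        order.swap(i, j);
    }
    order
}
```

Calling `simulate_delivery_order(0xDEADBEEF, 8)` twice returns the same vector both times. A generator seeded from the wall clock would not give that guarantee.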
Phase 2: Form Hypotheses
Before touching any code, explicitly list potential causes:
Hypotheses:
1. [Most likely] The X component isn't handling Y case
2. [Possible] Race condition between A and B
3. [Less likely] Configuration mismatch in Z

Rank by likelihood based on evidence. Avoid anchoring on the first idea.
Freenet-specific hypothesis patterns:
- State machine bugs: Invalid transitions in operations (CONNECT, GET, PUT, UPDATE, SUBSCRIBE)
- Ring/routing errors: Incorrect peer selection, distance calculations, topology issues
- Transport issues: UDP packet loss handling, encryption/decryption, connection lifecycle
- Contract execution: WASM sandbox issues, state verification failures
- Determinism violations: Code using `std::time::Instant::now()` instead of `TimeSource`, or `rand::random()` instead of `GlobalRng`
- Silent failure / fire-and-forget: Spawned task dies with no error propagation (check: is the `JoinHandle` stored and polled? what happens if the task exits?), broadcast sent to zero targets with no warning, channel overflow silently dropping messages. Look for: `tokio::spawn` without `.await`/`.abort()`, `let _ = sender.send()`, missing logging on empty target sets
- Resource exhaustion: HashMap/Vec/channel entries inserted but never removed, causing unbounded memory growth or channel backpressure. Check: is there a cleanup path for every insert? Is cleanup triggered on both success AND failure/timeout? Run sustained operations and assert collection sizes stay bounded
- Incomplete wiring: Feature only works for some operation types (e.g., router feedback wired for GET but not subscribe/put/update). When debugging "X doesn't work for operation Y," check all enum variants in the dispatch path: commented-out arms, `_ => Irrelevant` catch-alls, and missing match arms are common
- TTL/timing race conditions: Two time-dependent operations where the first can expire before the second completes (e.g., transient TTL expires before CONNECT handshake, interest TTL expires before subscription renewal, broadcast fires before subscriptions complete). Check: what happens if operation A takes longer than timeout B?
- Regressions from "safe" changes: A seemingly harmless change (code simplification, removing a feature flag, changing defaults) breaks an invariant that nothing tests. When a recent commit looks innocent, check what implicit behaviors it removed
- Mock/test divergence: Bug can't be reproduced in tests because the mock runtime behaves differently from production. Check: does the mock skip side effects (e.g., BSC emission)? Does the test use a different code path than production (e.g., explicit subscribe vs background subscribe)? Does the mock socket behave differently from real UDP?
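The silent failure pattern is easier to spot once the silent and the observed variants sit side by side. The sketch below is an illustration only: it uses `std::thread` and `std::sync::mpsc` as stand-ins for tokio tasks and channels, and the function names are hypothetical rather than taken from the Freenet codebase.

```rust
use std::sync::mpsc;
use std::thread;

/// Fire-and-forget: the send result is discarded, so a closed channel
/// goes completely unnoticed by the caller.
fn notify_silently(sender: &mpsc::Sender<u32>, value: u32) {
    let _ = sender.send(value);
}

/// Same send, but a failure is surfaced instead of swallowed.
fn notify_checked(sender: &mpsc::Sender<u32>, value: u32) -> Result<(), String> {
    sender
        .send(value)
        .map_err(|e| format!("receiver gone, dropped message {}", e.0))
}

/// Spawns a worker and keeps its JoinHandle, so both an error return
/// and a panic inside the worker reach the caller.
fn run_worker_and_observe(input: i32) -> Result<i32, String> {
    let handle = thread::spawn(move || -> Result<i32, String> {
        if input < 0 {
            Err("negative input".to_string())
        } else {
            Ok(input * 2)
        }
    });
    match handle.join() {
        Ok(result) => result,
        Err(_) => Err("worker panicked".to_string()),
    }
}
```

With the receiver dropped, `notify_silently` returns normally while `notify_checked` returns an `Err`. When forming hypotheses, that difference is the thing to search for.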
See Module-Specific Debugging Guide for detailed bug patterns, data collection strategies, and test approaches per module.
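For the resource exhaustion pattern, one way to test "cleanup on every path" is to drive sustained operations and assert that the collection stays bounded. The following is a minimal sketch under assumptions: `PendingOps`, `Outcome`, and `peak_in_flight` are hypothetical names invented for this example, not Freenet types.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq)]
enum Outcome {
    Success,
    Failure,
    Timeout,
}

/// Hypothetical tracker for in-flight operations keyed by ID.
#[derive(Default)]
struct PendingOps {
    pending: HashMap<u64, &'static str>,
}

impl PendingOps {
    fn start(&mut self, id: u64, description: &'static str) {
        self.pending.insert(id, description);
    }

    /// Single exit point: success, failure, and timeout all remove the entry.
    fn finish(&mut self, id: u64, outcome: Outcome) -> Option<(&'static str, Outcome)> {
        self.pending.remove(&id).map(|description| (description, outcome))
    }

    fn in_flight(&self) -> usize {
        self.pending.len()
    }
}

/// Runs `rounds` operations with mixed outcomes and reports the largest
/// number of entries ever held at once.
fn peak_in_flight(ops: &mut PendingOps, rounds: u64) -> usize {
    let mut peak = 0;
    for id in 0..rounds {
        ops.start(id, "simulated operation");
        peak = peak.max(ops.in_flight());
        let outcome = match id % 3 {
            0 => Outcome::Success,
            1 => Outcome::Failure,
            _ => Outcome::Timeout,
        };
        ops.finish(id, outcome);
    }
    peak
}
```

If `finish` removed entries only on `Outcome::Success`, the peak would grow with the number of rounds instead of staying at one. That growth is the signal a bounded-size assertion is meant to catch.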
Phase 3: Investigate Systematically
For each hypothesis:
- Identify what evidence would confirm or refute it
- Gather that evidence (logs, code reading, adding debug output)
- Update hypothesis ranking based on findings
- Move to next hypothesis if current one is eliminated
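The loop above can be sketched as a small data structure. The `Hypothesis` type, the 0 to 100 likelihood score, and the function names are illustrative assumptions, not part of any Freenet tooling. The point is only that ranking and elimination are explicit steps, not something tracked in one's head.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Status {
    Open,
    Confirmed,
    Eliminated,
}

#[derive(Debug, Clone)]
struct Hypothesis {
    description: String,
    /// Subjective score from 0 to 100, revised as evidence arrives.
    likelihood: u8,
    status: Status,
}

impl Hypothesis {
    fn new(description: &str, likelihood: u8) -> Self {
        Self {
            description: description.to_string(),
            likelihood,
            status: Status::Open,
        }
    }
}

/// Applies one piece of evidence: revises the score and, if the evidence
/// is decisive, closes the hypothesis.
fn record_evidence(h: &mut Hypothesis, new_likelihood: u8, verdict: Status) {
    h.likelihood = new_likelihood;
    h.status = verdict;
}

/// Picks the open hypothesis with the highest current likelihood.
fn next_to_investigate(hypotheses: &[Hypothesis]) -> Option<&Hypothesis> {
    hypotheses
        .iter()
        .filter(|h| h.status == Status::Open)
        .max_by_key(|h| h.likelihood)
}
```

Once the top-ranked hypothesis is marked `Status::Eliminated`, `next_to_investigate` moves on to the next most likely open one. That mirrors the last step in the list.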
Freenet-specific data gathering:
| What You Need | How to Get It | Access |
|---|---|---|
| Event sequences | | Everyone |
| Network message patterns | | Everyone |
| Convergence behavior | | Everyone |
| Virtual time state | | Everyone |
| Git history of affected code | | Everyone |
| Fault injection results | SimNetwork + FaultConfig, then inspect stats | Everyone |
| Gateway logs | Access to running gateway node | Limited — not all contributors |
| Aggregate telemetry | Production monitoring dashboards | Limited — core team only |
| Real network packet captures | Physical access to test machines | Limited — specific environments |
For module-specific data gathering techniques, see the Module-Specific Debugging Guide, which covers observation APIs, `#[freenet_test]` event capture, `RUST_LOG` targets, and fault injection per module.
Parallel investigation with subagents:
Use `general-purpose` agents with `codebase-investigator` instructions for independent, read-only investigations. Spawn multiple in parallel, each with a specific focus.
Spawn investigators in parallel using Task tool (subagent_type="general-purpose"):
1. "You are a codebase-investigator. [Include agents/codebase-investigator.md instructions]
Search for similar error handling patterns in the codebase related to [bug description]"
2. "You are a codebase-investigator. [Include agents/codebase-investigator.md instructions]
Check git history for recent changes to [affected module/files]"
3. "You are a codebase-investigator. [Include agents/codebase-investigator.md instructions]
Read and analyze [test file] and related fixtures for [component]"
Guidelines:
- Each investigator focuses on one hypothesis or evidence type
- Only parallelize readonly tasks — code changes must be sequential
- Investigators report findings; you synthesize and decide next steps
Phase 4: Fix and Verify
- Fix the root cause: not symptoms
- Verify with deterministic reproduction: re-run the failing test with the same seed
- Check for regressions: `cargo test -p freenet`
- Consider edge cases: does the fix handle similar scenarios?
- Verify determinism: if you added new code, ensure it uses `TimeSource` and `GlobalRng` (not `std::time`/`rand` directly)
Phase 5: Test Coverage Analysis
Always ask: "Why didn't CI catch this?"
Freenet has multiple test layers:
| Layer | Scope | What It Catches |
|---|---|---|
| Unit tests (~1000) | Individual functions | Logic errors, algorithm bugs |
| Integration tests (~80) | Component interactions | Interface mismatches, data flow bugs |
| Simulation tests | Multi-node deterministic | State machine bugs, race conditions, protocol errors |
| fdev single-process | Quick multi-peer | Basic distributed behavior |
| freenet-test-network | 20+ peers in Docker | Scale-dependent bugs, realistic network behavior |
| Real network tests | Physical machines | NAT traversal, real latency, UDP behavior |
If a bug reached production or manual testing, there's a gap. Investigate:
- Which test layer should have caught this?
  - Logic error → unit test
  - Component interaction bug → integration test
  - Distributed/state machine behavior → simulation test with `#[freenet_test]`
  - Fault tolerance → SimNetwork with FaultConfig
  - Scale-dependent → freenet-test-network
- Why didn't the existing tests catch it?
  - Tests use different topology/configuration than production
  - Tests mock components that exhibit the bug in real usage
  - Simulation doesn't inject the right fault conditions
  - Test assertions too weak to detect the failure
  - Determinism violation: code path bypasses `TimeSource`/`GlobalRng`
- Document the gap. Include in the issue/PR:
  - What test would have caught this
  - Why existing tests didn't
  - Whether a new test should be added to prevent regression
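The mapping from bug class to test layer, together with the gap note for the issue or PR, can be written down as a small sketch. `BugClass`, `expected_layer`, and `TestGap` are hypothetical names invented here. The layer strings are copied from the list above. The `match` has no catch-all arm, so adding a new bug class without assigning it a layer is a compile error.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum BugClass {
    LogicError,
    ComponentInteraction,
    DistributedBehavior,
    FaultTolerance,
    ScaleDependent,
}

/// Returns the test layer that should have caught a bug of this class.
fn expected_layer(bug: BugClass) -> &'static str {
    match bug {
        BugClass::LogicError => "unit test",
        BugClass::ComponentInteraction => "integration test",
        BugClass::DistributedBehavior => "simulation test with #[freenet_test]",
        BugClass::FaultTolerance => "SimNetwork with FaultConfig",
        BugClass::ScaleDependent => "freenet-test-network",
    }
}

/// The three facts to record about a test gap.
struct TestGap {
    bug: BugClass,
    why_missed: String,
    add_regression_test: bool,
}

impl TestGap {
    /// Renders the gap as text suitable for pasting into an issue or PR.
    fn to_note(&self) -> String {
        format!(
            "Should have been caught by: {}\nWhy existing tests missed it: {}\nAdd regression test: {}",
            expected_layer(self.bug),
            self.why_missed,
            if self.add_regression_test { "yes" } else { "no" }
        )
    }
}
```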
Anti-Patterns to Avoid
Jumping to conclusions
- Wrong: See error, immediately change code that seems related
- Right: Form hypothesis, gather evidence, then act
Tunnel vision
- Wrong: Spend hours on one theory despite contradicting evidence
- Right: Set time bounds, pivot when evidence points elsewhere
Weakening tests
- Wrong: Test fails, reduce assertions or add exceptions to make it pass
- Right: Understand why the test expects what it does, fix the code to meet that expectation
- Exception: The test itself has a bug or tests incorrect behavior (rare, requires clear justification)
Sequential investigation when parallel is possible
- Wrong: Read file A, wait, read file B, wait, read file C
- Right: Spawn `codebase-investigator` agents to read A, B, C concurrently, then synthesize findings
Fixing without understanding
- Wrong: Copy a fix from Stack Overflow that makes the error go away
- Right: Understand why the fix works and whether it addresses root cause
Skipping the scope ladder
- Wrong: Jump straight to real network debugging when the bug could be reproduced in a unit test
- Right: Start small — unit test, then simulation, then real network
Breaking determinism
- Wrong: Use `std::time::Instant::now()` or `rand::random()` in core logic
- Right: Use the `TimeSource` trait and `GlobalRng` so simulation tests remain reproducible
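To illustrate what injecting a time source buys, here is a minimal sketch with a hand-rolled clock trait. The `Clock` trait, `VirtualClock`, and `TransientEntry` are assumptions made up for this example. Freenet's actual `TimeSource` trait may have a different shape. Because the entry asks the injected clock for the time and never calls `Instant::now()`, a test can step through a TTL expiry exactly, with no sleeping and no flakiness.

```rust
use std::cell::Cell;
use std::time::Duration;

/// Hypothetical clock abstraction; not Freenet's real `TimeSource`.
trait Clock {
    /// Time elapsed since the clock was created.
    fn now(&self) -> Duration;
}

/// Virtual clock for tests: time moves only when `advance` is called.
struct VirtualClock {
    elapsed: Cell<Duration>,
}

impl VirtualClock {
    fn new() -> Self {
        Self { elapsed: Cell::new(Duration::ZERO) }
    }

    fn advance(&self, by: Duration) {
        self.elapsed.set(self.elapsed.get() + by);
    }
}

impl Clock for VirtualClock {
    fn now(&self) -> Duration {
        self.elapsed.get()
    }
}

/// An entry that expires `ttl` after creation, judged by the injected clock.
struct TransientEntry {
    created_at: Duration,
    ttl: Duration,
}

impl TransientEntry {
    fn new(clock: &dyn Clock, ttl: Duration) -> Self {
        Self { created_at: clock.now(), ttl }
    }

    fn is_expired(&self, clock: &dyn Clock) -> bool {
        clock.now().saturating_sub(self.created_at) >= self.ttl
    }
}
```

With a 5 second TTL, the entry is still live after advancing the virtual clock by 4 seconds and expired after one more second. The same check against the system clock would depend on how fast the test machine happens to run.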
Assuming data access
- Wrong: "Check the gateway logs to see what happened" (not everyone has gateway access)
- Right: Design reproduction steps using simulation tests and `RUST_LOG` that any contributor can run
Checklist Before Declaring "Fixed"
- Root cause identified and understood
- Fix addresses root cause, not symptoms
- Original failure no longer reproduces
- No new test failures introduced
- Test added if one didn't exist (when practical)
- No test assertions weakened or disabled
- Answered "why didn't CI catch this?" and documented the test gap