
Systematic Debugging

When to Use

Invoke this methodology automatically when:
  • A test fails and the cause isn't immediately obvious
  • Unexpected behavior occurs in production or development
  • An error message doesn't directly point to the fix
  • Multiple potential causes exist

Core Principles

  1. Hypothesize before acting - Form explicit hypotheses about root cause before changing code
  2. Test hypotheses systematically - Validate or eliminate each hypothesis with evidence
  3. Parallelize investigation - Use subagents for concurrent read-only exploration
  4. Preserve test integrity - Never weaken tests to make them pass

Debugging Scope Ladder

Always prefer the smallest, most reproducible scope that demonstrates the bug. Work up the ladder only when the smaller scope can't reproduce or doesn't apply:
| Priority | Scope | When to Use | Command |
|---|---|---|---|
| 1 | Unit test | Logic errors, algorithm bugs, single-function issues | `cargo test -p freenet -- specific_test` |
| 2 | Mocked unit test | Transport/ring logic needing isolation | Unit test with `MockNetworkBridge`/`MockRing` |
| 3 | Simulation test | Multi-node behavior, state machines, race conditions | `cargo test -p freenet --test simulation_integration -- --test-threads=1` |
| 4 | SimNetwork + FaultConfig | Fault tolerance, message loss, network partitions | SimNetwork with configured fault injection |
| 5 | fdev single-process | Quick multi-peer CI validation | `cargo run -p fdev -- test --seed 42 single-process` |
| 6 | freenet-test-network | 20+ peer large-scale behavior | Docker-based `freenet-test-network` |
| 7 | Real network | Issues that only manifest with real UDP/NAT/latency | Manual multi-peer test across machines |
Why this order matters:
  • Lower scopes are faster, deterministic, and reproducible by anyone
  • Higher scopes require more infrastructure, time, and may not be accessible to all contributors
  • Gateway logs, aggregate telemetry, and production metrics are not available to every developer — don't assume access to these when designing reproduction steps

Debugging Workflow

Phase 1: Reproduce and Isolate

  1. Reproduce the failure — Confirm the bug exists and is reproducible
  2. Use the scope ladder — Start at the smallest scope that can demonstrate the bug:
    • Can you write a unit test? Try that first
    • Needs multiple nodes? Use the simulation framework with a deterministic seed
    • Only happens under fault conditions? Use `SimNetwork` with `FaultConfig`
    • Can't reproduce in simulation? Then escalate to real network testing
  3. Record the seed — When using simulation tests, always record the seed value for reproducibility
  4. Gather initial evidence — Read error messages, logs, stack traces
Simulation-first approach for distributed bugs:

```bash
# Run simulation tests deterministically
cargo test -p freenet --features simulation_tests --test sim_network -- --test-threads=1

# With logging to observe event sequences
RUST_LOG=info cargo test -p freenet --features simulation_tests --test sim_network -- --nocapture --test-threads=1

# Reproduce with a specific seed
cargo run -p fdev -- test --seed 0xDEADBEEF single-process
```

Phase 2: Form Hypotheses

Before touching any code, explicitly list potential causes:
Hypotheses:
1. [Most likely] The X component isn't handling Y case
2. [Possible] Race condition between A and B
3. [Less likely] Configuration mismatch in Z
Rank by likelihood based on evidence. Avoid anchoring on the first idea.
Freenet-specific hypothesis patterns:
  • State machine bugs — Invalid transitions in operations (CONNECT, GET, PUT, UPDATE, SUBSCRIBE)
  • Ring/routing errors — Incorrect peer selection, distance calculations, topology issues
  • Transport issues — UDP packet loss handling, encryption/decryption, connection lifecycle
  • Contract execution — WASM sandbox issues, state verification failures
  • Determinism violations — Code using `std::time::Instant::now()` instead of `TimeSource`, or `rand::random()` instead of `GlobalRng`
  • Silent failure / fire-and-forget — Spawned task dies with no error propagation (check: is the `JoinHandle` stored and polled? what happens if the task exits?), broadcast sent to zero targets with no warning, channel overflow silently dropping messages. Look for: `tokio::spawn` without `.await`/`.abort()`, `let _ = sender.send()`, missing logging on empty target sets
  • Resource exhaustion — HashMap/Vec/channel entries inserted but never removed, causing unbounded memory growth or channel backpressure. Check: is there a cleanup path for every insert? Is cleanup triggered on both success AND failure/timeout? Run sustained operations and assert collection sizes stay bounded
  • Incomplete wiring — Feature only works for some operation types (e.g., router feedback wired for GET but not subscribe/put/update). When debugging "X doesn't work for operation Y," check all enum variants in the dispatch path — commented-out arms, `_ => Irrelevant` catch-alls, and missing match arms are common
  • TTL/timing race conditions — Two time-dependent operations where the first can expire before the second completes (e.g., transient TTL expires before CONNECT handshake, interest TTL expires before subscription renewal, broadcast fires before subscriptions complete). Check: what happens if operation A takes longer than timeout B?
  • Regressions from "safe" changes — A seemingly harmless change (code simplification, removing a feature flag, changing defaults) breaks an invariant that nothing tests. When a recent commit looks innocent, check what implicit behaviors it removed
  • Mock/test divergence — Bug can't be reproduced in tests because the mock runtime behaves differently from production. Check: does the mock skip side effects (e.g., BSC emission)? Does the test use a different code path than production (e.g., explicit subscribe vs background subscribe)? Does the mock socket behave differently from real UDP?
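The fire-and-forget hazard described above can be reduced to a minimal sketch. It uses std threads rather than tokio so it runs standalone; `spawn_worker` is a hypothetical name, and Freenet's real task handling differs:

```rust
use std::thread;

// A worker whose result is a Result: if the caller discards the
// JoinHandle, an Err here vanishes without a trace.
fn spawn_worker(fail: bool) -> thread::JoinHandle<Result<u32, String>> {
    thread::spawn(move || {
        if fail {
            return Err("worker exited early".to_string());
        }
        Ok(42)
    })
}

fn main() {
    // Fire-and-forget: the handle is dropped, so the failure is silently lost.
    let _ = spawn_worker(true);

    // Stored and joined: the failure propagates and can be logged.
    let handle = spawn_worker(true);
    match handle.join() {
        Ok(Err(e)) => println!("worker failed: {e}"),
        Ok(Ok(v)) => println!("worker produced {v}"),
        Err(_) => println!("worker panicked"),
    }
}
```

Joining (or, with tokio, storing the `JoinHandle` and awaiting or aborting it) is what turns a silent death into an observable error.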
See Module-Specific Debugging Guide for detailed bug patterns, data collection strategies, and test approaches per module.
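The resource-exhaustion check above ("is there a cleanup path for every insert?") can be demonstrated with a minimal sketch; `PendingOps` is a hypothetical type for illustration, not a Freenet API:

```rust
use std::collections::HashMap;

// Tracks in-flight operations. The invariant under test: every start()
// is paired with a finish(), on success AND on failure/timeout.
struct PendingOps {
    pending: HashMap<u64, String>,
}

impl PendingOps {
    fn new() -> Self {
        Self { pending: HashMap::new() }
    }

    fn start(&mut self, id: u64, desc: &str) {
        self.pending.insert(id, desc.to_string());
    }

    // Single cleanup path shared by all outcomes, so none can leak an entry.
    fn finish(&mut self, id: u64) -> Option<String> {
        self.pending.remove(&id)
    }

    fn len(&self) -> usize {
        self.pending.len()
    }
}

fn main() {
    let mut ops = PendingOps::new();
    // Sustained operations: both outcome branches go through finish().
    for id in 0..10_000u64 {
        ops.start(id, "GET");
        if id % 3 == 0 {
            // simulated failure/timeout path: still cleans up
            ops.finish(id);
        } else {
            // success path
            ops.finish(id);
        }
    }
    // The assertion a leak-detection test would make: size stays bounded.
    assert_eq!(ops.len(), 0);
    println!("pending map stayed bounded: {}", ops.len());
}
```

A leak shows up when one branch (typically the failure/timeout path) skips `finish`, and a sustained-load test asserting the final collection size catches it.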

Phase 3: Investigate Systematically

For each hypothesis:
  1. Identify what evidence would confirm or refute it
  2. Gather that evidence (logs, code reading, adding debug output)
  3. Update hypothesis ranking based on findings
  4. Move to next hypothesis if current one is eliminated
Freenet-specific data gathering:
| What You Need | How to Get It | Access |
|---|---|---|
| Event sequences | `RUST_LOG=info` + `--nocapture` on simulation tests | Everyone |
| Network message patterns | `sim.get_network_stats()` in simulation tests | Everyone |
| Convergence behavior | `sim.await_convergence(timeout, poll, min_contracts)` | Everyone |
| Virtual time state | `sim.virtual_time().now_nanos()` | Everyone |
| Git history of affected code | `git log --oneline -20 -- path/to/file.rs` | Everyone |
| Fault injection results | SimNetwork + FaultConfig, then inspect stats | Everyone |
| Gateway logs | Access to running gateway node | Limited — not all contributors |
| Aggregate telemetry | Production monitoring dashboards | Limited — core team only |
| Real network packet captures | Physical access to test machines | Limited — specific environments |
For module-specific data gathering techniques, see Module-Specific Debugging Guide — it covers observation APIs, `#[freenet_test]` event capture, `RUST_LOG` targets, and fault injection per module.
Parallel investigation with subagents:
Use `general-purpose` agents with `codebase-investigator` instructions for independent, read-only investigations. Spawn multiple in parallel, each with a specific focus.
Spawn investigators in parallel using Task tool (subagent_type="general-purpose"):

1. "You are a codebase-investigator. [Include agents/codebase-investigator.md instructions]
    Search for similar error handling patterns in the codebase related to [bug description]"

2. "You are a codebase-investigator. [Include agents/codebase-investigator.md instructions]
    Check git history for recent changes to [affected module/files]"

3. "You are a codebase-investigator. [Include agents/codebase-investigator.md instructions]
    Read and analyze [test file] and related fixtures for [component]"
Guidelines:
  • Each investigator focuses on one hypothesis or evidence type
  • Only parallelize read-only tasks — code changes must be sequential
  • Investigators report findings; you synthesize and decide next steps

Phase 4: Fix and Verify

  1. Fix the root cause — Not symptoms
  2. Verify with deterministic reproduction — Re-run the failing test with the same seed
  3. Check for regressions
    cargo test -p freenet
  4. Consider edge cases — Does the fix handle similar scenarios?
  5. Verify determinism — If you added new code, ensure it uses `TimeSource` and `GlobalRng` (not `std::time`/`rand` directly)
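The time-abstraction idea behind step 5 can be illustrated with a minimal sketch. `Clock`, `RealClock`, and `VirtualClock` here are hypothetical stand-ins; Freenet's actual `TimeSource` API may differ:

```rust
use std::time::{Duration, Instant};

// Core logic depends on a trait, not on std::time directly, so
// simulation tests can substitute a deterministic clock.
trait Clock {
    fn now(&self) -> Duration; // time since some fixed epoch
}

// Production: backed by a real monotonic clock.
struct RealClock {
    start: Instant,
}
impl Clock for RealClock {
    fn now(&self) -> Duration {
        self.start.elapsed()
    }
}

// Simulation: virtual time that advances only when told to.
struct VirtualClock {
    now: Duration,
}
impl VirtualClock {
    fn advance(&mut self, d: Duration) {
        self.now += d;
    }
}
impl Clock for VirtualClock {
    fn now(&self) -> Duration {
        self.now
    }
}

// Example of core logic written against the trait.
fn is_expired(clock: &dyn Clock, deadline: Duration) -> bool {
    clock.now() >= deadline
}

fn main() {
    let rc = RealClock { start: Instant::now() };
    assert!(rc.now() < Duration::from_secs(1));

    let mut vc = VirtualClock { now: Duration::ZERO };
    assert!(!is_expired(&vc, Duration::from_secs(5)));
    vc.advance(Duration::from_secs(6));
    assert!(is_expired(&vc, Duration::from_secs(5)));
    println!("expiry is fully deterministic under virtual time");
}
```

Any timeout or TTL logic written this way reproduces identically on every run, which is what makes seed-based replay possible.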
Phase 5: Test Coverage Analysis

Always ask: "Why didn't CI catch this?"
Freenet has multiple test layers:
| Layer | Scope | What It Catches |
|---|---|---|
| Unit tests (~1000) | Individual functions | Logic errors, algorithm bugs |
| Integration tests (~80) | Component interactions | Interface mismatches, data flow bugs |
| Simulation tests | Multi-node deterministic | State machine bugs, race conditions, protocol errors |
| `fdev single-process` | Quick multi-peer | Basic distributed behavior |
| `freenet-test-network` | 20+ peers in Docker | Scale-dependent bugs, realistic network behavior |
| Real network tests | Physical machines | NAT traversal, real latency, UDP behavior |
If a bug reached production or manual testing, there's a gap. Investigate:
  1. Which test layer should have caught this?
    • Logic error → unit test
    • Component interaction bug → integration test
    • Distributed/state machine behavior → simulation test with `#[freenet_test]`
    • Fault tolerance → SimNetwork with FaultConfig
    • Scale-dependent → freenet-test-network
  2. Why didn't the existing tests catch it?
    • Tests use different topology/configuration than production
    • Tests mock components that exhibit the bug in real usage
    • Simulation doesn't inject the right fault conditions
    • Test assertions too weak to detect the failure
    • Determinism violation — code path bypasses `TimeSource`/`GlobalRng`
  3. Document the gap — Include in the issue/PR:
    • What test would have caught this
    • Why existing tests didn't
    • Whether a new test should be added to prevent regression

Anti-Patterns to Avoid

Jumping to conclusions

  • Wrong: See error, immediately change code that seems related
  • Right: Form hypothesis, gather evidence, then act

Tunnel vision

  • Wrong: Spend hours on one theory despite contradicting evidence
  • Right: Set time bounds, pivot when evidence points elsewhere

Weakening tests

  • Wrong: Test fails, reduce assertions or add exceptions to make it pass
  • Right: Understand why the test expects what it does, fix the code to meet that expectation
  • Exception: The test itself has a bug or tests incorrect behavior (rare, requires clear justification)

Sequential investigation when parallel is possible

  • Wrong: Read file A, wait, read file B, wait, read file C
  • Right: Spawn `codebase-investigator` agents to read A, B, C concurrently, synthesize findings

Fixing without understanding

  • Wrong: Copy a fix from Stack Overflow that makes the error go away
  • Right: Understand why the fix works and whether it addresses root cause

Skipping the scope ladder

  • Wrong: Jump straight to real network debugging when the bug could be reproduced in a unit test
  • Right: Start small — unit test, then simulation, then real network

Breaking determinism

  • Wrong: Use `std::time::Instant::now()` or `rand::random()` in core logic
  • Right: Use the `TimeSource` trait and `GlobalRng` so simulation tests remain reproducible
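A minimal sketch of why seeded randomness keeps runs reproducible; `SeededRng` is a toy xorshift PRNG for illustration, not Freenet's `GlobalRng`:

```rust
// A tiny xorshift64 generator: the same seed always yields the same
// sequence, so a failing simulation run can be replayed exactly.
struct SeededRng {
    state: u64,
}

impl SeededRng {
    fn new(seed: u64) -> Self {
        // xorshift state must be nonzero
        Self { state: seed.max(1) }
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.state;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.state = x;
        x
    }
}

fn main() {
    let mut a = SeededRng::new(0xDEAD_BEEF);
    let mut b = SeededRng::new(0xDEAD_BEEF);
    // Same seed, same sequence: the run is reproducible.
    for _ in 0..5 {
        assert_eq!(a.next_u64(), b.next_u64());
    }
    println!("same seed reproduces the same sequence");
}
```

Calling `rand::random()` in core logic breaks exactly this property: two runs with the same seed diverge, and `--seed 0xDEADBEEF` no longer reproduces the bug.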

Assuming data access

  • Wrong: "Check the gateway logs to see what happened" (not everyone has gateway access)
  • Right: Design reproduction steps using simulation tests and `RUST_LOG` that any contributor can run

Checklist Before Declaring "Fixed"

  • Root cause identified and understood
  • Fix addresses root cause, not symptoms
  • Original failure no longer reproduces
  • No new test failures introduced
  • Test added if one didn't exist (when practical)
  • No test assertions weakened or disabled
  • Answered "why didn't CI catch this?" and documented the test gap