systematic-debugging

Systematic Debugging

When to Use

Invoke this methodology automatically when:

A test fails and the cause isn't immediately obvious
Unexpected behavior occurs in production or development
An error message doesn't directly point to the fix
Multiple potential causes exist

Core Principles

Hypothesize before acting - Form explicit hypotheses about root cause before changing code
Test hypotheses systematically - Validate or eliminate each hypothesis with evidence
Parallelize investigation - Use subagents for concurrent readonly exploration
Preserve test integrity - Never weaken tests to make them pass

Debugging Scope Ladder

Always prefer the smallest, most reproducible scope that demonstrates the bug. Work up the ladder only when the smaller scope can't reproduce or doesn't apply:

Priority	Scope	When to Use	Command
1	Unit test	Logic errors, algorithm bugs, single-function issues	`cargo test -p freenet -- specific_test`
2	Mocked unit test	Transport/ring logic needing isolation	Unit test with `MockNetworkBridge` / `MockRing`
3	Simulation test	Multi-node behavior, state machines, race conditions	`cargo test -p freenet --test simulation_integration -- --test-threads=1`
4	SimNetwork + FaultConfig	Fault tolerance, message loss, network partitions	SimNetwork with configured fault injection
5	fdev single-process	Quick multi-peer CI validation	`cargo run -p fdev -- test --seed 42 single-process`
6	freenet-test-network	20+ peer large-scale behavior	Docker-based `freenet-test-network`
7	Real network	Issues that only manifest with real UDP/NAT/latency	Manual multi-peer test across machines

Why this order matters:

Lower scopes are faster, deterministic, and reproducible by anyone
Higher scopes require more infrastructure and time, and may not be accessible to all contributors
Gateway logs, aggregate telemetry, and production metrics are not available to every developer — don't assume access to these when designing reproduction steps
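
The ladder can be read as an ordered type plus an escalation step. The sketch below is illustrative only: `Scope`, `BugTraits`, `starting_scope`, and `escalate` are hypothetical names that do not come from the Freenet codebase, and the yes/no questions simplify the table above.

```rust
/// Rungs of the debugging scope ladder, smallest first.
/// Deriving `Ord` makes a smaller scope compare as "less than" a larger one.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Scope {
    UnitTest,
    MockedUnitTest,
    SimulationTest,
    SimNetworkWithFaults,
    FdevSingleProcess,
    TestNetwork,
    RealNetwork,
}

/// Hypothetical summary of what a bug needs in order to show up.
#[derive(Debug, Clone, Copy, Default)]
pub struct BugTraits {
    pub needs_transport_or_ring_isolation: bool,
    pub needs_multiple_nodes: bool,
    pub needs_fault_injection: bool,
}

/// Pick the smallest rung that could plausibly reproduce the bug.
pub fn starting_scope(bug: BugTraits) -> Scope {
    if bug.needs_fault_injection {
        Scope::SimNetworkWithFaults
    } else if bug.needs_multiple_nodes {
        Scope::SimulationTest
    } else if bug.needs_transport_or_ring_isolation {
        Scope::MockedUnitTest
    } else {
        Scope::UnitTest
    }
}

/// Move one rung up, used only when the current rung cannot reproduce the bug.
/// Returns `None` at the top of the ladder.
pub fn escalate(scope: Scope) -> Option<Scope> {
    match scope {
        Scope::UnitTest => Some(Scope::MockedUnitTest),
        Scope::MockedUnitTest => Some(Scope::SimulationTest),
        Scope::SimulationTest => Some(Scope::SimNetworkWithFaults),
        Scope::SimNetworkWithFaults => Some(Scope::FdevSingleProcess),
        Scope::FdevSingleProcess => Some(Scope::TestNetwork),
        Scope::TestNetwork => Some(Scope::RealNetwork),
        Scope::RealNetwork => None,
    }
}
```

Starting low and calling `escalate` only after a failed reproduction keeps the investigation on the cheapest rung that still shows the bug.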

Debugging Workflow

Phase 1: Reproduce and Isolate

Reproduce the failure — Confirm the bug exists and is reproducible
Use the scope ladder — Start at the smallest scope that can demonstrate the bug:
- Can you write a unit test? Try that first
- Needs multiple nodes? Use the simulation framework with a deterministic seed
- Only happens under fault conditions? Use `SimNetwork` with `FaultConfig`
- Can't reproduce in simulation? Then escalate to real network testing
Record the seed — When using simulation tests, always record the seed value for reproducibility
Gather initial evidence — Read error messages, logs, stack traces

Simulation-first approach for distributed bugs:

```bash
# Run simulation tests deterministically
cargo test -p freenet --features simulation_tests --test sim_network -- --test-threads=1

# With logging to observe event sequences
RUST_LOG=info cargo test -p freenet --features simulation_tests --test sim_network -- --nocapture --test-threads=1

# Reproduce with a specific seed
cargo run -p fdev -- test --seed 0xDEADBEEF single-process
```
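
Recording the seed matters because a seeded generator replays exactly the same sequence of "random" decisions. The following is a minimal sketch of that property using a small SplitMix64 generator written in plain Rust. It is a stand-in chosen for illustration, not Freenet's `GlobalRng` or the generator `fdev` uses, and `simulate_drops` is a hypothetical toy simulation.

```rust
/// Minimal SplitMix64 generator, used only as a stand-in for a seeded
/// simulation RNG. This is NOT Freenet's `GlobalRng`.
pub struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    pub fn new(seed: u64) -> Self {
        Self { state: seed }
    }

    /// Advance the state and return the next pseudo-random value.
    pub fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Hypothetical toy simulation: for each of `n` messages, decide whether
/// it is dropped (roughly one in ten). All randomness comes from `seed`.
pub fn simulate_drops(seed: u64, n: usize) -> Vec<bool> {
    let mut rng = SplitMix64::new(seed);
    (0..n).map(|_| rng.next_u64() % 10 == 0).collect()
}
```

Two runs of `simulate_drops` with the same seed produce the same drop pattern, which is what makes a failure found under one seed replayable by another contributor.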

Phase 2: Form Hypotheses

Before touching any code, explicitly list potential causes:

Hypotheses:
1. [Most likely] The X component isn't handling Y case
2. [Possible] Race condition between A and B
3. [Less likely] Configuration mismatch in Z

Rank by likelihood based on evidence. Avoid anchoring on the first idea.
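
One way to keep the ranking explicit is to track hypotheses as data. This is a rough sketch under assumed names (`Likelihood`, `Hypothesis`, `next_to_investigate` are all hypothetical); it only shows the idea of always picking the most likely hypothesis that evidence has not yet eliminated.

```rust
/// Coarse likelihood buckets, ordered from least to most likely.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Likelihood {
    LessLikely,
    Possible,
    MostLikely,
}

/// One candidate root cause and its current status.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Hypothesis {
    pub description: String,
    pub likelihood: Likelihood,
    pub eliminated: bool,
}

/// Return the next hypothesis to investigate: the most likely one that
/// has not been eliminated, or `None` when every hypothesis is ruled out.
pub fn next_to_investigate(hypotheses: &[Hypothesis]) -> Option<&Hypothesis> {
    hypotheses
        .iter()
        .filter(|h| !h.eliminated)
        .max_by_key(|h| h.likelihood)
}
```

Getting `None` back is a useful signal in itself: every listed cause has been ruled out, so the list needs new hypotheses rather than a second pass over the old ones.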

Freenet-specific hypothesis patterns:

State machine bugs — Invalid transitions in operations (CONNECT, GET, PUT, UPDATE, SUBSCRIBE)
Ring/routing errors — Incorrect peer selection, distance calculations, topology issues
Transport issues — UDP packet loss handling, encryption/decryption, connection lifecycle
Contract execution — WASM sandbox issues, state verification failures
Determinism violations — Code using `std::time::Instant::now()` instead of `TimeSource`, or `rand::random()` instead of `GlobalRng`
Silent failure / fire-and-forget — Spawned task dies with no error propagation (check: is the `JoinHandle` stored and polled? what happens if the task exits?), broadcast sent to zero targets with no warning, channel overflow silently dropping messages. Look for: `tokio::spawn` without `.await` / `.abort()`, `let _ = sender.send()`, missing logging on empty target sets
Resource exhaustion — HashMap/Vec/channel entries inserted but never removed, causing unbounded memory growth or channel backpressure. Check: is there a cleanup path for every insert? Is cleanup triggered on both success AND failure/timeout? Run sustained operations and assert collection sizes stay bounded
Incomplete wiring — Feature only works for some operation types (e.g., router feedback wired for GET but not subscribe/put/update). When debugging "X doesn't work for operation Y," check all enum variants in the dispatch path — commented-out arms, `_ => Irrelevant` catch-alls, and missing match arms are common
TTL/timing race conditions — Two time-dependent operations where the first can expire before the second completes (e.g., transient TTL expires before CONNECT handshake, interest TTL expires before subscription renewal, broadcast fires before subscriptions complete). Check: what happens if operation A takes longer than timeout B?
Regressions from "safe" changes — A seemingly harmless change (code simplification, removing a feature flag, changing defaults) breaks an invariant that nothing tests. When a recent commit looks innocent, check what implicit behaviors it removed
Mock/test divergence — Bug can't be reproduced in tests because the mock runtime behaves differently from production. Check: does the mock skip side effects (e.g., BSC emission)? Does the test use a different code path than production (e.g., explicit subscribe vs background subscribe)? Does the mock socket behave differently from real UDP?
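
To make the resource-exhaustion check from the list above concrete, here is a small sketch of "a cleanup path for every insert, on success and on failure/timeout" together with a sustained-load assertion. `PendingOps`, `Outcome`, and `in_flight_after_sustained_load` are hypothetical names for illustration and are not taken from Freenet.

```rust
use std::collections::HashMap;

/// How an operation ended.
#[derive(Debug, Clone, Copy)]
pub enum Outcome {
    Success,
    Failure,
    Timeout,
}

/// Hypothetical tracker for in-flight operations, keyed by operation id.
#[derive(Default)]
pub struct PendingOps {
    pending: HashMap<u64, String>,
}

impl PendingOps {
    /// Record a newly started operation.
    pub fn start(&mut self, id: u64, label: &str) {
        self.pending.insert(id, label.to_string());
    }

    /// Remove the entry for every outcome. A common leak is removing the
    /// entry only in the success arm and forgetting failure and timeout.
    pub fn finish(&mut self, id: u64, outcome: Outcome) {
        match outcome {
            Outcome::Success | Outcome::Failure | Outcome::Timeout => {
                self.pending.remove(&id);
            }
        }
    }

    /// Number of operations that were started but not yet finished.
    pub fn in_flight(&self) -> usize {
        self.pending.len()
    }
}

/// Run many operations with mixed outcomes and report what is left behind.
/// A bounded (here: zero) result means no entries leaked.
pub fn in_flight_after_sustained_load(ops: u64) -> usize {
    let mut tracker = PendingOps::default();
    for id in 0..ops {
        tracker.start(id, "GET");
        let outcome = match id % 3 {
            0 => Outcome::Success,
            1 => Outcome::Failure,
            _ => Outcome::Timeout,
        };
        tracker.finish(id, outcome);
    }
    tracker.in_flight()
}
```

The sustained-load function is the part worth copying into a test: it asserts on collection size after many operations rather than on the result of a single one.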

See Module-Specific Debugging Guide for detailed bug patterns, data collection strategies, and test approaches per module.
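
As an illustration of the incomplete-wiring pattern listed above, a dispatch `match` that names every variant (with no `_ =>` catch-all) makes the compiler reject the code when a new operation type is added without a decision. The enum and function below are hypothetical, and which operations should report router feedback is an assumption made purely for the example.

```rust
/// Hypothetical operation kinds, mirroring the operations named above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OpKind {
    Connect,
    Get,
    Put,
    Update,
    Subscribe,
}

/// Hypothetical dispatch decision: does this operation report router feedback?
/// Every variant is listed explicitly, so adding a variant to `OpKind`
/// is a compile error here until someone decides how to handle it.
pub fn reports_router_feedback(op: OpKind) -> bool {
    match op {
        OpKind::Get | OpKind::Put | OpKind::Update | OpKind::Subscribe => true,
        // An explicit decision for this variant, not one hidden behind `_ =>`.
        OpKind::Connect => false,
    }
}
```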

Phase 3: Investigate Systematically

For each hypothesis:

Identify what evidence would confirm or refute it
Gather that evidence (logs, code reading, adding debug output)
Update hypothesis ranking based on findings
Move to next hypothesis if current one is eliminated

Freenet-specific data gathering:

What You Need	How to Get It	Access
Event sequences	`RUST_LOG=info` + `--nocapture` on simulation tests	Everyone
Network message patterns	`sim.get_network_stats()` in simulation tests	Everyone
Convergence behavior	`sim.await_convergence(timeout, poll, min_contracts)`	Everyone
Virtual time state	`sim.virtual_time().now_nanos()`	Everyone
Git history of affected code	`git log --oneline -20 -- path/to/file.rs`	Everyone
Fault injection results	SimNetwork + FaultConfig, then inspect stats	Everyone
Gateway logs	Access to running gateway node	Limited — not all contributors
Aggregate telemetry	Production monitoring dashboards	Limited — core team only
Real network packet captures	Physical access to test machines	Limited — specific environments

For module-specific data gathering techniques, see Module-Specific Debugging Guide — it covers observation APIs, `#[freenet_test]` event capture, `RUST_LOG` targets, and fault injection per module.

Parallel investigation with subagents:

Use `general-purpose` agents with `codebase-investigator` instructions for independent, readonly investigations. Spawn multiple in parallel, each with a specific focus.

Spawn investigators in parallel using Task tool (subagent_type="general-purpose"):

1. "You are a codebase-investigator. [Include agents/codebase-investigator.md instructions]
    Search for similar error handling patterns in the codebase related to [bug description]"

2. "You are a codebase-investigator. [Include agents/codebase-investigator.md instructions]
    Check git history for recent changes to [affected module/files]"

3. "You are a codebase-investigator. [Include agents/codebase-investigator.md instructions]
    Read and analyze [test file] and related fixtures for [component]"

Guidelines:

Each investigator focuses on one hypothesis or evidence type
Only parallelize readonly tasks — code changes must be sequential
Investigators report findings; you synthesize and decide next steps
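
The same split (parallel readonly work, then a single point of synthesis) can be sketched in plain Rust with scoped threads. This is only an analogy for the guidelines above, not how subagents are implemented; `count_mentions` and `investigate_in_parallel` are hypothetical helpers.

```rust
use std::thread;

/// Hypothetical readonly investigation: count lines mentioning a keyword.
/// It only borrows the text, so running several at once is safe.
pub fn count_mentions(text: &str, keyword: &str) -> usize {
    text.lines().filter(|line| line.contains(keyword)).count()
}

/// Run one investigation per keyword in parallel, then gather the findings
/// in one place, in the same order as `keywords`.
pub fn investigate_in_parallel(text: &str, keywords: &[&str]) -> Vec<(String, usize)> {
    thread::scope(|scope| {
        // Spawn every investigator before joining any of them.
        let handles: Vec<_> = keywords
            .iter()
            .map(|&kw| scope.spawn(move || (kw.to_string(), count_mentions(text, kw))))
            .collect();

        // Synthesis step: collect all findings sequentially.
        let findings: Vec<(String, usize)> = handles
            .into_iter()
            .map(|handle| handle.join().expect("investigator thread panicked"))
            .collect();
        findings
    })
}
```

Because the investigators only read shared data, no locking is needed; anything that mutates state belongs in the sequential step after the findings are collected.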

Phase 4: Fix and Verify

Fix the root cause — Not symptoms
Verify with deterministic reproduction — Re-run the failing test with the same seed
Check for regressions — `cargo test -p freenet`
Consider edge cases — Does the fix handle similar scenarios?
Verify determinism — If you added new code, ensure it uses `TimeSource` and `GlobalRng` (not `std::time` / `rand` directly)
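
The determinism check comes down to whether new code asks an injected source for the time or reads the system clock itself. The sketch below shows the injection idea with a hypothetical `Clock` trait and `ManualClock` test double. Freenet's real `TimeSource` may have a different shape, so treat the names and signatures here as assumptions.

```rust
use std::cell::Cell;
use std::time::Duration;

/// Hypothetical stand-in for an injectable time source.
pub trait Clock {
    /// Time elapsed since an arbitrary fixed origin.
    fn now(&self) -> Duration;
}

/// Test clock whose time moves only when the test advances it.
pub struct ManualClock {
    now: Cell<Duration>,
}

impl ManualClock {
    pub fn new() -> Self {
        Self { now: Cell::new(Duration::ZERO) }
    }

    /// Move virtual time forward by `by`.
    pub fn advance(&self, by: Duration) {
        self.now.set(self.now.get() + by);
    }
}

impl Clock for ManualClock {
    fn now(&self) -> Duration {
        self.now.get()
    }
}

/// Core logic asks the injected clock for the time instead of calling
/// `std::time::Instant::now()` directly, so tests control expiry exactly.
pub fn is_expired(clock: &dyn Clock, created_at: Duration, ttl: Duration) -> bool {
    clock.now().saturating_sub(created_at) >= ttl
}
```

With a manual clock, a TTL boundary can be tested at exactly 29 and 30 seconds without sleeping, and the result is the same on every run.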

Phase 5: Test Coverage Analysis

Always ask: "Why didn't CI catch this?"

Freenet has multiple test layers:

Layer	Scope	What It Catches
Unit tests (~1000)	Individual functions	Logic errors, algorithm bugs
Integration tests (~80)	Component interactions	Interface mismatches, data flow bugs
Simulation tests	Multi-node deterministic	State machine bugs, race conditions, protocol errors
`fdev single-process`	Quick multi-peer	Basic distributed behavior
`freenet-test-network`	20+ peers in Docker	Scale-dependent bugs, realistic network behavior
Real network tests	Physical machines	NAT traversal, real latency, UDP behavior

If a bug reached production or manual testing, there's a gap. Investigate:

Which test layer should have caught this?
- Logic error → unit test
- Component interaction bug → integration test
- Distributed/state machine behavior → simulation test with `#[freenet_test]`
- Fault tolerance → SimNetwork with FaultConfig
- Scale-dependent → freenet-test-network
Why didn't the existing tests catch it?
- Tests use different topology/configuration than production
- Tests mock components that exhibit the bug in real usage
- Simulation doesn't inject the right fault conditions
- Test assertions too weak to detect the failure
- Determinism violation — code path bypasses `TimeSource` / `GlobalRng`
Document the gap — Include in the issue/PR:
- What test would have caught this
- Why existing tests didn't
- Whether a new test should be added to prevent regression
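
"Test assertions too weak to detect the failure" is easier to see with a concrete case. The sketch below assumes peer positions on a ring of circumference 1.0. That model and all function names are hypothetical, chosen only to contrast a weak check with a strong one. The deliberately buggy variant ignores wrap-around and still passes the weak check.

```rust
/// Distance between two positions on a ring of circumference 1.0.
pub fn ring_distance(a: f64, b: f64) -> f64 {
    let d = (a - b).abs();
    d.min(1.0 - d)
}

/// Correct version: the `k` peers closest to `target` by ring distance.
pub fn closest_peers(peers: &[f64], target: f64, k: usize) -> Vec<f64> {
    let mut sorted = peers.to_vec();
    sorted.sort_by(|a, b| ring_distance(*a, target).total_cmp(&ring_distance(*b, target)));
    sorted.truncate(k);
    sorted
}

/// Deliberately buggy version: uses linear distance and ignores wrap-around.
pub fn closest_peers_linear(peers: &[f64], target: f64, k: usize) -> Vec<f64> {
    let mut sorted = peers.to_vec();
    sorted.sort_by(|a, b| (a - target).abs().total_cmp(&(b - target).abs()));
    sorted.truncate(k);
    sorted
}

/// Weak check: only the number of results. Both versions pass it.
pub fn weak_check(result: &[f64]) -> bool {
    result.len() == 2
}

/// Strong check for peers [0.1, 0.5, 0.95] and target 0.0: the exact peers
/// in the exact order. Only the ring-aware version passes it.
pub fn strong_check(result: &[f64]) -> bool {
    result == [0.95, 0.1]
}
```

A gap report for a bug like this would name the strong assertion as the test that would have caught it and the count-only assertion as the reason the existing test did not.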

Anti-Patterns to Avoid

Jumping to conclusions

Wrong: See error, immediately change code that seems related
Right: Form hypothesis, gather evidence, then act

Tunnel vision

Wrong: Spend hours on one theory despite contradicting evidence
Right: Set time bounds, pivot when evidence points elsewhere

Weakening tests

Wrong: Test fails, reduce assertions or add exceptions to make it pass
Right: Understand why the test expects what it does, fix the code to meet that expectation
Exception: The test itself has a bug or tests incorrect behavior (rare, requires clear justification)

Sequential investigation when parallel is possible

Wrong: Read file A, wait, read file B, wait, read file C
Right: Spawn `codebase-investigator` agents to read A, B, C concurrently, synthesize findings

Fixing without understanding

Wrong: Copy a fix from Stack Overflow that makes the error go away
Right: Understand why the fix works and whether it addresses root cause

Skipping the scope ladder

Wrong: Jump straight to real network debugging when the bug could be reproduced in a unit test
Right: Start small — unit test, then simulation, then real network

Breaking determinism

Wrong: Use `std::time::Instant::now()` or `rand::random()` in core logic
Right: Use `TimeSource` trait and `GlobalRng` so simulation tests remain reproducible

Assuming data access

Wrong: "Check the gateway logs to see what happened" (not everyone has gateway access)
Right: Design reproduction steps using simulation tests and `RUST_LOG` that any contributor can run
