evanflow-tdd

EvanFlow: TDD

Vocabulary

See the `evanflow` meta-skill. Key terms: vertical slice, behavior through public interface, deep module.

Core Principle

Tests verify behavior through public interfaces, not implementation details. Code can change entirely; tests shouldn't break unless behavior changes.

Good test: "user can perform action X within their weekly rate limit". It describes a capability.

Bad test: "calls `createX()` with status `'QUEUED'` then queues a job". It describes mechanics, so a rename breaks it.
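
To make the contrast concrete, here is a minimal sketch of a behavior-level test. `ActionService`, `perform`, and `ActionResult` are hypothetical names invented for illustration, and the weekly window reset is left out to keep the example short. The test only calls the public method and checks what a caller can observe.

```typescript
import { strictEqual } from "node:assert";

// Hypothetical result type and service, invented for this sketch.
type ActionResult = { ok: true } | { ok: false; reason: "RATE_LIMITED" };

class ActionService {
  private readonly weeklyLimit: number;
  private readonly used = new Map<string, number>();

  constructor(weeklyLimit: number) {
    this.weeklyLimit = weeklyLimit;
  }

  // The only public entry point; how usage is tracked is an internal detail.
  perform(userId: string): ActionResult {
    const count = this.used.get(userId) ?? 0;
    if (count >= this.weeklyLimit) {
      return { ok: false, reason: "RATE_LIMITED" };
    }
    this.used.set(userId, count + 1);
    return { ok: true };
  }
}

// Behavior test: "user can perform action X within their weekly rate limit".
function testUserCanActWithinWeeklyLimit(): void {
  const service = new ActionService(2);
  strictEqual(service.perform("alice").ok, true);
  strictEqual(service.perform("alice").ok, true);
  // Over the limit: the caller sees a rejection, whatever the internals do.
  strictEqual(service.perform("alice").ok, false);
}

testUserCanActWithinWeeklyLimit();
```

Swapping the `Map` for a database table or renaming private members would leave this test untouched, which is the property the principle asks for.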

Anti-Pattern: Horizontal Slices

DO NOT write all tests first then all implementation. That produces tests of imagined behavior, not actual behavior. They become insensitive to real changes.

DO vertical slices: one test → one implementation → repeat. Each test responds to what you learned from the previous cycle.

When to Use

Any production code change (new feature, bug fix, behavior change, refactor with behavior implications)
All new code in your backend's routers and services
All new code in your frontend that has testable logic (not pure-presentation components)

When to Skip (with explicit user approval)

Throwaway prototypes
Generated code (e.g., `database.types.ts`)
Configuration files
Pure-presentation React components with no logic

The Flow

1. Embedded Grill — "What to Test"

Before writing any test, confirm with the user:

"Which behaviors matter most? We can't test everything."
"What's the public interface — what will callers actually use?"
"Are there opportunities to make this a deep module (small interface, complex internals)?"
"Where do tests need to integrate with real services (DB, payment provider, email provider) vs. where can we test in isolation?"

Anti-tailoring check (vertical slicing's biggest risk): before each new test, ask:

"Am I pinning behavior the spec/contract names, or am I pinning the impl I've already imagined?"
"Could I write this next test knowing only the public contract, before reading any of the impl I just wrote?"
"If a different impl satisfied the same contract, would this test still pass?"

If the test only makes sense given your specific impl, it's an internals test wearing a behavior costume. Rewrite it against the contract, or drop it.

Default to integration-style tests against real services (real DB, real queue, real cache) where feasible. Mocked dependencies frequently mask divergence between test and production behavior. Document any project-specific exception in your CLAUDE.md.
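
One way to apply the third anti-tailoring question is to write the test as a function of the contract and run it against more than one implementation. The sketch below assumes a hypothetical `UniqueVisitors` interface with two interchangeable implementations. All names are invented for illustration.

```typescript
import { strictEqual } from "node:assert";

// Hypothetical public contract, invented for this sketch.
interface UniqueVisitors {
  record(visitorId: string): void;
  count(): number;
}

// Implementation A: backed by a Set.
class SetVisitors implements UniqueVisitors {
  private readonly seen = new Set<string>();
  record(visitorId: string): void {
    this.seen.add(visitorId);
  }
  count(): number {
    return this.seen.size;
  }
}

// Implementation B: backed by an array with a linear scan.
class ArrayVisitors implements UniqueVisitors {
  private readonly seen: string[] = [];
  record(visitorId: string): void {
    if (this.seen.indexOf(visitorId) === -1) this.seen.push(visitorId);
  }
  count(): number {
    return this.seen.length;
  }
}

// Contract test: written against the interface only.
function testRepeatVisitsCountOnce(make: () => UniqueVisitors): void {
  const visitors = make();
  visitors.record("a");
  visitors.record("a");
  visitors.record("b");
  strictEqual(visitors.count(), 2);
}

testRepeatVisitsCountOnce(() => new SetVisitors());
testRepeatVisitsCountOnce(() => new ArrayVisitors());
```

A test that reached into `seen` would pass for only one of the two classes, which signals that it pins the implementation and not the contract.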

2. Tracer Bullet

Write ONE test for ONE behavior end-to-end. Prove the path works.

RED:      Write test → run → confirm it fails for the RIGHT reason
GREEN:    Write minimal code → run → confirm it passes
REFACTOR: Clean the impl + the test you just wrote, while it's fresh and green

The REFACTOR step is non-optional and per-cycle: it happens with the test you just wrote as your safety net, not after all tests are done. Refactoring cold code days later is a different (weaker) activity; that lives in `evanflow-iterate`.
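
A single cycle might look like the sketch below. The `Cart` type and the function names are hypothetical, chosen only for illustration. The test returns a message instead of throwing so that the RED and GREEN runs can sit side by side in one script; in a real project the test runner would report both.

```typescript
import { strictEqual } from "node:assert";

type Cart = { items: { priceCents: number; qty: number }[] };
type TotalFn = (cart: Cart) => number;

// The one tracer-bullet test: a cart total is the sum of price * quantity.
function testTotalSumsLineItems(total: TotalFn): string {
  const cart: Cart = {
    items: [
      { priceCents: 250, qty: 2 },
      { priceCents: 100, qty: 1 },
    ],
  };
  const actual = total(cart);
  return actual === 600 ? "pass" : `fail: expected 600, got ${actual}`;
}

// RED: a stub exists, so the failure comes from the assertion
// and not from a missing symbol or a typo.
const stubTotal: TotalFn = () => 0;
strictEqual(testTotalSumsLineItems(stubTotal), "fail: expected 600, got 0");

// GREEN: just enough code for this one behavior.
const cartTotal: TotalFn = (cart) =>
  cart.items.reduce((sum, item) => sum + item.priceCents * item.qty, 0);
strictEqual(testTotalSumsLineItems(cartTotal), "pass");

// REFACTOR: with the test green, tidy names and setup, re-running after each change.
```

Checking the RED message matters: "expected 600, got 0" shows the test exercises the calculation, whereas a "not a function" error would only show that the wiring is missing.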

3. Incremental Loop

For each remaining behavior, repeat the full RED-GREEN-REFACTOR cycle:

RED:      Write next test → fails for the right reason
GREEN:    Minimal code to pass → passes
REFACTOR: Clean before moving on (see checklist below)

Rules:

One test at a time
Only enough code to pass the current test
Don't anticipate future tests
Tests focus on observable behavior, not internals
Never skip REFACTOR. "I'll clean it up later" is how dead code, duplication, and shallow modules accumulate.

4. Per-Cycle Refactor Checklist

After each GREEN, before writing the next failing test, scan the just-touched code:

Duplication — extract if used twice with the same intent (not just structurally similar)
Naming — does the new name match what the code does? Rename now, while the test pins behavior
Deletion test — does the new module/function earn its existence, or did GREEN add bloat?
Deep-module check — small interface hiding the complexity, or shallow wrapper leaking it?
Test cleanliness — does the test still describe behavior crisply? Names, setup, assertion all clear?

Run tests after each refactor step. Never refactor while RED — get to GREEN first.

If a refactor would change behavior, stop: that's a new test, not a refactor.
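
As a small illustration of the deep-module check, the sketch below contrasts three thin helpers with one entry point that hides them. `slugify` and the helper names are hypothetical and stand in for whatever GREEN just produced.

```typescript
import { strictEqual } from "node:assert";

// Shallow: three thin helpers; every caller must know the steps and their order.
const trimTitle = (s: string): string => s.trim();
const lowerTitle = (s: string): string => s.toLowerCase();
const dashTitle = (s: string): string =>
  s.replace(/[^a-z0-9]+/g, "-").replace(/^-+|-+$/g, "");

// Deep: one small entry point that owns the steps and their order.
function slugify(title: string): string {
  return dashTitle(lowerTitle(trimTitle(title)));
}

// The behavior test pins the public result, so the helpers can be inlined,
// renamed, or deleted during REFACTOR without touching it.
strictEqual(slugify("  Hello, World!  "), "hello-world");
```

If callers only ever need `slugify`, the helpers are candidates for the deletion test: keeping them unexported or inlining them shrinks the interface without changing behavior.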

5. Macro Refactor (deferred to `evanflow-iterate`)

Cross-cutting refactors that span the whole feature (extracting a shared module across multiple cycles, pulling out a deeper abstraction, restructuring the file layout) belong in `evanflow-iterate`'s self-review pass, after all per-cycle refactors are done. Don't conflate the two: per-cycle refactor uses a fresh test as safety net; macro refactor uses the whole test suite.

Per-Cycle Checklist

[ ] Test describes behavior, not implementation
[ ] Test uses public interface only
[ ] Test would survive an internal refactor (rename, restructure)
[ ] Code is minimal for this test
[ ] No speculative features added
[ ] Test fails for the right reason before code is written
[ ] ASSERTION IS CORRECT — see warning below

⚠️ Assertion-Correctness Warning

Industry research (HumanEval evaluation across four LLMs) found that over 62% of LLM-generated test assertions were incorrect. This is the single most likely failure mode in LLM-driven TDD: the test passes, but it's testing the wrong thing.

Before writing any test assertion, verify:

Does this assertion match what the user actually wants? Don't assert on behavior you imagined — assert on behavior the spec/contract names.
Is this the assertion's most precise form? "result is truthy" is weaker than "result equals 42". Loose assertions let wrong results pass and miss real bugs.
Would this assertion still pass if the code were subtly wrong? Mentally introduce a one-character bug: does the assertion catch it? If not, the assertion is too weak.
Are you asserting on the right field? A common failure: asserting `response.status` when the meaningful field is `response.body.error`.
For computed values: did you compute the expected value correctly? Don't trust your own arithmetic — verify by hand or another path.

When in doubt about what to assert, STOP and ask the user rather than guess. An asserted-on-the-wrong-thing test is worse than no test — it provides false confidence.
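
The checks above can be tried out on a small case. The sketch below uses a hypothetical `applyDiscount` function and a copy with a one-character bug, then a made-up response object for the right-field check. None of these names come from a real API.

```typescript
import { ok, strictEqual, throws } from "node:assert";

// Hypothetical function under test, plus a copy with a one-character bug (+ for -).
const applyDiscount = (priceCents: number, percent: number): number =>
  Math.round(priceCents * (1 - percent / 100));
const buggyDiscount = (priceCents: number, percent: number): number =>
  Math.round(priceCents * (1 + percent / 100));

// Loose assertion: passes for the correct and the buggy version alike.
ok(applyDiscount(2000, 10));
ok(buggyDiscount(2000, 10));

// Precise assertion: passes for the correct version and rejects the bug.
strictEqual(applyDiscount(2000, 10), 1800);
throws(() => strictEqual(buggyDiscount(2000, 10), 1800));

// Right field: a hypothetical response where the status looks fine
// but the body carries the failure.
const response = { status: 200, body: { error: "card_declined" } };
strictEqual(response.status, 200); // passes, yet says nothing about the failure
strictEqual(response.body.error, "card_declined"); // asserts on the meaningful field
```

The expected value 1800 is worth checking by a second path (10% of 2000 is 200, and 2000 minus 200 is 1800) before it goes into the assertion.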

Hard Rules

Vertical slices only. Never write all tests first.
REFACTOR is per-cycle, not deferred. Every GREEN is followed by a refactor pass on the just-written code, with the fresh test as safety net. Deferring all refactor to the end strips the safety net and is the most common way TDD-shaped code ends up with TDD-shaped scars.
Test behavior, not internals. If a rename breaks a test but behavior didn't change, the test was wrong.
Watch the test fail. If you didn't see RED, you don't know it tests the right thing.
Never auto-commit. TDD cycle is RED-GREEN-REFACTOR, not RED-GREEN-REFACTOR-COMMIT.
Default to real services for integration tests. Mocked databases routinely diverge from production behavior — prefer a test DB unless your project documents a specific exception.

Hand-offs

Tests + impl complete for the task → return to `evanflow-executing-plans` to mark task done
Discovered the interface is wrong → `evanflow-design-interface` to redesign
Discovered deeper architectural issue → `evanflow-improve-architecture`
