evanflow-tdd
EvanFlow: TDD
Vocabulary
See the evanflow meta-skill. Key terms: vertical slice, behavior through public interface, deep module.
Core Principle
Tests verify behavior through public interfaces, not implementation details. Code can change entirely; tests shouldn't break unless behavior changes.
Good test: "user can perform action X within their weekly rate limit" — describes capability.
Bad test: "calls createX() with status 'QUEUED' then queues a job". This describes mechanics, so a rename breaks it.
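A minimal sketch of a behavior-level test in the sense described above. The service, its method, and the result shape (ActionService, performAction, Result) are hypothetical names invented for illustration; they are not part of EvanFlow.

```typescript
import { strict as assert } from "node:assert";

// Hypothetical result type and service, assumed only for this example.
type Result = { ok: true } | { ok: false; reason: "RATE_LIMITED" };

class ActionService {
  private readonly counts = new Map<string, number>();
  private readonly weeklyLimit: number;

  constructor(weeklyLimit: number) {
    this.weeklyLimit = weeklyLimit;
  }

  // Public interface: the only thing the test below is allowed to touch.
  performAction(userId: string): Result {
    const used = this.counts.get(userId) ?? 0;
    if (used >= this.weeklyLimit) {
      return { ok: false, reason: "RATE_LIMITED" };
    }
    this.counts.set(userId, used + 1);
    return { ok: true };
  }
}

// Behavior test: "user can perform the action within their weekly rate limit".
// It never inspects `counts` or any helper name, so internal renames cannot break it.
function testUserCanActWithinWeeklyLimit(): void {
  const service = new ActionService(2);
  assert.deepEqual(service.performAction("u1"), { ok: true });
  assert.deepEqual(service.performAction("u1"), { ok: true });
  assert.deepEqual(service.performAction("u1"), { ok: false, reason: "RATE_LIMITED" });
  // A different user has an independent allowance.
  assert.deepEqual(service.performAction("u2"), { ok: true });
}

testUserCanActWithinWeeklyLimit();
```

A mechanics-style test would instead spy on an internal call or read the private map, and would fail as soon as that internal was renamed even though users see no change.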
Anti-Pattern: Horizontal Slices
DO NOT write all tests first then all implementation. That produces tests of imagined behavior, not actual behavior. They become insensitive to real changes.
DO vertical slices: one test → one implementation → repeat. Each test responds to what you learned from the previous cycle.
When to Use
- Any production code change (new feature, bug fix, behavior change, refactor with behavior implications)
- All new code in your backend's routers and services
- All new code in your frontend that has testable logic (not pure-presentation components)
When to Skip (with explicit user approval)
- Throwaway prototypes
- Generated code (e.g., database.types.ts)
- Configuration files
- Pure-presentation React components with no logic
The Flow
1. Embedded Grill — "What to Test"
Before writing any test, confirm with the user:
- "Which behaviors matter most? We can't test everything."
- "What's the public interface — what will callers actually use?"
- "Are there opportunities to make this a deep module (small interface, complex internals)?"
- "Where do tests need to integrate with real services (DB, payment provider, email provider) vs. where can we test in isolation?"
Anti-tailoring check (vertical slicing's biggest risk): before each new test, ask:
- "Am I pinning behavior the spec/contract names, or am I pinning the impl I've already imagined?"
- "Could I write this next test knowing only the public contract, before reading any of the impl I just wrote?"
- "If a different impl satisfied the same contract, would this test still pass?"
If the test only makes sense given your specific impl, it's an internals test wearing a behavior costume. Rewrite it against the contract, or drop it.
Default to integration-style tests against real services (real DB, real queue, real cache) where feasible. Mocked dependencies frequently mask divergence between test and production behavior. Document any project-specific exception in your CLAUDE.md.
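One way to make the anti-tailoring questions concrete is to run the same test against two different implementations of one contract. The sketch below assumes a hypothetical Counter interface; MapCounter and ArrayCounter are illustrative names, not anything defined by EvanFlow.

```typescript
import { strict as assert } from "node:assert";

// Hypothetical public contract: small interface, internals hidden.
interface Counter {
  record(key: string): void;
  count(key: string): number;
}

// Implementation 1: keeps a running total per key.
class MapCounter implements Counter {
  private readonly counts = new Map<string, number>();
  record(key: string): void {
    this.counts.set(key, this.count(key) + 1);
  }
  count(key: string): number {
    return this.counts.get(key) ?? 0;
  }
}

// Implementation 2: stores every event and counts on demand.
class ArrayCounter implements Counter {
  private readonly events: string[] = [];
  record(key: string): void {
    this.events.push(key);
  }
  count(key: string): number {
    return this.events.filter((k) => k === key).length;
  }
}

// Contract test: written knowing only the Counter interface.
function contractTest(make: () => Counter): void {
  const counter = make();
  assert.equal(counter.count("a"), 0);
  counter.record("a");
  counter.record("a");
  counter.record("b");
  assert.equal(counter.count("a"), 2);
  assert.equal(counter.count("b"), 1);
}

// If a test only passed for one of these, it would be pinning the impl.
contractTest(() => new MapCounter());
contractTest(() => new ArrayCounter());
```

Writing a second implementation is rarely worth doing in practice; the point is the thought experiment behind the third question.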
2. Tracer Bullet
Write ONE test for ONE behavior end-to-end. Prove the path works.
RED: Write test → run → confirm it fails for the RIGHT reason
GREEN: Write minimal code → run → confirm it passes
REFACTOR: Clean the impl + the test you just wrote, while it's fresh and green
The REFACTOR step is non-optional and per-cycle: it happens with the test you just wrote as your safety net, not after all tests are done. Refactoring cold code days later is a different (weaker) activity; that lives in evanflow-iterate.
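A single tracer-bullet cycle might look like the sketch below. The slugify function is a hypothetical example chosen for brevity, and the stub stands in for "no implementation yet" so that the RED step can be shown in one runnable file.

```typescript
import { strict as assert, AssertionError } from "node:assert";

type Slugify = (title: string) => string;

// The one tracer-bullet test, written against the public function only.
function testSlugify(slugify: Slugify): void {
  assert.equal(slugify("Hello World"), "hello-world");
}

// RED: against a stub, the test must fail on the assertion itself,
// not on a typo, a bad import, or a missing function.
const stub: Slugify = () => "";
assert.throws(() => testSlugify(stub), AssertionError);

// GREEN: minimal code for this one behavior, nothing speculative.
const slugify: Slugify = (title) => title.toLowerCase().split(" ").join("-");
testSlugify(slugify);

// REFACTOR would happen here, with testSlugify re-run after each change.
```

Checking that the failure is an AssertionError is one way to confirm the test fails for the right reason; reading the failure message by eye serves the same purpose.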
3. Incremental Loop
For each remaining behavior, repeat the full RED-GREEN-REFACTOR cycle:
RED: Write next test → fails for the right reason
GREEN: Minimal code to pass → passes
REFACTOR: Clean before moving on (see checklist below)
Rules:
- One test at a time
- Only enough code to pass the current test
- Don't anticipate future tests
- Tests focus on observable behavior, not internals
- Never skip REFACTOR. "I'll clean it up later" is how dead code, duplication, and shallow modules accumulate.
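As an illustration of the rules above, here is a sketch of what a second cycle could add to a hypothetical slugify function. The function and both test names are assumptions made for the example.

```typescript
import { strict as assert } from "node:assert";

// Cycle 1 (already GREEN): words are lower-cased and joined with hyphens.
// Cycle 2 adds exactly one behavior: extra whitespace is ignored.
// Punctuation is deliberately not handled: no test asks for it yet.
function slugify(title: string): string {
  return title
    .trim()
    .toLowerCase()
    .split(/\s+/)
    .join("-");
}

// Test from cycle 1, kept unchanged as a regression check.
function testJoinsWordsWithHyphens(): void {
  assert.equal(slugify("Hello World"), "hello-world");
}

// Test from cycle 2, written only after cycle 1 was GREEN and refactored.
function testIgnoresExtraWhitespace(): void {
  assert.equal(slugify("  Hello   World "), "hello-world");
}

testJoinsWordsWithHyphens();
testIgnoresExtraWhitespace();
```

Each new test changes the implementation by the smallest step that turns it GREEN, and the earlier test keeps passing without modification.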
4. Per-Cycle Refactor Checklist
After each GREEN, before writing the next failing test, scan the just-touched code:
- Duplication — extract if used twice with the same intent (not just structurally similar)
- Naming — does the new name match what the code does? Rename now, while the test pins behavior
- Deletion test — does the new module/function earn its existence, or did GREEN add bloat?
- Deep-module check — small interface hiding the complexity, or shallow wrapper leaking it?
- Test cleanliness — does the test still describe behavior crisply? Names, setup, assertion all clear?
Run tests after each refactor step. Never refactor while RED — get to GREEN first.
If a refactor would change behavior, stop: that's a new test, not a refactor.
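The duplication and naming items in this checklist can be sketched as follows. The pricing functions are hypothetical; the example assumes GREEN left the same rounding expression in two places with the same intent.

```typescript
import { strict as assert } from "node:assert";

// Before the refactor, both pricing functions inlined
//   Math.round(amount * 100) / 100
// with the same intent (round to whole cents), so it is extracted and named.
function roundToCents(amount: number): number {
  return Math.round(amount * 100) / 100;
}

function priceWithTax(net: number, taxRate: number): number {
  return roundToCents(net * (1 + taxRate));
}

function priceWithDiscount(net: number, discountRate: number): number {
  return roundToCents(net * (1 - discountRate));
}

// The tests are untouched by the refactor and are re-run after each step.
assert.equal(priceWithTax(10, 0.2), 12);
assert.equal(priceWithDiscount(10, 0.25), 7.5);
```

If the two expressions had merely looked alike but served different purposes, extracting them would couple unrelated code, which is why the checklist asks about intent.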
5. Macro Refactor (deferred to evanflow-iterate)
Cross-cutting refactors that span the whole feature (extracting a shared module across multiple cycles, pulling out a deeper abstraction, restructuring the file layout) belong in evanflow-iterate's self-review pass, after all per-cycle refactors are done. Don't conflate the two: per-cycle refactor uses a fresh test as safety net; macro refactor uses the whole test suite.
Per-Cycle Checklist
[ ] Test describes behavior, not implementation
[ ] Test uses public interface only
[ ] Test would survive an internal refactor (rename, restructure)
[ ] Code is minimal for this test
[ ] No speculative features added
[ ] Test fails for the right reason before code is written
[ ] ASSERTION IS CORRECT (see warning below)
⚠️ Assertion-Correctness Warning
Industry research (HumanEval evaluation across four LLMs) found that over 62% of LLM-generated test assertions were incorrect. This is the single most likely failure mode in LLM-driven TDD: the test passes, but it's testing the wrong thing.
Before writing any test assertion, verify:
- Does this assertion match what the user actually wants? Don't assert on behavior you imagined — assert on behavior the spec/contract names.
- Is this the assertion's most-precise form? "result is truthy" is weaker than "result equals 42". Loose assertions let wrong results pass.
- Would this assertion still pass if the code was subtly wrong? Mentally introduce a one-character bug — does the assertion catch it? If not, the assertion is too weak.
- Are you asserting on the right field? A common failure: asserting on response.status when the meaningful field is response.body.error.
- For computed values: did you compute the expected value correctly? Don't trust your own arithmetic; verify by hand or by another path.
When in doubt about what to assert, STOP and ask the user rather than guess. An asserted-on-the-wrong-thing test is worse than no test — it provides false confidence.
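The "one-character bug" check can be tried mechanically. In this sketch, totalCents and its buggy twin are hypothetical functions, and the two assertion styles are wrapped as predicates so their strength can be compared.

```typescript
import { strict as assert } from "node:assert";

// Hypothetical function under test: total price of `quantity` items, in cents.
function totalCents(unitCents: number, quantity: number): number {
  return unitCents * quantity;
}

// The same function with a one-character bug (`+` instead of `*`).
function totalCentsBuggy(unitCents: number, quantity: number): number {
  return unitCents + quantity;
}

// Weak assertion: "result is truthy".
const weak = (fn: typeof totalCents): boolean => Boolean(fn(250, 3));

// Precise assertion: the expected value was checked by a second path
// (250 + 250 + 250 = 750) rather than by re-running the same formula.
const precise = (fn: typeof totalCents): boolean => fn(250, 3) === 750;

assert.equal(weak(totalCents), true);
assert.equal(weak(totalCentsBuggy), true); // the bug slips through
assert.equal(precise(totalCents), true);
assert.equal(precise(totalCentsBuggy), false); // the bug is caught
```

The weak form passes for both versions, so it gives no evidence that the code is right; only the precise form distinguishes them.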
Hard Rules
- Vertical slices only. Never write all tests first.
- REFACTOR is per-cycle, not deferred. Every GREEN is followed by a refactor pass on the just-written code, with the fresh test as safety net. Deferring all refactor to the end strips the safety net and is the most common way TDD-shaped code ends up with TDD-shaped scars.
- Test behavior, not internals. If a rename breaks a test but behavior didn't change, the test was wrong.
- Watch the test fail. If you didn't see RED, you don't know it tests the right thing.
- Never auto-commit. TDD cycle is RED-GREEN-REFACTOR, not RED-GREEN-REFACTOR-COMMIT.
- Default to real services for integration tests. Mocked databases routinely diverge from production behavior — prefer a test DB unless your project documents a specific exception.
Hand-offs
- Tests + impl complete for the task → return to evanflow-executing-plans to mark task done
- Discovered the interface is wrong → evanflow-design-interface to redesign
- Discovered deeper architectural issue → evanflow-improve-architecture