review-testing

Compare original and translation side by side

🇺🇸

Original

English

🇨🇳

Translation

Chinese

You are reviewing test code written alongside a feature implementation or bug fix. Ensure the tests are well-designed, thorough, and maintainable — not just that they pass. Tests that merely mirror implementation details create false confidence and become a maintenance burden during refactoring.

你正在评审与功能开发或Bug修复同步编写的测试代码。需确保测试代码设计精良、覆盖全面且易于维护——而不仅仅是能通过测试。那些仅复刻实现细节的测试会带来虚假的安全感，并在重构时成为维护负担。

Review Scope

评审范围

Identify what to review:

Find changed test files on the current branch (relative to the base branch).
Find the production code those tests cover — trace imports, function calls, and file naming conventions to map tests to their targets.
Find related existing tests for the same modules or functions that weren't changed — these may need updates or reveal gaps.

Read test files first, before production code. If you can infer the feature's requirements and edge cases from the tests alone, that's a sign the tests are well-written. If you need to read the implementation to understand what the tests are doing, that's a finding worth reporting.

明确评审内容：

找出当前分支（相对基准分支）中变更的测试文件
找出这些测试覆盖的生产代码——通过追踪导入关系、函数调用和文件命名规范，将测试映射到对应的目标代码
找出同一模块或函数未变更的相关现有测试——这些测试可能需要更新，或能暴露测试缺口

先阅读测试文件，再看生产代码。如果仅通过测试就能推断出功能需求和边缘场景，说明测试代码编写良好。如果需要阅读实现代码才能理解测试的作用，这是需要报告的问题。

What Makes a Test Valuable

测试代码的价值标准

Weigh each test against four qualities:

Regression protection — Does this test actually catch bugs? A test that exercises trivial code or skips complex branches protects against nothing. Check: does the test touch business-critical logic, or only verify the happy path of a simple getter?

Refactoring resilience — Will this test break when someone restructures code without changing behavior? Tests coupled to internal method names, call sequences, or private state punish every cleanup with false failures, eroding trust in the suite.

Fast feedback — Unit tests should run in milliseconds. If a test hits the filesystem, network, or database unnecessarily, that's a design issue. But don't confuse speed with value — an integration test verifying a real database query is better than a fast unit test that mocks everything and verifies nothing.

Maintainability — Can someone unfamiliar with this code read the test and understand what it verifies and why? Tests with sprawling setup, cryptic names, or deeply nested mocking fail this check.

从四个维度衡量每一项测试：

回归防护能力——该测试能否真正发现Bug？仅测试琐碎代码或跳过复杂分支的测试毫无防护作用。检查：测试是否涉及业务核心逻辑，还是仅验证简单 getter 的正常流程？

重构韧性——当有人在不改变行为的前提下重构代码时，该测试是否会失效？与内部方法名、调用顺序或私有状态耦合的测试，会在每次代码清理时产生错误失败，削弱对测试套件的信任。

快速反馈性——单元测试应在毫秒级运行。如果测试不必要地访问文件系统、网络或数据库，这是设计问题。但不要混淆速度与价值——验证真实数据库查询的集成测试，比完全Mock所有依赖、却未验证任何实际逻辑的快速单元测试更有价值。

可维护性——不熟悉代码的人能否读懂测试，理解它验证的内容及原因？包含冗长初始化、晦涩命名或深度嵌套Mock的测试不符合这一标准。

Review Areas

评审重点

1. Assertion Completeness

1. 断言完整性

The most common weakness in generated tests: asserting only the obvious output and missing the full "blast radius" of a state change.

When a test triggers an action, ask what else changed. If a test adds an item to a cart, does it only check the item count? Or does it also verify the price calculation, subtotal update, and that other items are unaffected?

Flag when:

A test asserts a return value but ignores side effects
A test checks that an operation succeeded but not that it produced the right result
A test verifies the happy path but skips boundary conditions (empty inputs, nulls, maximum values, off-by-one)
Error-path tests only check that an error was thrown, not the error message, type, or cleanup behavior

生成测试最常见的缺陷：仅断言明显输出，忽略状态变更的完整“影响范围”。

当测试触发某个操作时，要思考还有哪些内容发生了变化。如果测试向购物车添加商品，是否仅检查商品数量？还是同时验证价格计算、小计更新以及其他商品未受影响？

需标记的情况：

测试断言返回值，但忽略副作用
测试仅检查操作成功，未验证产生的结果是否正确
测试仅验证正常流程，跳过边界条件（空输入、null、最大值、差一错误）
错误路径测试仅检查是否抛出错误，未验证错误信息、类型或清理行为

2. Test Structure

2. 测试结构

Each test should have exactly one Arrange-Act-Assert cycle. If a test acts and asserts multiple times in sequence, it's testing multiple behaviors and should be split.

Flag when:

A test has multiple act phases (testing a workflow, not a behavior)
The arrange phase is so large the actual behavior being tested is buried
Assertions appear inside setup helpers or utility functions (hides what's being verified)
Test logic contains conditionals or loops — tests should be straight-line code

每个测试应恰好包含一个Arrange-Act-Assert周期。如果一个测试连续执行多次操作和断言，说明它在测试多个行为，应拆分。

需标记的情况：

测试包含多个操作阶段（测试工作流而非单一行为）
初始化阶段过于庞大，导致被测试的实际行为被掩盖
断言出现在初始化工具或辅助函数中（隐藏了被验证的内容）
测试逻辑包含条件判断或循环——测试应是直线型代码

3. Fixture and State Management

3. 测试数据与状态管理

How test data is created determines whether the suite is maintainable at scale. Inline setup (data created in the test body) is fine for simple tests but painful when constructor signatures change across many tests. Implicit setup (shared

beforeEach

setUp

blocks) eliminates duplication but obscures what each test actually depends on. Delegated setup (factory functions or builders called explicitly) keeps tests readable while centralizing construction logic.

Flag when:

A shared setup block creates objects most tests don't use
Tests depend on external files, database state, or globals defined elsewhere ("Mystery Guest" — everything a test needs should be visible in its body or one function call away)
Tests assume resources exist without creating them (files, database records, environment variables)
Tests mutate shared state without cleanup, causing order-dependent failures

测试数据的创建方式决定了测试套件在大规模场景下的可维护性。内联初始化（在测试体内创建数据）适用于简单测试，但当构造函数签名在多个测试中变更时会非常麻烦。隐式初始化（共享

beforeEach

setUp

块）消除了重复，但模糊了每个测试实际依赖的内容。委托式初始化（显式调用工厂函数或构建器）在集中管理构造逻辑的同时，保持了测试的可读性。

需标记的情况：

共享初始化块创建了大多数测试不需要的对象
测试依赖外部文件、数据库状态或其他地方定义的全局变量（“神秘访客”——测试所需的所有内容应在其体内可见，或通过一次函数调用获取）
测试假设资源已存在但未创建（文件、数据库记录、环境变量）
测试修改共享状态但未清理，导致依赖执行顺序的失败

4. Mocking Boundaries

4. Mock边界

Mocks are essential for isolation, but overuse turns tests into mirrors of the implementation.

The key principle: mock at architectural boundaries, not at every function call. Use stubs (canned data) for query-type dependencies and mocks (interaction verification) only for commands with side effects like sending emails or writing to external systems.

Respect the codebase's existing mocking convention. Flag inconsistency within the project, not deviation from a universal rule.

Flag when:

A test mocks types it doesn't own (third-party libraries, framework internals) — when the library updates, the mock passes against outdated assumptions. Prefer: thin adapters, real implementations in integration tests, or official testing utilities (MemoryRouter, test databases, in-memory caches).
Mock density is high — 4+ mock configurations suggests too many responsibilities in the production code
A test verifies call counts or argument sequences on internal methods (couples the test to how the code works, not what it does)
Stubs are being verified (checking a stub was called tests implementation, not behavior)

Mock对于隔离测试至关重要，但过度使用会让测试沦为实现细节的镜像。

核心原则： 在架构边界处使用Mock，而非每个函数调用都Mock。对查询类依赖使用Stub（固定数据），仅对有副作用的命令（如发送邮件、写入外部系统）使用Mock（交互验证）。

遵循代码库现有的Mock约定。标记项目内部的不一致性，而非偏离通用规则的情况。

需标记的情况：

测试Mock了不属于自身的类型（第三方库、框架内部）——当库更新时，Mock会基于过时的假设通过测试。更优选择：轻量适配器、集成测试中使用真实实现，或官方测试工具（MemoryRouter、测试数据库、内存缓存）
Mock密度过高——4个及以上Mock配置表明生产代码职责过多
测试验证内部方法的调用次数或参数顺序（将测试与代码的实现方式耦合，而非功能本身）
对Stub进行验证（检查Stub是否被调用是测试实现，而非行为）

5. Test Smells

5. 测试坏味道

Smell	What It Looks Like	Why It Matters
Assertion Roulette	Multiple assertions with no failure messages	Can't tell which assertion broke without debugging
Eager Test	One test exercises several unrelated methods	Failures are ambiguous — which behavior broke?
Lazy Test	Multiple tests call the same method with identical inputs	Redundant maintenance cost, no coverage gain
Sleepy Test	Hard-coded `sleep()` / `Sys.sleep()` / `setTimeout()`	Flaky in CI, slow everywhere. Use polling or explicit waits.
Rotten Green	Assertions inside `try` / `tryCatch` or conditional branches	Test always passes because the assertion is never reached
Sensitive Equality	Asserting against `toString()` / `print()` output	Breaks on formatting changes; assert structural properties instead
Print Statement	`print()` / `console.log()` instead of assertions	Debugging leftovers that verify nothing
Snapshot Abuse	Snapshots as a substitute for behavioral assertions	Any change triggers failure, developers blindly update. Good uses: one snapshot of an HTML component's structure (not every prop combination), error message text, CLI output. Bad: snapshotting entire objects or rendering every variant.
Implementation Mirror	Expected values computed using the same logic as production code	Test and production code will always agree — even when both are wrong. Hardcode expected values from a known-good source.

坏味道	表现形式	影响
断言轮盘	多个断言未添加失败信息	无法直接判断哪个断言失败，需调试
贪婪测试	一个测试执行多个无关方法	失败原因模糊——无法确定是哪个行为出了问题
冗余测试	多个测试用相同输入调用同一方法	增加维护成本，未提升测试覆盖
休眠测试	硬编码 `sleep()` / `Sys.sleep()` / `setTimeout()`	在CI环境中不稳定，且整体运行缓慢。应使用轮询或显式等待
虚假通过	断言位于 `try` / `tryCatch` 或条件分支内	测试总是通过，因为断言从未被执行
敏感相等性	断言 `toString()` / `print()` 输出	格式变更会导致测试失败；应断言结构属性
打印语句残留	使用 `print()` / `console.log()` 而非断言	调试残留代码，未验证任何内容
快照滥用	用快照替代行为断言	任何变更都会触发失败，开发者会盲目更新。合理场景：HTML组件结构的单一快照（而非所有属性组合）、错误信息文本、CLI输出。不合理场景：快照整个对象或所有渲染变体
实现镜像	使用与生产代码相同的逻辑计算预期值	测试与生产代码始终一致——即使两者都出错。应从已知正确的来源硬编码预期值

6. Naming and Readability

6. 命名与可读性

Test names should describe behavior, not implementation. A well-named test suite reads like a feature specification.

Flag when:

Test names reference internal method names (e.g.,
```
test_processData_returns_true
```
) — break on rename
Test names are generic (
```
test1
```
,
```
test_it_works
```
,
```
test_basic
```
)
The name doesn't communicate the scenario or expected outcome

Prefer behavioral names:

test_expired_subscription_blocks_access

delivery_with_past_date_is_invalid

empty_cart_shows_zero_total

测试名称应描述行为，而非实现。命名良好的测试套件读起来就像功能规格说明书。

需标记的情况：

测试名称引用内部方法名（如
```
test_processData_returns_true
```
）——重命名时会失效
测试名称过于通用（
```
test1
```
、
```
test_it_works
```
、
```
test_basic
```
）
名称未传达测试场景或预期结果

推荐使用行为式命名：

test_expired_subscription_blocks_access

、

delivery_with_past_date_is_invalid

、

empty_cart_shows_zero_total

。

7. Coverage Gaps

7. 覆盖缺口

Cross-reference production code changes against the test suite:

Are there branches or conditions no test exercises?
Are error paths tested? (Not just "does it throw" but "does it throw the right thing and clean up properly")
Are edge cases covered? (Empty collections, null/NA/None/undefined, boundary values, concurrent access if applicable)
If production code changed existing behavior, were existing tests updated?

Walk through the implementation and note every decision point — each

if

match

switch

, error handler, or early return. Check whether the test suite exercises both sides of that decision.

When reviewing R tests using

testthat

, check if the

testing-r-packages

skill is available and invoke it for R-specific conventions and patterns.

将生产代码变更与测试套件进行关联检查：

是否存在未被任何测试覆盖的分支或条件？
错误路径是否被测试？（不仅是“是否抛出错误”，还要验证“是否抛出正确的错误并完成清理”）
边缘场景是否被覆盖？（空集合、null/NA/None/undefined、边界值、并发访问<如适用>）
如果生产代码变更了现有行为，现有测试是否已更新？

遍历实现代码，记录每个决策点——每个

if

、

match

、

switch

、错误处理器或提前返回。检查测试套件是否覆盖了每个决策的正反两面。

当评审使用

testthat

的R语言测试时，检查是否有

testing-r-packages

技能可用，并调用该技能获取R语言特定的约定和模式。

Response Format

响应格式

undefined

undefined

Summary

总结

[Overall assessment: How well do these tests protect the codebase?]

[整体评估：这些测试对代码库的防护效果如何？]

Critical Issues (Blocking)

关键问题（阻塞性）

[Tests that provide false confidence or will cause real problems.]

[会带来虚假安全感或引发实际问题的测试]

Required Changes

必要变更

[Design problems that weaken the test suite.]

[削弱测试套件的设计问题]

Strong Suggestions

强烈建议

[Improvements to test quality and maintainability.]

[提升测试质量与可维护性的改进方案]

Noted

备注

[Minor style or convention issues. Mention once, then move on.]

[轻微的风格或规范问题。只需提及一次即可]

Verdict

结论

Request Changes | Needs Discussion | Approve

请求变更 | 需要讨论 | 批准

Next Steps

下一步

[Options for proceeding]


Use `file:line` references for every finding. Quote the specific test code that demonstrates the issue and show what better code looks like.

[后续行动选项]


对每个问题使用`file:line`格式引用。展示存在问题的具体测试代码，并给出优化后的代码示例。

Next Steps

后续步骤

At the end of the review, offer the user these options:

Discuss and address findings: Use the AskUserQuestion tool to walk through the issues. Group by severity or topic, offer resolution options, and mark the recommended choice.

Fix the issues: Offer to apply fixes directly in priority order — blocking issues first, then required changes, then suggestions. Confirm before continuing after each group.

Add to a pull request: When reviewing in context of a PR, offer to post the review as a PR comment. Include attribution: "Review assisted by the review-testing skill."

If operating as a subagent, skip the next steps and output only the review findings.

评审结束后，为用户提供以下选项：

讨论并解决问题： 使用AskUserQuestion工具逐一梳理问题。按严重程度或主题分组，提供解决方案选项，并标记推荐方案。

修复问题： 按优先级直接修复问题——先处理阻塞性问题，再处理必要变更，最后处理建议项。完成每组修复后需确认再继续。

添加到Pull Request： 在PR场景下评审时，可将评审结果作为PR评论发布。需注明来源：“评审由review-testing skill协助完成。”

如果作为子Agent运行，跳过后续步骤，仅输出评审结果。