testing-boss
Comprehensive testing doctrine for software and AI systems — covers positive patterns, anti-patterns, gates for coding agents writing tests, CI discipline, and an LLM/agent evaluation primer. Use when authoring or reviewing tests, adding mocks, deciding test placement, generating tests via agents, debugging flaky CI, designing eval suites for LLM features, or rebuilding a brittle test suite. Contains 12 positive patterns (selector hierarchy, table-driven, builders, real-system gates), 25 anti-patterns across Brittleness, Flakiness, Mock-misuse, Process, and AI-specific families, 7 mandatory gates for agents writing tests, flaky-test taxonomy with quarantine workflow, contract / property / mutation testing patterns, and an oracle-ladder primer for LLM-as-judge and agent eval. Language-agnostic — pseudo-code only. Don't use for general code review, library-specific debugging unrelated to tests, non-testing CI pipeline design, or production observability.
NPX Install
npx skill4agent add pedronauck/skills testing-boss
Testing Boss
Iron Laws
1. Test the behavior, never the mock.
2. Push every test to the lowest layer that can detect the failure.
3. When a test fails, fix production first — change the test only after writing why.
4. Real systems gate the merge. Mocks isolate; they do not validate.
5. Coverage is a flashlight. Mutation score is a quality probe. Neither is a target.
6. No test-only methods, branches, or flags leak into production code.
Required Reading Router
| Task | MUST read |
|---|---|
| Deciding where a new test belongs (layer, file, owner) | `references/foundations.md` |
| Writing a new test (any layer, any framework) | `references/patterns.md` |
| Reviewing a test, smelling a problem, or fixing a brittle suite | `references/antipatterns.md` |
| Generating tests via a coding agent (Claude Code, Codex, Cursor) | `references/ai-writes-tests.md` |
| Triaging flaky tests, designing CI gates, or picking contract/property/mutation patterns | `references/ci-automation.md` |
| Designing evals for LLM/agent systems (RAG, tool use, prompt regression) | `references/llm-eval.md` |
| Looking up the original source for any claim in this skill | `references/sources.md` |
Reference Index
- `references/foundations.md` — placement doctrine (invariant + owning layer), pyramid vs trophy debate resolution, risk-based prioritization, coverage philosophy, test-boundary contracts.
- `references/patterns.md` — 12 cross-framework positive patterns with agnostic pseudo-code: selector hierarchy, condition-based waits, per-test isolation, table-driven, builders/factories, behavior-first assertions, boundary-only mocking.
- `references/antipatterns.md` — 25 anti-patterns across five families (Brittleness, Flakiness, Mock misuse, Process, AI-specific). Each entry: violation, why wrong, fix, gate question, evidence URL.
- `references/ai-writes-tests.md` — seven mandatory gates with verbatim prompt blocks for any agent that generates tests: invariant first, owning layer, real execution, failure→fix production, no snapshot without contract, no assertion on self-set mock, negative companion.
- `references/ci-automation.md` — flaky-test taxonomy, quarantine-plus-owner workflow, CI stage pyramid, contract / property / mutation / accessibility testing patterns, deterministic test architecture.
- `references/llm-eval.md` — eval-driven development primer, oracle ladder, LLM-as-judge biases and calibration, RAG metrics, agent trajectory vs outcome eval, benchmark pitfalls.
- `references/sources.md` — consolidated bibliography (all URLs grouped by axis) for citation and audit.
Decide before the first line of test code
- Name the invariant in one sentence before opening any test file. If the sentence is fuzzy, the invariant is not clear enough to test.
- Place the test at the lowest layer that can fail when the invariant breaks. A higher-layer test is justified only when the invariant requires real integration the lower layer cannot prove.
- Reject the test entirely when (likelihood-of-bug × blast-radius) is below the threshold worth ten minutes of maintenance. Not every line deserves a test.
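The placement decision above can be sketched as a small scoring helper. This is only an illustration: the `Candidate` type, the 0..1 scales, and the `RISK_THRESHOLD` value are assumptions, since the doctrine names the product (likelihood-of-bug × blast-radius) but gives no numbers.

```python
from dataclasses import dataclass

# Hypothetical cut-off. The doctrine only says "below the threshold worth
# ten minutes of maintenance"; pick a value that fits your own team.
RISK_THRESHOLD = 0.05


@dataclass(frozen=True)
class Candidate:
    invariant: str       # the one-sentence invariant, written before any test code
    likelihood: float    # estimated chance the invariant breaks, 0..1 (assumed scale)
    blast_radius: float  # estimated damage if it breaks, 0..1 (assumed scale)


def worth_testing(candidate: Candidate, threshold: float = RISK_THRESHOLD) -> bool:
    """Return True when likelihood x blast radius clears the threshold."""
    if not candidate.invariant.strip():
        # No named invariant means there is nothing to score yet.
        raise ValueError("name the invariant before scoring the test")
    return candidate.likelihood * candidate.blast_radius >= threshold


refund = Candidate("A refund never exceeds the captured amount", 0.3, 0.9)
footer = Candidate("The footer shows the current year", 0.1, 0.1)
```

Under these assumed numbers, `worth_testing(refund)` is true and `worth_testing(footer)` is false, so only the refund invariant earns a test.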
See `references/foundations.md`.
Pattern catalog (write tests that survive refactor)
- Query by behavior and accessible role, never by CSS selector or DOM index.
- Selector hierarchy: role → label → text → test-id → structural. Stop at the highest rung that disambiguates.
- Wait on observable conditions, never on wall-clock sleeps.
- Each test is independent and order-free; setup beats teardown.
- One behavior per test, but as many assertions as that behavior needs.
- Test names read as specifications: `should <outcome> when <condition> given <state>`.
- Table-driven / parameterized when only the inputs vary.
- Build test data via factories or builders; literal blobs only for the field under test.
- Mock at boundaries you do not control; real wiring for what you own.
- Real systems gate the final merge; contract tests bridge unit and E2E.
- Mutation score, not coverage percentage, measures suite strength.
- Page Object Model is a tool, not a religion — collapse it for small suites.
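Three of the patterns above (builders, table-driven cases, and specification-style names) combine naturally. The sketch below is one possible shape, not the catalog's own example; `Order`, `an_order`, and the `shipping_cents` rule are hypothetical names invented for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Order:
    # Valid defaults, so each case spells out only the field under test.
    subtotal_cents: int = 10_000
    country: str = "US"


def an_order(**overrides) -> Order:
    """Builder: start from a valid default order and override selected fields."""
    return replace(Order(), **overrides)


def shipping_cents(order: Order) -> int:
    """Hypothetical rule under test: US orders of 50.00 or more ship free."""
    if order.country == "US" and order.subtotal_cents >= 5_000:
        return 0
    return 799


# Table-driven: only the inputs and the expected outcome vary per row.
CASES = [
    ("should ship free when subtotal meets the threshold given a US order",
     {"subtotal_cents": 5_000}, 0),
    ("should charge shipping when subtotal is below the threshold given a US order",
     {"subtotal_cents": 4_999}, 799),
    ("should charge shipping when the destination is abroad given a large subtotal",
     {"country": "CA"}, 799),
]


def run_cases() -> list:
    """Run every row and return the failure messages (empty when all pass)."""
    failures = []
    for name, overrides, expected in CASES:
        actual = shipping_cents(an_order(**overrides))
        if actual != expected:
            failures.append(f"{name}: expected {expected}, got {actual}")
    return failures
```

In a real suite the same table would usually feed the framework's own parameterization feature; the plain loop here keeps the sketch free of dependencies.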
See `references/patterns.md`.
Anti-pattern families (do not do these)
- Brittle/implementation-detail selectors.
- Testing internal structure instead of observable behavior.
- Testing private methods directly.
- Snapshot-as-test (a snapshot replacing real assertions).
- Vague existence assertions (`.should('exist')`, `.toBeTruthy`).
- Action without assertion.
- Further entries involve `sleep`, `beforeAll`, `afterEach`, and `beforeEach`.
See `references/antipatterns.md`.
AI agents writing tests
- Invariant first — agent prints `INVARIANT: …`, `OWNING_LAYER: …`, `EXISTING_SUITE: …` before any test code.
- Owning layer — extend an existing suite; reject new files without a named invariant.
- Real execution — every new test must run against a real DB / route / external integration at least once.
- Failure → fix production — on red, the next tool call reads the production code, not the test.
- No snapshot without contract — classify the artifact as `PRODUCT_CONTRACT` or `IMPLEMENTATION_DETAIL`; the latter forbids snapshots.
- No assertion on self-set mock — cannot assert on a value the same test body wrote into a mock.
- Negative companion — every positive assertion ships with a negative test for invalid input or failure mode.
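Two of the gates above, "no assertion on self-set mock" and "negative companion", are easier to recognize side by side. The following is a minimal sketch using Python's standard `unittest.mock`; `checkout`, `PaymentDeclined`, and the gateway's response shape are hypothetical and stand in for production code that talks to a boundary you do not control.

```python
from unittest.mock import Mock


class PaymentDeclined(Exception):
    """Hypothetical domain error raised by the code under test."""


def checkout(gateway, amount_cents: int) -> str:
    """Hypothetical production code: charge through an external gateway."""
    if amount_cents <= 0:
        raise ValueError("amount must be positive")
    result = gateway.charge(amount_cents)
    if result["status"] != "approved":
        raise PaymentDeclined(result["status"])
    return f"receipt-{result['id']}"


def test_self_set_mock_antipattern():
    # Violates the gate: the assertion reads back the value this test wrote
    # into the mock, so it stays green even if checkout() is deleted.
    gateway = Mock()
    gateway.charge.return_value = {"status": "approved", "id": 7}
    assert gateway.charge(100)["status"] == "approved"


def test_should_return_receipt_when_charge_is_approved():
    # The mock only isolates the boundary; the assertion targets behavior.
    gateway = Mock()
    gateway.charge.return_value = {"status": "approved", "id": 7}
    assert checkout(gateway, 100) == "receipt-7"
    gateway.charge.assert_called_once_with(100)


def test_should_raise_when_charge_is_declined():
    # Negative companion: the failure mode of the same invariant.
    gateway = Mock()
    gateway.charge.return_value = {"status": "declined", "id": 8}
    try:
        checkout(gateway, 100)
    except PaymentDeclined:
        return
    raise AssertionError("expected PaymentDeclined")
```

The first test is included only as the violation to reject in review. Under the "real execution" gate, the two behavior tests would still need at least one run against a real integration.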
See `references/ai-writes-tests.md`.
CI & flaky discipline
- Quarantine a flaky test the same hour it is detected. Assign a named human owner within 24 hours with a fix-by date. No anonymous quarantines.
- Track `flaky_rate` as a first-class operational metric. SLO < 1–2%; alert at > 5%. Retry without telemetry is debt accrual, not stability.
- Real systems at the final gate. Mock at unit; contract-test the boundary; real DB / queue / route at integration; near-zero mocks at E2E.
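One way to turn `flaky_rate` into something a CI job can act on is sketched below. The definition used here is an assumption, since the text does not define the metric: a test counts as flaky on a commit when it both passed and failed on that same commit. `RunRecord` and `classify` are hypothetical names; the two thresholds come from the SLO and alert figures above.

```python
from dataclasses import dataclass

SLO_MAX = 0.02      # upper end of the "SLO < 1-2%" band
ALERT_ABOVE = 0.05  # "alert at > 5%"


@dataclass(frozen=True)
class RunRecord:
    test_id: str
    commit: str
    passed: bool


def flaky_rate(runs) -> float:
    """Share of (test, commit) pairs that both passed and failed.

    Assumed definition: the same code producing two different outcomes.
    """
    outcomes = {}
    for run in runs:
        outcomes.setdefault((run.test_id, run.commit), set()).add(run.passed)
    if not outcomes:
        return 0.0
    flaky = sum(1 for seen in outcomes.values() if len(seen) == 2)
    return flaky / len(outcomes)


def classify(rate: float) -> str:
    """Map a rate onto the SLO and alert thresholds."""
    if rate > ALERT_ABOVE:
        return "alert"
    if rate > SLO_MAX:
        return "slo_breach"
    return "ok"


runs = [
    RunRecord("login", "c1", True),
    RunRecord("login", "c1", False),  # same commit, different outcome: flaky
    RunRecord("cart", "c1", True),
    RunRecord("search", "c1", True),
    RunRecord("export", "c1", False),  # consistently red: broken, not flaky
]
```

Here one of four (test, commit) pairs is flaky, so `flaky_rate(runs)` is 0.25 and `classify` returns `"alert"`. Recording every retry as its own `RunRecord` is what keeps retries from hiding the signal.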
See `references/ci-automation.md`.
LLM and agent evaluation (Part 6, lean)
- Start small. Twenty unambiguous tasks drawn from real failures beat two hundred synthetic ones.
- Climb the oracle ladder: exact / schema / outcome-state checks before LLM-as-judge before human review. Use the cheapest oracle that catches the failure.
- LLM-as-judge needs calibration. Validate against humans (target ≥ 0.80 Spearman) before trusting any judge in CI. Always use a different model than the system under test.
- Agents need outcome checks. Trajectory grading punishes valid creativity; grading the transcript alone misses ghost actions, where the transcript claims success and nothing changed.
See `references/llm-eval.md`.
Red flags (cross-cutting)
- Mock setup is larger than the test logic.
- Test breaks when an internal method is renamed (not a public contract).
- Removing the assertion body leaves the test still green.
- Test fails when run with `.only` in isolation.
- `sleep`, `Thread.sleep`, or `cy.wait(<number>)` appears anywhere.
- Selector contains a CSS class, an index, or `xpath`.
- Test asserts a third-party site is reachable.
- Snapshot diffs are accepted in code review without reading them.
- Coverage percentage is the only quoted quality metric.
- Failing tests are auto-retried until green; nobody investigates.
- Skipped or quarantined tests have no named owner and no fix-by date.
- Test depends on `new Date()`, `Math.random()`, or system locale.
- `afterEach` resets the database (move it to `beforeEach`).
- An AI-written test has six+ assertions and zero edge cases.
- The phrase "I'll mock this to be safe" appears anywhere in the diff.
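Two of the red flags above, fixed sleeps and hidden dependence on the system clock, share one remedy: wait on an observable condition and inject the clock. The helper below is a sketch of that idea. `wait_until`, `ConditionTimeout`, and `FakeClock` are hypothetical names, not part of any framework.

```python
import time


class ConditionTimeout(AssertionError):
    """Raised when the condition never became true before the deadline."""


def wait_until(condition, timeout=5.0, interval=0.05,
               clock=time.monotonic, sleep=time.sleep):
    """Poll `condition` until it returns a truthy value, then return that value.

    `clock` and `sleep` are injectable so tests can drive time deterministically.
    """
    deadline = clock() + timeout
    while True:
        result = condition()
        if result:
            return result
        if clock() >= deadline:
            raise ConditionTimeout(f"condition not met within {timeout}s")
        sleep(interval)


class FakeClock:
    """Deterministic clock for tests: time advances only when sleep() is called."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds
```

The short `sleep(interval)` inside the loop is bounded polling toward a condition, not a fixed wall-clock wait: the helper returns as soon as the condition holds and fails loudly at the deadline. Passing `clock=fake.time, sleep=fake.sleep` lets a test of the helper itself run instantly and without flakiness.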
When NOT to use this skill
- General code review unrelated to tests — use a code-review skill.
- Library-specific debugging where the test is just the reproduction — use the library's own debugging skill.
- Non-testing CI pipeline design (deploys, artifact promotion, secrets management).
- Production observability and alerting design — those are runtime concerns, not test concerns.
- Single-line typo fixes in existing tests — the doctrine is for non-trivial work.
Bottom line
A test that cannot fail is decorative. A test that fails for the wrong
reason is misleading. Build tests that fail for exactly one reason —
the reason the invariant was violated — and trust them when they do.
Mocks isolate. Real systems validate. Coverage shines a light. Mutation
score grades the suite. Agents will reach for the mock and the snapshot;
the gates here make them put both down.
Tests exist to reveal bugs, not just to pass.