Code ultrareview
<!-- canonical:label-hygiene:start -->
Critical — Label hygiene
Internal planning labels are author coordinates, not reader coordinates. Strip them from every shipped artifact this skill emits — code, comments, commit subjects/bodies, PR titles/descriptions, release notes, doc paragraphs, non-trivial comments.
- Workstream and task labels — , , , issue or ticket numbers, plan phase names from the source spec, issue body, or planning artifact. Translate to the domain noun (
Runs the battery script (WS-2)
→ ). <!-- noqa: internal-label -->
- Process language — "the rebuild", "the prior ", "carried verbatim from", "the cleanup pass", "the audit", "spec AC" standalone. Replace with the concrete fact (
carries the routing from the prior aggregation
→ routes via the merge keys in the synthesis module
). <!-- noqa: internal-label -->
- Plan-internal references — "as the brief says", "per the workstream", "from the forge artifact". Drop the reference; state the fact directly.
Carve-outs — literal
is legitimate where the skill IS the format authority (forge templates, apex rule documentation). Reviewer-facing dev docs (e.g.
under
) may reference deleted artifacts by their author-time names.
<!-- canonical:label-hygiene:end -->
Eight-axis judgment code review. Five-phase pipeline scope → tool battery → 8 parallel axis reviewers → Haiku validators → synthesis. Always runs at full strength. Distinct from Anthropic's remote
— same goal, in-session on the user's subscription.
<!-- canonical:writing-rules:start -->
Important — Writing rules
These rules govern every prose artifact this skill emits — READMEs, CHANGELOGs, commit messages, PR bodies, release notes, doc paragraphs, non-trivial comments. Apply them at draft time, verify before output.
- Match the surrounding style — punctuation, capitalization, backtick conventions, em-dash vs parens, bullet style.
- Every sentence changes the reader's understanding. Cut it otherwise.
- Front-load the verb — "Creates", not "This helps you create".
- Concrete over abstract. Lists for ≥3 enumerable items.
- Assert positively. Reserve negation for real constraints ().
- No marketing words: powerful, robust, seamlessly, leverage, unlock, comprehensive, delightful.
- No AI tells: delve, tapestry, intricate, pivotal, testament, underscore, crucial, garner, showcase, additionally, moreover, furthermore, indeed.
- After drafting English prose, invoke if installed.
<!-- canonical:writing-rules:end -->
Objective
Run the 8 axes — Correctness, Simplification, Tests, Documentation, Style, Intent, Design/API, Performance — as 8 parallel LLM subagents fed by deterministic tool findings from
. Coherence joins as a 9th axis when metadata files change. Sub-80 axis findings get re-scored by Haiku validators against the verbatim rubric in
references/anthropic-verbatim.md
. Findings synthesize into one report with deterministic dedup, inter-axis precedence, A2 no-silent-drop, and a verdict (Ship / Fix-then-ship / Needs work). The report ends with "What I did NOT check" so the coverage limits are explicit.
Parameters
| Flag | Behavior |
|---|
| Save the report + JSONL to ~/.claude/output/{project}/code-ultrareview/code-ultrareview-{slug}.{md,jsonl}
|
| Force no-save (overrides any ambient save mode) |
| Override the review base (skip auto-detection via ) |
| Override the scope classifier. Values: , , , , , , , , . Persistent per-repo override at (); the flag wins on conflict. Invalid value exits 2 |
| Activate the Intent-axis derivation sub-mode. may be , , an explicit path or directory, , gh:issue:<owner>/<repo>#<N>
, or a GitHub issue URL. Findings classify as GAP / SCOPE-ADD / DECISION-OVERRIDE / CONSISTENT |
| Run build verification on sub-80 axis findings BEFORE Haiku validators (Phase 3.5). Builds + runs the test command detected by ; confirmed findings get promoted (+30 confidence) and skip the validator phase |
| Run Stryker (JS/TS), Pitest (JVM), or mutmut (Python) on changed files only. Surviving mutants route to the Tests axis as 🟠 Medium |
| Opt-in writers: auto-apply low-risk fixes (manifest version sync, structured-field description sync with full-agreement guard, one failing test per confirmed bug). Diff preview + per-file confirmation before any write |
| Coherence axis compares README freeform paragraphs as well (default: structured fields only) |
| Comma-separated subset of axes to run (e.g. ). Default: all 8 + Coherence when triggered |
| List detected tools per repo_kind + print install commands for missing ones. Informational only, no install |
Lowercase enables, uppercase disables. No
— this skill is a producer, not a consumer.
bash
/code-ultrareview # full 8-axis review, print report
/code-ultrareview -s # save the report + JSONL for /apex -f
/code-ultrareview -b origin/main # review HEAD against an explicit base
/code-ultrareview --verify-build # promote sub-80 findings via real build verification
/code-ultrareview --reconcile @auto # add Intent derivation sub-mode with auto-detect
/code-ultrareview --apply-safe # full review + gated low-risk fixes
/code-ultrareview --preflight # list tools the battery would run, no review
/code-ultrareview --axes correctness,tests # subset of axes
The five phases
Phase 1 — Scope
Runs
. Deterministic, no LLM. Outputs
:
- Diff resolution — clean tree → ladder; dirty tree → + every untracked file inlined as added lines.
- Repo-kind classification — 8 kinds ( / / / / / / / ) + . Override via or .
- CLAUDE.md chain — root + nested in changed directories + + . Ordered root-to-deepest. Read by axis reviewers and validators.
- Coherence activation — any of ,
.claude-plugin/marketplace.json
, , , root , , , , in the diff → scope.json["activates_coherence"] = true
.
- Languages detection — from changed-file extensions; drives Phase 2 dispatch.
The output also feeds the report header lines
,
,
.
Phase 2 — Tool battery
Runs
. Deterministic CLIs feed
tagged by axis with
. Tools dispatch per
: npx-wrapped (
/
/
/
) and uvx-wrapped (
/
/
/
) tools wrap zero-install; native binaries (
/
/ Go
/
) fall back to PATH. Bundled
carries the universal N+1 and sync-I/O semgrep rules. Per-tool axis routing lives in
scripts/battery_ingest.py
. Full Tool → Axis → Install table in
.
Graceful skip. Missing tools emit
WARN: <tool> not found — install: <command>
to stderr and append to
scope.json["tools_skipped"]
; the skill continues. The battery NEVER auto-installs — no
,
,
,
, or
install runs.
Phase 2 extension — . dispatches Stryker (JS/TS), mutmut (Python), or pitest-maven (JVM) scoped to changed files only. Surviving mutants route to the Tests axis as 🟠 Medium with
(skips Phase 4 validators). Runtime can exceed 10 minutes per language; default 600 s timeout overridable via
. Graceful skip on missing tool or config. Details:
references/ultra-execution.md
.
Phase 3 — Axis review
The orchestrator prepares 8 per-axis bundles (+ Coherence when active) via
scripts/axis_dispatch.py prepare
, then launches every bundle as a parallel
in one message. Each subagent reads its axis brief, the rubric in
references/anthropic-verbatim.md
, the diff, and its filtered tool findings. Each emits canonical-schema JSONL on stdout. Subagents cannot spawn other subagents — the main thread launches both axis reviewers AND validators.
The 8 always-on axes:
Correctness ·
Simplification ·
Tests ·
Documentation ·
Style ·
Intent ·
Design/API ·
Performance. Each maps to
references/axes/<name>.md
for scope + repo-kind branches. Coherence is the conditional 9th — added when
scope.json["activates_coherence"]
is true; when inactive, the header surfaces
so the absence is visible. Full axis map, inter-axis precedence, and orchestration details (prepare CLI, bundle schema, no-silent-failure contract):
references/axes-overview.md
+
references/orchestration.md
.
Phase 4 — Validation
The orchestrator prepares per-finding validator bundles via
scripts/run_validators.py prepare
, then launches one Haiku
per finding in the same message — batched ≤10 parallel. Each validator receives the finding + diff context + the deepest matching CLAUDE.md snippet + the verbatim rubric, re-scores 0-100, and re-checks the cited CLAUDE.md rule actually exists in
(demotes with
CLAUDE.md rule not found at <path>
if not).
Confidence threshold = 80 (
scripts/synthesis_core.py:CONFIDENCE_THRESHOLD
). Tool-battery findings (confidence 100) skip the validator phase — they are deterministic. Validators stay read-only — no Write / Edit / Bash, no nested subagent spawn.
Typical runtime. 5-15 sub-80 findings → one batch → ~30-60s. 25+ findings spread over 2-3 batches stay under ~2 min. Latency is dominated by Haiku launch overhead, not inference.
A2 contract. No sub-80 finding silently dropped. Each one is promoted to ≥80, demoted with reason, or surfaced in
with the validator's reason text.
Validator prepare CLI, bundle schema, ingest pass details:
references/orchestration.md
.
Phase 3.5 — . Build verification runs BEFORE validators via
scripts/run_build_verify.py
(composing
+
synthesis_core.iterate_unverified
— +30 confidence, cap 95, floor 80). Sub-80 findings on
/
/
/
get promoted past the validator phase when the build fails. Other axes pass through unchanged. Details:
references/orchestration.md
.
Phase 5 — Synthesis
Runs
on top of
scripts/synthesis_core.py
primitives:
- Dedup by .
- Inter-axis precedence — when 2+ axes flag the same with the same finding wording, highest severity wins; ties resolve via
Correctness > Design/API > Simplification > Tests > Documentation > Style > Intent > Performance > Coherence
(scripts/synthesis_core.py:AXIS_PRIORITY
). Distinct findings at coincident lines (a Correctness null-deref and a Tests missing-assert on the same line) survive as separate entries.
- A2 routing — sub-80 stays in Unverified with the validator's reason.
- Verdict — / / (
scripts/synthesis_core.py:compute_verdict
).
- Report emission — markdown to terminal +
~/.claude/output/{project}/code-ultrareview/code-ultrareview-{slug}.md
. JSONL alongside with Conventional Comments labels ( / / / ).
The closing
"What I did NOT check" section is mandatory and always present, even when nothing was skipped — it lists security (defers to
), runtime performance / benchmarks (explicit non-goal), flaky test detection (explicit non-goal), and any tools from
scope.json["tools_skipped"]
.
Final report layout
templates/code-ultrareview.md
is the canonical wire format — every
section renders verbatim in template order with its emoji prefix; no rename, merge, reorder, or improvise.
Terminal echo is mandatory — the full canonical report prints to the chat-terminal on every invocation;
is purely additive (writes the same bytes to
~/.claude/output/{project}/code-ultrareview/code-ultrareview-{slug}.md
, byte-for-byte identical to terminal output). Severity marker mapping (🔴 High blocks ship · 🟠 Medium fix-soon · 🟢 Low nit · ⚠️ Unverified sub-80) lives in
scripts/synthesis_core.py:SEVERITY_MARKERS
.
Trust model
The skill ingests third-party content — CLAUDE.md files, PR bodies, planning artifacts (
), GitHub issue bodies — which can carry indirect prompt-injection. Axis reviewers and validators are read-only (no Write / Edit / Bash mutation). User review of the report is the trust boundary before any
write;
itself gates writes behind diff preview + per-file confirmation.
Rules
- Only new findings. Issues the diff introduces. Pre-existing findings carry the tier for context, never flip the verdict.
- No silent drop (A2). Sub-80 findings surface in with rationale
Sub-80 confidence ({score}) — verify locally before action.
- Fail loud. A phase that cannot run (unresolvable base, missing tool with no skip path, dependency failure) appears in the header or as a finding. Never silent.
- Cite precisely. Every finding carries ; CLAUDE.md findings quote the violated rule verbatim; permalinks use
https://github.com/<owner>/<repo>/blob/<full-sha>/<path>#L<n>-L<m>
(full SHA via ).
- Full report in chat every time. The complete report prints to the terminal on every invocation. writes the same bytes to disk; it never gates or summarises chat output.
- NEVER auto-install tools. Missing tools surface install commands in the report and
scope.json["tools_skipped"]
. The user installs them.
- NEVER modify code without . Default is read-only review. writers are surgical and per-file confirmed.
Deferrals
The closing "What I did NOT check" section always names these — explicit user-facing calibration of coverage:
- Security → or Anthropic's
claude-code-security-review
. Security is a distinct concern with its own deeper review pattern.
- Runtime performance / benchmarks → not covered. The Performance axis catches static patterns (N+1, sync I/O) but not runtime profiling.
- Flaky test detection → not covered. The Tests axis catches structural smells, not flake.
- Tools from
scope.json["tools_skipped"]
→ listed explicitly so the user sees what they sacrificed by not installing the native binaries.
Graceful degradation
- No CLAUDE.md / no — Style axis runs without baseline; the report says
Style axis: skipped — no rules baseline found
.
- No / no — every wrappable tool skips; only PATH binaries run.
- Missing native binary (, , , Go tools) — emits to stderr +
scope.json["tools_skipped"]
. The relevant axis loses its tool input but still runs LLM judgment.
- Unresolvable base — fail loud with the resolver's hint line. Do not guess.
- Unknown repo_kind — axes run with their branch (no specialization).
- Coherence inactive — when no metadata files change, the 9th axis simply does not launch. The report header says so the absence is visible.
Composition
Bridge to the fix pass after the report ships:
/apex -f ~/.claude/output/{project}/code-ultrareview/code-ultrareview-{slug}.md
— structured fix pass (requires ; pass the absolute path the report prints).
- — single-finding quick fix (takes a description, not a file).
Opt-in flag composition
The four opt-in flags layer orthogonally on the always-on pipeline: mutation tests join Phase 2 tool findings;
adds Phase 3.5;
enriches the Intent axis in Phase 3;
writers run post-synthesis with diff preview + per-file confirmation. Without a flag, its feature is off. Full composition matrix and per-flag details:
references/ultra-execution.md
.
What this skill is NOT
- Not a security audit. Defers to . The closing section makes this explicit on every report.
- Not a linter or formatter. The deterministic tool battery (npx/uvx-wrapped CLIs) handles linting and dead-code detection. The skill layers LLM judgment on top of those signals.
- Not Anthropic's remote . Distinct surface — this skill runs in-session on the user's subscription; runs in a remote sandbox and bills per run.
- Not a fix tool. Report-only by default. covers three surgical writers; everything else routes to or .
References
- Reviewer primitives —
references/anthropic-verbatim.md
(rubric + HIGH SIGNAL + false-positive taxonomy), references/axes-overview.md
(8 axes + Coherence + inter-axis precedence), references/axes/<name>.md
(per-axis briefs), references/orchestration.md
(Phase 3 + 4 + 3.5 prepare CLIs and bundle schemas).
- Opt-in flags —
references/ultra-execution.md
covers , , , in full.
- Scripts — (Phase 1), + (Phase 2), (Phase 3), (Phase 4), + + (Phase 5). Opt-in: , , ,
apply_safe/{version_sync,description_sync,failing_test_writer}.py
.
Gotchas
- Sub-80 findings can be dropped instead of surfaced in . The A2 contract (
scripts/synthesis_core.py:apply_a2
) is no-silent-drop. The model sometimes treats a sub-80 score as a rejection signal and omits the finding entirely. Fix: scan the section explicitly on every report; compare finding count to axis output to catch drops.
- First-run / downloads add latency. Cold start adds ~5s per tool the first time the battery touches it; subsequent runs are fast (cached at / ). The README install table documents this so users don't fear repeated downloads.
- Coherence activates silently on metadata changes. A single touch triggers the 9th subagent automatically. Watch for the line in the report header — it tells you the axis ran without you asking.
- skips silently on malformed planning artifacts. A forge or apex file with broken YAML frontmatter (unclosed , tab indentation, unquoted colons) is dropped from the auto-detect list. Verify with
head -20 ~/.claude/output/{project}/forge/forge-*.md
before relying on .