# Code Review Orchestrator

## READ-ONLY CONSTRAINT

**THIS SKILL IS STRICTLY READ-ONLY WITH RESPECT TO THE CODEBASE BEING REVIEWED.** NEVER modify, write, edit, or delete any file in the user's source tree. NEVER run commands with side effects on the reviewed project (no package installs, no builds, no git commits, no database mutations). If a fix is needed, generate a copy-pasteable fix prompt that the user can run separately.

One documented exception — the report artifact: after an audit finishes, the orchestrator writes a single markdown report to `./codeprobe-reports/<project>-<cmd>-<timestamp>.md` in the current working directory (e.g. `./codeprobe-reports/growth-engine-audit-2026-04-23-221047.md`). This is the skill's own output, not a modification of reviewed code. No other writes are permitted anywhere.

Violations of this constraint are NEVER acceptable, regardless of user request.
## 1. Command Routing

Parse the user's input to extract a subcommand and a target path. The input format is:

`/codeprobe [subcommand] [path]`

### Routing Table

| Command | Behavior | Sub-skills Invoked |
|---|---|---|
| `/codeprobe audit <path>` | Full audit — visual health dashboard (category scores, codebase stats, hot spots) followed by detailed P0-P3 findings with fix prompts | All available sub-skills |
| `/codeprobe solid <path>` | SOLID principles analysis only | `codeprobe-solid` |
| `/codeprobe security <path>` | Security audit only | `codeprobe-security` |
| `/codeprobe smells <path>` | Code smells detection only | `codeprobe-smells` |
| `/codeprobe architecture <path>` | Architecture analysis only | `codeprobe-architecture` |
| `/codeprobe patterns <path>` | Design patterns analysis only | `codeprobe-patterns` |
| `/codeprobe performance <path>` | Performance audit only | `codeprobe-performance` |
| `/codeprobe errors <path>` | Error handling audit only | `codeprobe-errors` |
| `/codeprobe tests <path>` | Test quality audit only | `codeprobe-testing` |
| `/codeprobe framework <path>` | Framework best practices only | `codeprobe-framework` |
| `/codeprobe quick <path>` | Top 5 issues — run all sub-skills in scan mode, then generate full detail for the top 5 | All available |
| `/codeprobe diff <path>` | PR-style review on changed files vs a base branch | All relevant (Phase 3) |
| `/codeprobe report` | Generate report from the last audit | (Phase 3 — TBD) |
### Default Behaviors

- No subcommand given: ask the user what they want. Present the available commands.
- No path given: use the current working directory.
- Phase 3 stubs: if the user invokes `diff` or `report`, respond: "This feature is coming in Phase 3. Available now: audit, solid, security, smells, architecture, patterns, performance, errors, tests, framework, quick."
## 2. Stack Auto-Detection

Before routing to any sub-skill, detect the technology stack at the target path. This determines which reference guides to load and pass to sub-skills.

### Detection Procedure

- Use Glob to scan file extensions at the target path (recursive, reasonable depth).
- Apply the following detection rules — multiple stacks can match simultaneously:

| Signal | Stack Detected | Reference to Load |
|---|---|---|
| `.php` files | PHP / Laravel | `references/php-laravel.md` |
| `.js`, `.ts`, `.jsx`, `.tsx` files | JavaScript / TypeScript | `references/javascript-typescript.md` |
| `.py` files | Python | — |
| `.jsx`, `.tsx` files + a React/Next.js config present | React / Next.js | `references/react-nextjs.md` |
| `.sql` files or a migrations directory | SQL / Database | `references/sql-database.md` |
| An API routes directory or API route patterns | API Design | — |

- For each detected stack, attempt to load the corresponding reference file using Read. If the file does not exist yet (Phase 2+), skip silently.
- Collect all loaded references into a context bundle to pass to sub-skills.
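The extension-to-reference mapping in the table can be sketched as follows. Only the rows with a confirmed reference file are included; the rule list and function name are illustrative assumptions:

```python
# Minimal sketch of the detection table: scan a tree's file extensions and
# return the reference paths for every stack whose signal matches.
from pathlib import Path

DETECTION_RULES = [
    ({".php"}, "references/php-laravel.md"),
    ({".js", ".ts", ".jsx", ".tsx"}, "references/javascript-typescript.md"),
    ({".sql"}, "references/sql-database.md"),
]

def detect_stacks(root: str) -> list[str]:
    """Return reference paths for every stack whose signal extensions appear."""
    seen = {p.suffix for p in Path(root).rglob("*") if p.is_file()}
    # Multiple stacks can match simultaneously, so collect every hit.
    return [ref for exts, ref in DETECTION_RULES if exts & seen]
```

In the real flow, Glob plays the role of `rglob`, and missing reference files are skipped at load time rather than here.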
### Reference Loading

References are loaded from the `references/` directory within this skill's own directory. Resolve the path relative to this SKILL.md file's location, NOT the user's project. Use Read with `references/{reference-file}.md` (this resolves to the `references/` folder next to this SKILL.md file).

If a reference file does not exist, continue without it. Never fail the review because a reference is missing.
## 3. Config Loading

Check for a CodeProbe config file in the project root (the target path or its ancestor directories).

### Config Schema

```json
{
  "severity_overrides": {
    "long_method_loc": 50,
    "large_class_loc": 500,
    "deep_nesting_max": 4,
    "max_constructor_deps": 6
  },
  "skip_categories": ["codeprobe-testing"],
  "skip_rules": ["SPEC-GEN-001"],
  "framework": "laravel",
  "extra_references": [],
  "report_format": "markdown"
}
```
### Config Behavior

- If absent: all defaults apply. No error.
- `severity_overrides`: pass to sub-skills so they adjust thresholds accordingly.
- `skip_categories`: do not invoke the listed sub-skills, even in `audit` or `quick` mode.
- `skip_rules`: pass to sub-skills so they suppress findings with matching IDs.
- `framework`: if set, skip auto-detection for that framework and force-load the corresponding reference. Other auto-detection still proceeds.
- `extra_references`: additional reference file paths to load and pass to sub-skills.
- `report_format`: output format preference (default: `markdown`).
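The resolution order above (target path or its ancestors, defaults when absent) can be sketched as a small loader. The filename `.codeprobe.json` is an assumption for illustration; the schema keys mirror the Config Schema section:

```python
# Hedged sketch of config resolution: walk from the target path upward
# looking for the config file, then overlay its keys on the defaults.
import json
from pathlib import Path

DEFAULTS = {
    "severity_overrides": {},
    "skip_categories": [],
    "skip_rules": [],
    "framework": None,
    "extra_references": [],
    "report_format": "markdown",
}

def load_config(target: str, filename: str = ".codeprobe.json") -> dict:
    """Search target and its ancestors for the config; merge over defaults."""
    start = Path(target).resolve()
    for directory in [start, *start.parents]:
        candidate = directory / filename
        if candidate.is_file():
            return {**DEFAULTS, **json.loads(candidate.read_text())}
    return dict(DEFAULTS)  # absent: all defaults apply, no error
```

A shallow merge suffices here because each top-level key is replaced wholesale; only `severity_overrides` would need a deeper merge if partial overrides were desired.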
## 4. Sub-Skill Execution

### Pre-Loading Phase (runs once before any sub-skill)

Before invoking any sub-skill, the orchestrator MUST pre-load all shared context:

- Read the shared preamble file in this skill's directory. It contains the output contract, execution modes, and constraints shared by all sub-skills.
- Read all source files at the target path:
  - Use Glob to find all source files (common source extensions plus config files such as package manifests and framework configs).
  - Read each file using Read.
  - Size cap: if the codebase has more than 50 source files or total LOC exceeds 10,000 lines, do NOT pre-load all files. Instead, pass only the file listing (paths + line counts) and let sub-agents read the files they need. Note this in the agent prompt: "Large codebase — file listing provided, use Read for files you need to inspect."
  - Store all file contents as a `{filepath: content}` map.
- Read all applicable reference files (already loaded during stack detection in Section 2). Store the content.
### Invocation Protocol

For each sub-skill to run, spawn an Agent with a prompt that includes:

- The shared preamble — output contract, modes, constraints.
- The sub-skill name to invoke (e.g., `codeprobe-security`).
- The mode — `full` or `scan`.
- Pre-loaded source files — the full content of every source file, formatted as:

```
=== FILE: {filepath} ===
{content}
=== END FILE ===
```

- Pre-loaded references — the content of all applicable reference files.
- Config overrides — severity overrides and skip rules from the config file.
- Target path — so the sub-skill knows the project root for any targeted lookups.
- Sub-skill-specific pre-loaded script output (when applicable):
  - For the architecture sub-skill: before spawning the agent, run `python3 scripts/dependency_mapper.py <target_path>` via Bash and capture the JSON output. Pass it as an additional context block labeled `=== DEPENDENCY_GRAPH === ... === END DEPENDENCY_GRAPH ===`. The sub-skill uses this as the ground truth for cycle detection. If Python 3 is unavailable or the script fails, omit the block — the sub-skill falls back to LLM-based import tracing.
  - If `scripts/complexity_scorer.py` output is available (optional), pass it as `=== COMPLEXITY_SCORES === ... === END COMPLEXITY_SCORES ===`.

The sub-skill's own SKILL.md contains only its domain-specific detection logic. All shared context (output format, modes, source code, references, pre-computed script data) comes from the orchestrator's prompt.

Collect the findings returned by each sub-skill in the standard output contract format (Section 5).
### Execution Modes

| Mode | Used By | Behavior |
|---|---|---|
| `full` | `audit`, `solid`, `security`, etc. | Run complete analysis, return all findings |
| `scan` | `quick` | Count violations, identify top issues, return only counts + top 5 candidates |

### Execution Order

- `audit`: run all nine sub-skills sequentially, all in `full` mode. Before invoking the architecture sub-skill, pre-compute the dependency graph via `scripts/dependency_mapper.py` and pass the JSON to the sub-skill as described in the Invocation Protocol. Collect all findings. Apply deduplication (Section 7A). Derive category scores from severity counts. Compute hot spots by aggregating findings per file and ranking by distinct-categories-flagged. Also run the codebase-stats script (skip gracefully if Python 3 is unavailable).
- `quick`: run all nine sub-skills in `scan` mode. Collect candidate issues from all of them. Rank by severity (critical > major > minor > suggestion), then select the top 5. Re-run the relevant sub-skills in `full` mode for just those 5 findings to get complete detail.
### Available Sub-Skills

- `codeprobe-security` — Security vulnerability detection
- `codeprobe-errors` — Error handling & resilience
- `codeprobe-solid` — SOLID principles analysis
- `codeprobe-architecture` — Architecture analysis
- `codeprobe-patterns` — Design patterns advisor
- `codeprobe-performance` — Performance & scalability
- `codeprobe-smells` — Code smell detection
- `codeprobe-testing` — Test quality & coverage
- `codeprobe-framework` — Framework-specific best practices
## 5. Output Contract

Every finding from every sub-skill MUST include these fields:

| Field | Required | Description |
|---|---|---|
| `id` | Yes | Unique identifier in `{PREFIX}-{NNN}` format (e.g., `SRP-001`, `SEC-007`) |
| `severity` | Yes | One of: `critical`, `major`, `minor`, `suggestion` |
| `location` | Yes | File path + line range (e.g., `src/UserService.php:45-67`) |
| `problem` | Yes | One sentence describing the issue |
| `evidence` | Yes | Concrete proof from the code — quote the relevant lines |
| `suggestion` | Yes | What to do to fix it |
| `fix_prompt` | Yes | A copy-pasteable prompt the user can give to Claude Code to apply the fix |
| `example` | No | Optional code snippet showing the improved version |
### Finding Format Example

```markdown
### SRP-001 | Major | `src/UserService.php:45-67`

**Problem:** UserService violates Single Responsibility — it handles authentication, email sending, and database queries in one class.

**Evidence:**
> Lines 45-50: `public function authenticate($credentials) { ... }`
> Lines 52-60: `public function sendWelcomeEmail($user) { ... }`
> Lines 62-67: `public function findByUsername($name) { ... }`

**Suggestion:** Extract email logic into a dedicated `UserMailer` service and database queries into a `UserRepository`.

**Fix prompt:**
> Refactor `src/UserService.php` to follow the Single Responsibility Principle: extract `sendWelcomeEmail()` into a new `UserMailer` class and `findByUsername()` into a `UserRepository` class. Keep `authenticate()` in `UserService` and inject the new dependencies.
```
## 6. Severity Levels

| Level | Priority | Meaning | Examples |
|---|---|---|---|
| Critical | P0 | Confirmed bugs, exploitable security vulnerabilities, or data loss/corruption risks that would cause harm in production | SQL injection with user input, missing auth on a data-mutating endpoint, race condition causing data corruption, unhandled crash on a core path, missing DB transaction on multi-step writes |
| Major | P1 | Significant maintainability, reliability, or scalability problem that increases risk but is not an immediate production defect | Missing tests for critical business logic, large classes, code duplication, missing error handling on external calls, N+1 queries, missing input validation |
| Minor | P2 | Code smell, low risk, worth addressing for long-term health | Magic numbers, deep nesting, poor naming, missing edge-case tests, verbose error details |
| Suggestion | P3 | Improvement idea, nice to have, no real risk if ignored | Pattern opportunities, style improvements, speculative generality |
### Severity Guardrails
The following are NEVER Critical — classify as Major at most:
- Missing tests (even for critical business logic)
- Code duplication or large classes/files
- Code smells of any kind
- Framework convention violations
- Missing documentation, comments, or type annotations
Critical is reserved exclusively for:
- Confirmed bugs (code that produces wrong results or crashes)
- Exploitable security vulnerabilities (injection, auth bypass, IDOR with proof)
- Data loss or corruption risks (missing transactions, race conditions on writes)
- Sensitive data exposure (secrets in code, credentials in logs)
Sub-skills: do NOT escalate findings beyond the severity specified in your detection table. If your table says "Major," report it as Major even if the specific instance seems severe. The orchestrator's scoring formula accounts for finding counts at each level.
## 7. Scoring

After collecting all findings, compute scores per category and an overall score.

### Category Score Formula
Each penalty component is capped to prevent a single severity level from dominating the score:

```
crit_penalty   = min(50, critical_count * 15)
major_penalty  = min(30, major_count * 6)
minor_penalty  = min(10, minor_count * 2)
category_score = max(0, 100 - crit_penalty - major_penalty - minor_penalty)
```

Suggestions do not affect the score.

Rationale: diminishing returns prevent a single severity level from flooring the score. A category with 4 criticals scores 50 (not 0), reflecting that problems exist but the code is not completely broken. The maximum total penalty from all three levels combined is 90, so even the worst category bottoms out at 10, and a low score requires severe findings across all severities.
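The formula above transcribes directly into code. This is a one-to-one rendering of the capped-penalty rules, nothing more:

```python
# Direct transcription of the category score formula, with per-level caps.
def category_score(critical: int, major: int, minor: int) -> int:
    crit_penalty = min(50, critical * 15)
    major_penalty = min(30, major * 6)
    minor_penalty = min(10, minor * 2)
    # Suggestions carry no penalty, so they are not a parameter here.
    return max(0, 100 - crit_penalty - major_penalty - minor_penalty)
```

For example, 1 critical, 2 majors, and 3 minors gives 100 - (15 + 12 + 6) = 67, and even a flood of findings at every level cannot push the score below 10 because the combined caps total 90.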
### Category Weights

| Category | Weight |
|---|---|
| Security | 20% |
| SOLID | 15% |
| Architecture | 15% |
| Error Handling | 12% |
| Performance | 12% |
| Test Quality | 10% |
| Code Smells | 8% |
| Design Patterns | 4% |
| Framework | 4% |
All 9 categories are active. Weights sum to 100%.
### Overall Score

```
overall = sum(category_score_i * weight_i for each active category)
```

If `skip_categories` in the config excludes some categories, normalize by dividing by the sum of active weights:

```
overall = sum(category_score_i * weight_i) / sum(weight_i)   # over active categories
```

Clamp the result to the range [0, 100].
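The weighted average with renormalization can be sketched as below. The dictionary keys are shorthand labels of my own choosing, not official category identifiers:

```python
# Weighted overall score with renormalization for skipped categories.
WEIGHTS = {
    "security": 0.20, "solid": 0.15, "architecture": 0.15,
    "errors": 0.12, "performance": 0.12, "tests": 0.10,
    "smells": 0.08, "patterns": 0.04, "framework": 0.04,
}

def overall_score(category_scores: dict[str, float]) -> float:
    """category_scores holds only ACTIVE categories (skipped ones omitted)."""
    active = {k: WEIGHTS[k] for k in category_scores}
    total_weight = sum(active.values())
    raw = sum(category_scores[k] * w for k, w in active.items()) / total_weight
    return max(0.0, min(100.0, raw))  # clamp to [0, 100]
```

When every category is active, `total_weight` is 1.0 and the division is a no-op; when some are skipped, dividing by the active-weight sum keeps the result on the same 0-100 scale.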
### Score Interpretation

| Range | Status |
|---|---|
| 80-100 | Healthy |
| 60-79 | Needs Attention |
| 0-59 | Critical |
## 7A. Cross-Category Deduplication

Before computing scores, deduplicate findings that flag the same issue from multiple categories.

### Deduplication Procedure

1. Group findings by location. Normalize each finding's location to `(file, start_line, end_line)`. Two findings overlap if they share the same file AND their line ranges overlap (i.e., `start_line_A <= end_line_B AND start_line_B <= end_line_A`).
2. For each group of overlapping findings from different categories:
   a. Select a primary finding. Use this priority order:
      - Security findings (SEC) take priority for anything involving auth, injection, or data exposure
      - Error Handling findings (ERR) take priority for exception/validation issues
      - Performance findings (PERF) take priority for query/caching issues
      - SOLID findings (SRP/OCP/LSP/ISP/DIP) take priority for structural violations
      - Architecture findings (ARCH) take priority for layer/boundary violations
      - If still ambiguous, the category with the higher weight (Section 7) wins
   b. Mark duplicates. For each non-primary finding in the group, append the note `[Duplicate of {primary_id} — counted there]` and change its severity to `suggestion` so it does not affect the score of its own category.
   c. Cross-reference the primary. Append to the primary finding: `Also flagged by: {list of duplicate category:id pairs}`
3. Recount severity totals per category after deduplication, then proceed to scoring.
### Examples

- "Refresh bypasses quota" found as SEC-007, ERR-011, and FW-001 at the same location: keep SEC-007, mark ERR-011 and FW-001 as duplicates (severity → suggestion).
- "God component" found as SRP-001, SMELL-001, and ARCH-005 in the same file: keep SRP-001 (SOLID priority for structural violations), mark the others as duplicates.
- The same SRP violation found as SRP-001 and SMELL-001: keep SRP-001, mark SMELL-001 as a duplicate.
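The overlap test and duplicate-marking steps can be sketched as follows. The finding dicts are simplified stand-ins for the full output contract, and the choice of which field carries the duplicate note is an assumption:

```python
# Sketch of the Section 7A overlap test and duplicate marking.
def overlaps(a: dict, b: dict) -> bool:
    """Same file AND line ranges intersect (inclusive interval overlap)."""
    return (a["file"] == b["file"]
            and a["start"] <= b["end"]
            and b["start"] <= a["end"])

def mark_duplicates(primary: dict, others: list[dict]) -> None:
    """Demote non-primary findings so they no longer affect scoring."""
    for f in others:
        f["problem"] += f" [Duplicate of {primary['id']} — counted there]"
        f["severity"] = "suggestion"  # suggestions carry no score penalty
```

The interval test is the standard "two ranges intersect iff each starts before the other ends"; demoting duplicates to suggestion-level (rather than deleting them) preserves the cross-reference trail in the report.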
## 8. Report Rendering

Render the final output based on the command used.

### `/codeprobe audit` — Full Audit Report

Use the template at `templates/full-audit-report.md` (loaded via Read) to assemble the markdown for the saved report file. The terminal output is a summary — the user MUST see a visual summary in the terminal, not only the "Report saved" line.

Terminal vs saved-file split (MANDATORY):

| Surface | What it shows | How it's produced |
|---|---|---|
| Terminal | Dashboard + executive summary + critical findings (full detail) + prioritized fix order (top 5) + "Report saved" line | Streamed markdown sections in the assistant response |
| Saved file (`./codeprobe-reports/<project>-<cmd>-<ts>.md`) | Everything: dashboard, exec summary, all critical/major findings, minor/suggestion counts, full prioritized fix order | Plain markdown written via the Write tool |

The terminal must never be empty or reduced to just a save confirmation. If Claude only emitted the save line in a past run, that was a bug in how these instructions were followed — fix it by executing the flow below in order.
### Step-by-step flow

A. Compute data (shared by both surfaces)

- Run sub-skills per Section 4, collect findings, deduplicate per Section 7A.
- Derive category scores, overall score, hot spots, and codebase stats.
- Assemble an in-memory "report bundle": `{dashboard_data, exec_summary, critical[], major[], minor_counts[], suggestion_counts[], fix_order[]}`.
B. Render to terminal (user-facing — emit these directly as markdown in the assistant response, in order)

Do NOT pipe the dashboard through the Bash tool. The Bash tool collapses its output by default in Claude Code's UI (`+N lines (ctrl+r to expand)`), which hides the dashboard behind a fold. Emit the dashboard as ordinary assistant-response markdown so the user sees it immediately. Modern terminals (Ghostty, iTerm2, etc.) apply their own markdown styling — bold, inline-code highlighting, emphasis — which is the only coloring available on this path. If no coloring shows, plain text is fine; do not try to force ANSI escapes into the response.

(`scripts/render_dashboard.py` exists for users running the audit directly from their own shell outside Claude Code, where its stdout is a real TTY. Do NOT invoke it from within the `/codeprobe audit` flow.)
- Dashboard (markdown) — emit the dashboard block inline. Include: the title line (`## Code Health Report — {project}`), `**Overall Health:** {score}/100 [{status_label}]`, the 9-row Category Scores markdown table with columns `Category | Score | Bar | Status` (bar wrapped in backticks, 20 Unicode characters, proportional to score), the codebase stats block (files, LOC, backend/frontend split, largest file, test ratio, comment ratio), and a hot-spots list (up to 3 entries). Status labels are plain text in the Status column — no emoji, no brackets inside the table.
- Executive Summary — 2-3 sentences covering the most important findings.
- Critical findings — full detail — for each critical finding: ID, location, problem, evidence, suggestion, fix prompt. This is the highest-signal section; always show it in the terminal.
- Prioritized Fix Order (top 5) — the first 5 entries from the full prioritized fix order. Reference the saved file for the complete list.
- Save confirmation line — `--> Report saved to ./codeprobe-reports/{project}-{cmd}-{YYYY-MM-DD-HHMMSS}.md` (no emoji; ASCII arrow). This is the LAST line of the terminal output.
Do NOT also stream the major-findings table, minor/suggestion counts, or the full fix order to the terminal — those would duplicate content that the saved file already carries and bloat the terminal output. The user can open the saved file for the complete picture.
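One way to render the 20-character proportional score bar from the dashboard table is sketched below. The exact glyphs are an assumption — any pair of filled/empty Unicode characters works:

```python
# Render a fixed-width Unicode bar proportional to a 0-100 score.
def score_bar(score: int, width: int = 20) -> str:
    filled = round(score / 100 * width)
    return "█" * filled + "░" * (width - filled)
```

Wrapped in backticks inside the markdown table (e.g. `` `██████████░░░░░░░░░░` `` for a score of 50), the bar survives any terminal's markdown styling without ANSI escapes.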
C. Write the saved-file markdown

- Build the full markdown using the `templates/full-audit-report.md` placeholders — this includes sections that are NOT streamed to the terminal (major findings table, minor/suggestion counts, full prioritized fix order).
- Derive the filename as `{project}-{cmd}-{YYYY-MM-DD-HHMMSS}.md`:
  - `{project}` — resolve the target path to an absolute path (use the current working directory if the user passed no path); take its basename; if that basename points to a file, strip the extension; slugify it (lowercase; replace any run of non-alphanumeric characters with a single `-`; trim leading/trailing `-`); fall back to a default slug if the result ends up empty.
  - `{cmd}` — the subcommand routed in Section 1 (`audit`, `solid`, `security`, `smells`, `architecture`, `patterns`, `performance`, `errors`, `tests`, `framework`, `quick`), lowercased.
  - `{YYYY-MM-DD-HHMMSS}` — current local time.
- Ensure `./codeprobe-reports/` exists (`mkdir -p ./codeprobe-reports` via Bash if missing).
- Write to `./codeprobe-reports/{project}-{cmd}-{YYYY-MM-DD-HHMMSS}.md` using the Write tool.
- If the write fails (read-only filesystem, permission denied, etc.), surface a short inline note in the terminal but do not re-emit the summary.
D. Terminal-output contract (hard requirements)
- The terminal MUST include (in this order): dashboard → exec summary → critical findings → top 5 fix order → "Report saved" line.
- If there are zero critical findings, show the dashboard, exec summary, a one-line "No critical issues found" note, top 5 fix order (will be major-only), and save confirmation — still non-empty.
- If the report save fails, still emit the full terminal summary — do not block the summary on the file write.
- If the template file is missing, render inline following the same section ordering; the save step (C) still applies.
Status thresholds (applied to overall health and each category score):
- 80-100 = "Healthy"
- 60-79 = "Needs Attention"
- 0-59 = "Critical"
Token budget guidance: for a codebase with ~100 findings, the tiered findings layout (major findings as a table, minor/suggestion as counts only — all in the saved file) targets ~8,000-12,000 tokens on disk (vs ~40,000 with full detail for every finding). The terminal summary is much tighter because only the dashboard, exec summary, critical findings, and top-5 fix order are streamed there. The user can drill into any one category for full detail with `/codeprobe security <path>`, `/codeprobe solid <path>`, etc.
### `/codeprobe quick` — Quick Review Summary

Use the template at `templates/quick-review-summary.md` (loaded via Read). If the template does not exist yet, render inline:

- Header: project name, "Quick Review — Top 5 Issues".
- Top 5 Findings: full detail for the 5 most impactful issues, each with a fix prompt.
- Summary Counts: total issues found by severity across all categories.
- Next Step: suggest running `/codeprobe audit` for the complete picture.
## 9. Claude.ai Degraded Mode

Detect whether filesystem access is available. If the user has pasted or uploaded code rather than providing a file path, or if the Read/Glob/Grep tools are unavailable:

- Switch to degraded mode: analyze only the in-context code provided.
- Execute sub-skills sequentially on the pasted code (no parallel agents).
- Skip `scripts/dependency_mapper.py`, `scripts/complexity_scorer.py`, and all other script-dependent steps (sub-skills fall back to LLM-only analysis — architecture loses deterministic cycle detection).
- Skip stack auto-detection, config loading, the Codebase Stats row of the audit dashboard, and the report-save step (no filesystem write). Still render scores, hot spots, and findings to the terminal.
- Inform the user: "Running in Claude.ai mode — some features like codebase statistics, diff review, and multi-file analysis are unavailable. Analyzing the provided code directly."
- Still produce findings in the standard output contract format.
- Still compute scores based on findings from the available sub-skills.
## 10. Phase 3 Stubs

When the user invokes a command that routes to an unbuilt feature, respond with:

Not yet available. This feature is coming in Phase 3. Currently available commands:

- `/codeprobe audit <path>` — Full code audit
- `/codeprobe solid <path>` — SOLID principles check
- `/codeprobe security <path>` — Security audit
- `/codeprobe smells <path>` — Code smells detection
- `/codeprobe architecture <path>` — Architecture analysis
- `/codeprobe patterns <path>` — Design patterns analysis
- `/codeprobe performance <path>` — Performance audit
- `/codeprobe errors <path>` — Error handling audit
- `/codeprobe tests <path>` — Test quality audit
- `/codeprobe framework <path>` — Framework best practices
- `/codeprobe quick <path>` — Top 5 issues
## 11. Execution Flow Summary

When `/codeprobe` is invoked, execute this sequence:

1. Parse command: extract the subcommand and target path from user input.
2. Validate command: check the routing table. If it routes to a Phase 3 stub, respond with the stub message.
3. Resolve target path: use the provided path or default to the current working directory.
4. Load config: check for the CodeProbe config file at the project root. Apply defaults if absent.
5. Auto-detect stack: scan the target path for technology signals. Load matching references.
6. Apply config overrides: if `framework` is set in the config, adjust detection. Apply `skip_categories` and `skip_rules`.
7. Execute sub-skills: route to the appropriate sub-skills based on command and mode.
8. Collect findings: aggregate all findings in the output contract format.
9. Deduplicate findings: apply the cross-category deduplication procedure (Section 7A). Adjust the severity of duplicates to `suggestion`. Recount severity totals per category.
10. Compute scores: calculate per-category and overall scores using the post-deduplication severity counts and the formulas in Section 7.
11. Render report: format output using the appropriate template or inline format. Use the tiered output format for `audit`.
12. Present to user: display the final report.

Remember: this entire process is READ-ONLY. At no point do we modify any user files.