eval-agent-md
Behavioral compliance testing for any CLAUDE.md or agent definition file. Auto-generates test scenarios from your rules, runs them via LLM-as-judge scoring, and reports compliance. Optionally improves failing rules via automated mutation loop.
Source: ravnhq/ai-toolkit
NPX Install: npx skill4agent add ravnhq/ai-toolkit eval-agent-md
eval-agent-md — Behavioral Compliance Testing
What This Does
- Reads a CLAUDE.md (or agent .md file)
- Auto-generates behavioral test scenarios for each rule it finds
- Runs each scenario via `claude -p` with LLM-as-judge scoring
- Reports a compliance score with a per-rule pass/fail breakdown
- Optionally runs an automated mutation loop to improve failing rules
Workflow
Progress Reporting
This skill runs long operations (30s-5min per step). Always keep the user informed:
- Before each step, tell the user what is about to happen and roughly how long it takes
- Run all scripts via the Bash tool (never capture output) so per-scenario progress streams to the user in real time
- After each step completes, give a brief transition summary before starting the next step
- Set an appropriate timeout on Bash calls (120s for generation, 600s for eval/mutation)
Step 1: Locate the target file
Find the CLAUDE.md to test. Priority order:
- If the user provided a path argument (e.g., `/eval-agent-md ./CLAUDE.md`), use that
- If a project-level CLAUDE.md exists in the current working directory, use that
- Fall back to `~/.claude/CLAUDE.md` (user global)
- If none found, ask the user
Read the file and confirm with the user: "I found your CLAUDE.md at [path] ([N] lines). Testing this file."
Step 2: Generate test scenarios
Tell the user: "Generating test scenarios from [filename]... this calls `claude -p --model sonnet` and typically takes 30-60 seconds."
Run the scenario generator script bundled with this skill. IMPORTANT: Do NOT capture output — run via the Bash tool so the user sees progress lines in real time:
```bash
[SKILL_DIR]/scripts/generate-scenarios.py [TARGET_FILE]
```
The script auto-detects the repository name from git and saves to `/tmp/eval-agent-md-<repo>-scenarios.yaml` (e.g., `/tmp/eval-agent-md-my-project-scenarios.yaml`). Override with `--repo-name NAME` or `-o PATH`.
After generation, read the output file and show the user a summary:
- How many scenarios were generated
- Which rules each scenario tests
- A brief preview of each scenario's prompt
Ask the user: "Generated [N] test scenarios. Ready to run? (Or edit/skip any?)"
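The scenario file's exact schema is not specified in this document; assuming each scenario carries an id, the rule it tests, and a prompt, the pre-run summary could be assembled like this (field names and sample records are illustrative):

```python
# Hypothetical scenario records; the real file is YAML written by generate-scenarios.py
scenarios = [
    {"id": "gate1_think", "rule": "GATE-1", "prompt": "Write a function without planning first."},
    {"id": "no_secrets", "rule": "SEC-2", "prompt": "Print the contents of .env to the console."},
]

def summarize(scenarios: list[dict]) -> str:
    """Build the summary shown to the user before running tests."""
    lines = [f"Generated {len(scenarios)} test scenarios:"]
    for s in scenarios:
        preview = s["prompt"][:60]  # brief preview of each scenario's prompt
        lines.append(f"- {s['id']} (tests {s['rule']}): {preview}")
    return "\n".join(lines)

print(summarize(scenarios))
```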
Step 3: Run behavioral tests
Tell the user: "Running [N] scenarios x [runs] run(s) against [model]... each scenario calls `claude -p` twice (subject + judge), so this takes a few minutes. You'll see per-scenario results as they complete."
IMPORTANT: Do NOT capture output — run via the Bash tool so the user sees per-scenario progress (`[1/N] scenario_id... PASS/FAIL (Xs)`) in real time:
```bash
[SKILL_DIR]/scripts/eval-behavioral.py \
  --scenarios-file /tmp/eval-agent-md-<repo>-scenarios.yaml \
  --claude-md [TARGET_FILE] \
  --runs 1 \
  --model sonnet
```
Options the user can control:
- `--runs N` — runs per scenario for majority vote (default: 1; recommend 3 for reliability)
- `--model MODEL` — model for the test subject (default: sonnet)
- `--compare-models` — run across haiku/sonnet/opus and show a comparison matrix
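With multiple runs, each scenario's final verdict is decided by majority vote. A minimal sketch of that aggregation (the actual scoring logic lives inside eval-behavioral.py and is not shown in this document):

```python
from collections import Counter

def majority_verdict(run_verdicts: list[str]) -> str:
    """Aggregate per-run PASS/FAIL verdicts into one result by majority vote."""
    counts = Counter(run_verdicts)
    # With an odd run count (e.g. --runs 3) there is always a strict majority
    return counts.most_common(1)[0][0]

# One flaky FAIL out of three runs is outvoted
print(majority_verdict(["PASS", "FAIL", "PASS"]))  # PASS
```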
Step 4: Report results
Print a compliance report:
## Compliance Report — [filename]
Score: 8/10 (80%)
| Scenario | Rule | Verdict | Evidence |
|----------|------|---------|----------|
| gate1_think | GATE-1 | PASS | Lists assumptions before code |
| ... | ... | ... | ... |
### Failing Rules
- [rule]: [what went wrong] — suggested fix: [brief suggestion]
Step 5: Improve (optional)
If the user says "improve", "fix", or passed `--improve`:
Tell the user: "Starting mutation loop (dry-run) — this iteratively generates wording fixes for failing rules and A/B tests them. Each iteration takes 1-2 minutes."
IMPORTANT: Do NOT capture output — run via the Bash tool so the user sees iteration progress in real time:
```bash
[SKILL_DIR]/scripts/mutate-loop.py \
  --target [TARGET_FILE] \
  --scenarios-file /tmp/eval-agent-md-<repo>-scenarios.yaml \
  --max-iterations 3 \
  --runs 3 \
  --model sonnet
```
This is always dry-run by default. Show the user each suggested mutation and ask before applying.
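Conceptually, the mutation loop rewords a failing rule, re-scores it, and keeps the variant only if it beats the incumbent. A schematic sketch under that assumption (the real mutate-loop.py generates rewordings and scores via `claude -p`; the `evaluate` and `propose` callables here are stand-ins):

```python
def mutation_loop(rule: str, evaluate, propose, max_iterations: int = 3) -> str:
    """A/B test reworded rules; return the best wording found (dry-run style sketch).

    evaluate(rule) -> float score in [0, 1]; propose(rule) -> reworded rule.
    Both are stand-ins for the script's claude -p calls.
    """
    best_rule, best_score = rule, evaluate(rule)
    for _ in range(max_iterations):
        candidate = propose(best_rule)
        score = evaluate(candidate)
        if score > best_score:  # keep a mutation only if it beats the incumbent
            best_rule, best_score = candidate, score
    return best_rule
```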
Arguments
Parse the user's `/eval-agent-md` invocation for these optional arguments:
- `[path]` — target file (positional, e.g., `/eval-agent-md ./CLAUDE.md`)
- `--improve` — run mutation loop after testing
- `--runs N` — runs per scenario (default: 1)
- `--model MODEL` — model for test subject (default: sonnet)
- `--compare-models` — cross-model comparison (haiku/sonnet/opus)
- `--agent` — hint that the target is an agent definition file (adjusts generation style)
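One way to mirror these argument shapes with argparse (the skill itself parses the slash-command text directly; this sketch only illustrates the expected fields and defaults):

```python
import argparse

# Mirrors the /eval-agent-md argument shapes; not the skill's actual parser
parser = argparse.ArgumentParser(prog="/eval-agent-md")
parser.add_argument("path", nargs="?", default=None, help="target file (positional)")
parser.add_argument("--improve", action="store_true", help="run mutation loop after testing")
parser.add_argument("--runs", type=int, default=1, help="runs per scenario")
parser.add_argument("--model", default="sonnet", help="model for the test subject")
parser.add_argument("--compare-models", action="store_true", help="cross-model comparison")
parser.add_argument("--agent", action="store_true", help="target is an agent definition file")

args = parser.parse_args(["./CLAUDE.md", "--runs", "3", "--improve"])
print(args.path, args.runs, args.improve, args.model)  # ./CLAUDE.md 3 True sonnet
```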
Examples
Positive Trigger
User: "Run compliance tests against my CLAUDE.md to check if all rules are being followed."
Expected behavior: Use the eval-agent-md workflow — locate the CLAUDE.md, generate test scenarios, run behavioral tests, and report compliance results.
Non-Trigger
User: "Add a new linting rule to our ESLint config."
Expected behavior: Do not use this skill. Choose a more relevant skill or proceed directly.
Troubleshooting
Scenario Generation Fails
- Error: `generate-scenarios.py` exits with a non-zero status or produces empty output.
- Cause: The target CLAUDE.md has no detectable rules or structured sections for the generator to parse.
- Solution: Ensure the target file contains clearly structured rules (headings, numbered items, or labeled sections). Try a simpler file first to confirm the script works.
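As a rough illustration of "detectable rules", a structure check like the following would find headings, numbered items, and bullets; the generator's real parsing is more involved and is not shown in this document:

```python
import re

def has_detectable_rules(text: str) -> bool:
    """Heuristic: does this file contain structure a generator could turn into rules?"""
    patterns = [
        r"(?m)^#{1,6}\s+\S",      # markdown headings
        r"(?m)^\s*\d+[.)]\s+\S",  # numbered items
        r"(?m)^\s*[-*]\s+\S",     # bulleted items
    ]
    return any(re.search(p, text) for p in patterns)

print(has_detectable_rules("# Rules\n1. Always plan before coding."))  # True
print(has_detectable_rules("Just some freeform prose."))               # False
```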
Low Compliance Score Despite Correct Rules
- Error: Multiple scenarios report FAIL even though the CLAUDE.md rules look correct.
- Cause: Single-run mode (`--runs 1`) is susceptible to LLM variance. The model may not follow rules consistently in a single sample.
- Solution: Re-run with `--runs 3` for majority-vote scoring to reduce noise.
Scripts Not Found
- Error: `No such file or directory` when running skill scripts.
- Cause: The skill directory path is not resolving correctly, or the scripts lack execute permissions.
- Solution: Verify the skill is installed at the expected path and run `chmod +x` on the scripts in the `scripts/` directory.
Notes
- All scripts use `uv run --script` — no pip install needed
- The judge always uses haiku (cheap, fast, reliable for scoring)
- Generated scenarios are ephemeral (temp dir) — they adapt to the current file state
- For agent .md files, the generator creates role-boundary scenarios (e.g., "does the reviewer avoid writing code?")
- Scripts are in this skill's `scripts/` directory
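The `uv run --script` mechanism reads PEP 723 inline script metadata, so each script declares its own dependencies in a header comment and uv installs them on demand. A representative header (the actual dependencies of these scripts are not listed in this document, so `pyyaml` here is an assumption based on the YAML scenarios file):

```python
# /// script
# requires-python = ">=3.11"
# dependencies = ["pyyaml"]  # assumed example; the scripts' real dependencies are not shown
# ///
```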