skill-eval
This skill should be used when the user wants to run baseline evaluations on existing agent skills, regenerate transcripts after a model upgrade, or check whether a skill still solves the gap it was authored for. Common triggers include "rerun the baselines", "re-eval skill X", "test all the skills", "check for skill drift", and "run the evals". Bakes in verbatim transcript capture (no paraphrasing), deterministic-only grading (regex / contains / file_exists — no LLM-as-judge), and the iteration-N workspace convention. Skip when authoring a new skill (use skill-creator) or modifying skill content directly.
Source: zrosenbauer/skills
NPX Install: npx skill4agent add zrosenbauer/skills skill-eval

SKILL.md Content

skill-eval
Re-run baseline evaluations on one or more skills. Uses the test definitions committed in each skill's `evals.json`, dispatches pressure scenarios via subagents, saves transcripts to a gitignored workspace, and grades the runs deterministically.

When to use
Verbatim trigger phrases:
- "rerun the baselines"
- "re-eval skill X"
- "test all the skills"
- "check for skill drift"
- "run the evals"
- "did skill X still pass"
When NOT to use
- Authoring a new skill — use `/skill-creator` instead
- Modifying skill body content — just edit the SKILL.md
- Running unit tests for `packages/skill-tools` itself (those are vitest, not skill evals)
Inputs
- `$ARGUMENTS` — one of:
  - `<skill-name>` — re-eval one skill (looks under `skills/` and `.agents/skills/`)
  - `--all` — re-eval every skill that has an `evals.json`
  - empty — same as `--all`
Workflow
1. Resolve target skills
If `$ARGUMENTS` is a skill name, look in `skills/<name>/` then `.agents/skills/<name>/`. Confirm `evals.json` exists. If not, abort with an error pointing the user at `/skill-creator` (or to author `evals.json` by hand).
If `--all` (or empty), find every directory with both `SKILL.md` and `evals.json` under those two roots.
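A minimal shell sketch of the `--all` resolution. The directory layout follows this skill's conventions, but the exact commands are illustrative, not part of the skill:

```bash
# List every skill directory eligible for re-eval:
# it must contain both SKILL.md and evals.json.
for dir in skills/*/ .agents/skills/*/; do
  if [ -f "${dir}SKILL.md" ] && [ -f "${dir}evals.json" ]; then
    echo "$dir"
  fi
done
```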
2. Determine the next iteration number
For each target skill, look at its nested workspace at `skills/<skill-name>/.workspace/` (or `.agents/skills/<skill-name>/.workspace/`, depending on the skill's location). If it doesn't exist, the next iteration is `1`. Otherwise scan `iteration-N/` directories and use `max(N) + 1`.
Create `skills/<skill-name>/.workspace/iteration-<N>/` (the `.workspace/` pattern is gitignored).
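A hedged sketch of the iteration bookkeeping, assuming plain `iteration-N` directory names:

```bash
# Compute max(N) + 1 over existing iteration-N/ directories (1 if none exist),
# then create the new iteration workspace.
ws="skills/<skill-name>/.workspace"
last=$(ls -d "$ws"/iteration-* 2>/dev/null | sed 's/.*iteration-//' | sort -n | tail -1)
next=$(( ${last:-0} + 1 ))
mkdir -p "$ws/iteration-$next"
```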
3. Dispatch each eval via the Agent tool
For every eval in `evals.json`, run two subagent dispatches (3a and 3b below).
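For orientation, a hypothetical `evals.json` is sketched below. The field names (`id`, `name`, `prompt`, `assertions`) are inferred from how this workflow references `eval.id`, `eval_name`, and `eval.prompt`; the assertion shape and values are invented for illustration. The authoritative schema lives in `skills/skill-creator/references/evals-json.md`:

```json
[
  {
    "id": 0,
    "name": "validate-config",
    "prompt": "Add runtime validation for the service config loader.",
    "assertions": [
      { "type": "contains", "value": "satisfies" },
      { "type": "regex", "value": "export (const|function)" },
      { "type": "file_exists", "value": "src/config.ts" }
    ]
  }
]
```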
3a. WITHOUT skill (RED baseline)
Use the Agent tool with `subagent_type: general-purpose`. The prompt template:

```
Execute this task exactly:

[eval.prompt]

No skill is loaded for this task. After attempting it, report what you did,
what decisions you made and why, and anything you found tricky. Report
verbatim — do not polish, do not summarize. Include any code you wrote
inline so it can be analyzed.
```

Save the agent's reply (the entire response text) to:
`skills/<skill-name>/.workspace/iteration-<N>/eval-<id>-<eval_name>/without_skill/transcript.md`
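The save itself is a verbatim file write. A shell-flavored sketch, where `$AGENT_REPLY` is a hypothetical variable standing in for the subagent's full response (in practice the orchestrating agent writes the file with its own tools):

```bash
out="skills/<skill-name>/.workspace/iteration-<N>/eval-<id>-<eval_name>/without_skill"
mkdir -p "$out"
# Byte-for-byte, no trimming or summarizing; grading runs against this text.
printf '%s\n' "$AGENT_REPLY" > "$out/transcript.md"
```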
3b. WITH skill (GREEN run)
Use the Agent tool again, this time including the target skill's full SKILL.md content as system context. The prompt template:
```
Execute this task exactly:

[eval.prompt]

The skill `<skill-name>` is available — apply its rules and patterns.
After attempting it, report what you did, what decisions you made and why,
and anything you found tricky. Report verbatim — do not polish, do not
summarize. Include any code you wrote inline.

If you considered skipping any rule from the skill, capture the exact
reasoning verbatim — that's the kind of failure mode the skill needs to
catch.
```

Save the response to:
`skills/<skill-name>/.workspace/iteration-<N>/eval-<id>-<eval_name>/with_skill/transcript.md`

4. Grade each transcript
For each transcript saved, invoke `skill-tools eval` to run assertions:

```bash
node packages/skill-tools/dist/index.mjs eval <skill-name> <eval.id> \
  --variant <with_skill|without_skill> \
  --iteration <N> \
  --transcript <path-to-transcript.md>
```

This writes `grading.json` next to the transcript. Each assertion is regex / contains / file_exists — deterministic, no LLM-as-judge.
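Filled in with the concrete values from the `ts-best-practices` walkthrough in the Examples section (illustrative, not prescriptive):

```bash
node packages/skill-tools/dist/index.mjs eval ts-best-practices 0 \
  --variant with_skill \
  --iteration 2 \
  --transcript skills/ts-best-practices/.workspace/iteration-2/eval-0-validate-config/with_skill/transcript.md
```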
5. Generate the benchmark
After all evals are graded for a skill:
```bash
node packages/skill-tools/dist/index.mjs benchmark <skill-name>
```

This aggregates the grading.json files into `benchmark.json` and `benchmark.md` for that iteration.

6. Report
Summarize per skill:
- Which evals improved with the skill loaded vs. without
- Any evals that failed with the skill loaded (regression to investigate)
- Path to the benchmark and to the latest iteration directory
Suggest the user run `pnpm skill-tools view <skill-name>` to navigate transcripts in the TUI.

Examples
<example>
<input>"rerun the baselines for ts-best-practices"</input>
<output>
1. Resolve: skill at `skills/ts-best-practices/`, evals.json present with 3 cases.
2. Workspace: `skills/ts-best-practices/.workspace/iteration-2/` (iteration-1 already exists).
3. For each eval: dispatch Agent(general-purpose) twice (without/with skill), save transcripts.
4. Grade each transcript via skill-tools eval.
5. Aggregate via skill-tools benchmark → benchmark.md.
6. Report: "with skill 7/9 passed, without skill 3/9 — improvement on eval-1, regression on eval-2".
</output>
</example>
<example>
<input>"check for skill drift on all skills"</input>
<output>
1. --all: find every dir under skills/ + .agents/skills/ with both SKILL.md and evals.json.
2. For each: run the same 6-step workflow.
3. Aggregate report: any skills where regression count exceeds improvement count are flagged for review.
</output>
</example>
<good>
Saved transcript verbatim to:
skills/ts-best-practices/.workspace/iteration-2/eval-0-validate-config/with_skill/transcript.md
</good>
<bad>
The agent did roughly what we expected. Skipping transcript save.
</bad>
Always save raw transcripts verbatim — paraphrasing or dropping output makes regression detection impossible.
References
- `evals.json` schema is documented in `skills/skill-creator/references/evals-json.md`
- skill-tools CLI reference: `packages/skill-tools/`