# skill-eval
**Skill Evaluation & Improvement**

Measure and improve skill quality through empirical testing — because structure doesn't guarantee behavior, and measurement beats assumption.
## Operator Context
This skill operates as the eval-driven improvement pipeline for Claude Code skills. It provides four capabilities: trigger evaluation, description optimization, output benchmarking, and structural validation.
### Hardcoded Behaviors (Always Apply)
- Measure before changing: Always run baseline eval before making improvements
- Train/test split: Use 60/40 holdout to prevent overfitting descriptions
- Generalize, don't overfit: Improvements should help across many prompts, not just test cases
- Report results: Always show before/after metrics
### Default Behaviors (ON unless disabled)
- HTML reports: Generate visual reports for description optimization
- Verbose output: Show per-query pass/fail during eval runs
- 3 runs per query: Run each trigger test 3 times for reliability
### Optional Behaviors (OFF unless enabled)
- Blind A/B comparison: Use comparator agent for unbiased output comparison
- Full benchmark suite: Run aggregate benchmarks with timing and token metrics
### What This Skill CAN Do
- Test whether a skill's description triggers correctly for a set of queries
- Optimize descriptions via automated eval+improve loop (train/test split)
- Benchmark skill output quality (with-skill vs without-skill)
- Validate skill structure (frontmatter, naming, description length)
- Generate HTML reports for visual review
### What This Skill CANNOT Do
- Create new skills from scratch (use skill-creator-engineer)
- Modify skill instructions automatically (human reviews changes)
- Test skills that require specific MCP servers or external services
- Run evals without the `claude` CLI available
## Instructions
### Phase 1: ASSESS — Determine what to evaluate
**Step 1: Identify the skill**

```bash
# Validate skill structure first
python3 -m scripts.skill_eval.quick_validate <path/to/skill>
```
**Step 2: Choose evaluation mode**
| User Intent | Mode | Script |
|------------|------|--------|
| "Test if description triggers correctly" | Trigger eval | `run_eval.py` |
| "Optimize/improve the description" | Description optimization | `run_loop.py` |
| "Compare skill vs no-skill output" | Output benchmark | Manual + `aggregate_benchmark.py` |
| "Validate skill structure" | Quick validate | `quick_validate.py` |
**GATE**: Skill path confirmed, mode selected.

### Phase 2: EVALUATE — Run the appropriate evaluation
#### Mode A: Trigger Evaluation
Test whether a skill's description causes Claude to invoke it for the right queries.
**Step 1: Create eval set (or use existing)**
Create a JSON file with 8-20 test queries:
```json
[
  {"query": "realistic user prompt that should trigger", "should_trigger": true},
  {"query": "similar but different domain prompt", "should_trigger": false}
]
```

Eval set quality matters — use realistic prompts with detail (file paths, context, casual phrasing), not abstract one-liners. Focus on edge cases where the skill competes with adjacent skills.
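As a fuller illustration, the snippet below writes a hypothetical eval set for an imagined spreadsheet-formatting skill; the queries and skill domain are invented for the example, not taken from this repo:

```shell
# Write a hypothetical eval set (queries invented for illustration).
cat > evals.json <<'EOF'
[
  {"query": "ok so my boss sent me this xlsx and she wants profit margin as a percentage", "should_trigger": true},
  {"query": "can you clean up budget_2024.xlsx and bold the header row", "should_trigger": true},
  {"query": "write a python script that parses a CSV into a list of dicts", "should_trigger": false},
  {"query": "summarize this meeting transcript into action items", "should_trigger": false}
]
EOF

# Sanity check: valid JSON, with both positive and negative cases.
python3 -c "import json; qs = json.load(open('evals.json')); \
print(len(qs), 'queries,', sum(q['should_trigger'] for q in qs), 'positive')"
```

Keeping a mix of should-trigger and should-not-trigger queries is what lets the eval catch both missed triggers and false positives.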
**Step 2: Run evaluation**

```bash
python3 -m scripts.skill_eval.run_eval \
  --eval-set evals.json \
  --skill-path <path/to/skill> \
  --runs-per-query 3 \
  --verbose
```

This spawns `claude -p` for each query, checking whether it invokes the skill. Output includes pass/fail per query with trigger rates.

**GATE**: Eval results available. Proceed to improvement if failures found.
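Conceptually, the runner's per-query loop looks like the dry-run sketch below; it only echoes the `claude -p` commands it would issue, the query strings are placeholders, and the real logic lives in `run_eval.py`:

```shell
# Dry-run sketch of the trigger-eval loop (echoes commands instead of spawning claude).
for query in "format my xlsx file" "summarize this article"; do
  for run in 1 2 3; do
    # The real script would launch: claude -p "$query"
    # and then inspect the transcript for a skill invocation.
    echo "would run: claude -p \"$query\" (attempt $run)"
  done
done
```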
#### Mode B: Description Optimization
Automated loop that tests, improves, and re-tests descriptions.
```bash
python3 -m scripts.skill_eval.run_loop \
  --eval-set evals.json \
  --skill-path <path/to/skill> \
  --model claude-opus-4-6 \
  --max-iterations 5 \
  --verbose
```

This will:
- Split eval set 60/40 train/test (stratified by should_trigger)
- Evaluate current description on all queries (3 runs each)
- Use Claude with extended thinking to propose improvements based on failures
- Re-evaluate the new description
- Repeat until all pass or max iterations reached
- Select best description by test score (prevents overfitting)
- Open an HTML report in the browser
**GATE**: Loop complete. Best description identified.
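The 60/40 stratified split in the first bullet can be sanity-checked with toy numbers; the counts here are hypothetical, not defaults pulled from `run_loop.py`:

```shell
# Hypothetical 10-query set: 6 should-trigger, 4 should-not.
# Stratified means each class is split 60/40 separately, preserving the ratio.
total_pos=6; total_neg=4
train_pos=$(( total_pos * 60 / 100 ))
train_neg=$(( total_neg * 60 / 100 ))
echo "train: $train_pos positive, $train_neg negative"
echo "test:  $(( total_pos - train_pos )) positive, $(( total_neg - train_neg )) negative"
```

Splitting per class is what keeps a test set from ending up all-positive or all-negative by chance.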
#### Mode C: Output Benchmark
Compare skill quality by running prompts with and without the skill.
**Step 1: Create test prompts** — 2-3 realistic user prompts

**Step 2: Run with-skill and without-skill in parallel subagents**
For each test prompt, spawn two agents:
- With skill: Load the skill, run the prompt, save outputs
- Without skill (baseline): Same prompt, no skill, save outputs
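Step 2 can be sketched as two background jobs per prompt; the echo lines stand in for real claude invocations, and the workspace paths are placeholders:

```shell
# Dry-run sketch: one job per configuration, run in parallel, then joined.
prompt="realistic user prompt"
( echo "with-skill: run '$prompt', save to workspace/with-skill/" ) &
( echo "without-skill: run '$prompt', save to workspace/without-skill/" ) &
wait
echo "both configurations finished"
```

Running both configurations from the same prompt is what makes the later pass-rate delta meaningful.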
**Step 3: Grade outputs**

Spawn a grader subagent using the prompt in `agents/grader.md`. It evaluates assertions against the outputs.

**Step 4: Aggregate**

```bash
python3 -m scripts.skill_eval.aggregate_benchmark <workspace>/iteration-1 --skill-name <name>
```

Produces `benchmark.json` and `benchmark.md` with pass rates, timing, and token usage.

**Step 5: Analyze (optional)**

For blind comparison, use `agents/comparator.md` to judge outputs without knowing which skill produced them. Then use `agents/analyzer.md` to understand why the winner won.

**GATE**: Benchmark results available.
#### Mode D: Quick Validate

```bash
python3 -m scripts.skill_eval.quick_validate <path/to/skill>
```

Checks: SKILL.md exists, valid frontmatter, required fields (name, description), kebab-case naming, description under 1024 chars, no angle brackets.
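A minimal grep-based sketch of those checks (not the actual `quick_validate.py` logic; the demo skill directory is invented for the example):

```shell
# Build a tiny valid skill, then run shell versions of the structural checks.
mkdir -p demo-skill
cat > demo-skill/SKILL.md <<'EOF'
---
name: demo-skill
description: Example description with no angle brackets, well under 1024 chars.
---
EOF

[ -f demo-skill/SKILL.md ] && echo "SKILL.md exists"
grep -q '^name: [a-z0-9][a-z0-9-]*$' demo-skill/SKILL.md && echo "kebab-case name ok"
desc=$(grep '^description:' demo-skill/SKILL.md)
[ "${#desc}" -le 1024 ] && echo "description length ok"
case "$desc" in
  *"<"*|*">"*) echo "angle brackets found" ;;
  *) echo "no angle brackets" ;;
esac
```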
### Phase 3: IMPROVE — Apply results
**Step 1: Review results**
For trigger eval / description optimization:
- Show the best description vs original
- Show per-query results (which queries improved, which regressed)
- Show train vs test scores
For output benchmark:
- Show pass rate delta (with-skill vs without-skill)
- Show timing and token cost delta
- Highlight assertions that only pass with the skill (value-add)
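The pass-rate delta in the first bullet is simple arithmetic over the grading counts; the numbers below are hypothetical:

```shell
# Hypothetical grading counts: 12 assertions per configuration.
with_pass=9;    with_total=12
without_pass=5; without_total=12
with_rate=$(( with_pass * 100 / with_total ))
without_rate=$(( without_pass * 100 / without_total ))
echo "with-skill: ${with_rate}%  without-skill: ${without_rate}%"
echo "delta: $(( with_rate - without_rate )) percentage points"
```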
**Step 2: Apply changes (with user confirmation)**
If description optimization found a better description:
- Show before/after with scores
- Ask user to confirm
- Update the skill's SKILL.md frontmatter
- Re-run quick_validate to confirm the update is valid
**GATE**: Changes applied and validated, or user chose to keep original.
## Error Handling
### Error: "No SKILL.md found"
Cause: Skill path doesn't point to a valid skill directory
Solution: Verify the path contains a `SKILL.md` file. Skills must follow the `skill-name/SKILL.md` structure.

### Error: "claude: command not found"
Cause: Claude CLI not available for trigger evaluation
Solution: Install the Claude Code CLI. Trigger eval requires `claude -p` to test skill invocation.

### Error: "anthropic SDK not installed"
错误:"anthropic SDK not installed"
Cause: Description optimization requires the Anthropic Python SDK
Solution: `pip install anthropic`. Only needed for `improve_description.py` and `run_loop.py`.

### Error: "CLAUDECODE environment variable"
错误:"CLAUDECODE environment variable"
Cause: Running eval from inside a Claude Code session blocks nested instances
Solution: The scripts automatically strip the `CLAUDECODE` env var. If issues persist, run from a separate terminal.

### Error: "All queries timeout"
错误:"All queries timeout"
Cause: Default 30s timeout too short for complex queries
Solution: Increase with `--timeout 60`. Simple trigger queries should complete in <15s.

## Anti-Patterns
反模式
### Anti-Pattern 1: Abstract Eval Queries
What it looks like: "Format this data", "Create a chart"

Why wrong: Real users write detailed, specific prompts. Abstract queries don't test real triggering behavior.

Do instead: Use realistic prompts like "ok so my boss sent me this xlsx file (Q4 sales final FINAL v2.xlsx) and she wants profit margin as a percentage".

### Anti-Pattern 2: Overfitting to Test Cases
What it looks like: Adding specific query text to the description to force triggers
Why wrong: Works for test set, fails on real usage. Bloats the description.
Do instead: Generalize from failures to broader categories of user intent.
### Anti-Pattern 3: Skipping Baseline
What it looks like: Running with-skill only, no without-skill comparison
Why wrong: Can't prove the skill adds value without a baseline. Maybe Claude handles it fine without the skill.
Do instead: Always run both configurations. The delta is what matters.
## References
### Scripts (in `scripts/skill_eval/`)

- `run_eval.py` — Trigger evaluation: tests description against query set
- `run_loop.py` — Eval+improve loop: automated description optimization
- `improve_description.py` — Single-shot description improvement via Claude API
- `generate_report.py` — HTML report from loop output
- `aggregate_benchmark.py` — Benchmark aggregation from grading results
- `quick_validate.py` — Structural validation of SKILL.md
### Bundled Agents (in `skills/skill-eval/agents/`)

- `grader.md` — Evaluates assertions against execution outputs
- `comparator.md` — Blind A/B comparison of two outputs
- `analyzer.md` — Post-hoc analysis of why one version beat another
### Reference Files

- `${CLAUDE_SKILL_DIR}/references/schemas.md` — JSON schemas for evals.json, grading.json, benchmark.json
## Shared Patterns

- Anti-Rationalization
- Verification Checklist