# Skill Benchmark
You are a skill benchmarking system. Your job is to rigorously evaluate whether a Claude Code skill improves performance compared to baseline (no skill).
Methodology based on industry best practices (Anthropic & OpenAI eval guidance):
- Layered grading: deterministic checks first, then LLM-as-judge
- Isolated sandbox per session — clean state, no shared artifacts
- Multiple runs to account for non-determinism
- Negative control tasks to detect false positives
- Transcript analysis for behavioral signals
## Execution Flow
Follow these steps exactly:
### Step 1: Gather Input
The user can run this skill in two ways:

**Option 1: Custom config** — User creates a `config.yml`:

```
cp .claude/skills/skill-benchmark/config.example.yml .claude/skills/skill-benchmark/config.yml
edit config.yml
/skill-benchmark
```

**Option 2: Default run** — No config needed:

```
/skill-benchmark
```

What to do:

1. **Check for `config.yml`** — Look for it in order: (1) `config.yml` in the skill directory, (2) `~/.claude/skills/skill-benchmark/config.yml`, (3) a path passed as argument. If found, read and use those values. If not found, use built-in defaults:

   ```yaml
   runner_model: sonnet
   judge_model: opus
   task_count: 5
   negative_controls: 1
   difficulties: {easy: 2, medium: 2, hard: 1}
   runs: 1
   max_turns: 10
   results_dir: ./skill-bench/results
   ```

2. **Which skill to benchmark** — If `skill` is set in config.yml, use that. Otherwise ask the user via AskUserQuestion. Search common locations:
   - `.claude/skills/<name>/SKILL.md`
   - `~/.claude/skills/<name>/SKILL.md`
   - A direct file path

3. **Task set** — Ask if they have a custom task set directory, or if you should auto-generate tasks based on the skill's domain.

4. **Confirm settings** — Show the user the final config (loaded or default) and ask if they want to change anything before starting.

5. **Set `$RESULTS_DIR`** — Create the results directory with a skill name and timestamp:

   ```bash
   RESULTS_DIR="<results_dir>/<skill_name>-$(date +%Y%m%d-%H%M%S)"
   mkdir -p "$RESULTS_DIR"
   ```

   All subsequent paths (`tasks/`, `sandbox/`, `outputs/`, `grades/`, `report.md`) go under `$RESULTS_DIR`. Do NOT put files directly in the base `results_dir` — always nest under the timestamped subdirectory.
### Step 2: Read & Analyze Target Skill
Read the target skill's SKILL.md file completely. Extract:

- **Domain**: What area does this skill cover? (e.g., code review, testing, deployment)
- **Capabilities**: What specific things does this skill instruct Claude to do?
- **Trigger conditions**: When should this skill be used?
- **Tools used**: What tools does the skill rely on?

Write a brief analysis summary — you'll use this to generate relevant tasks.
### Step 3: Generate Benchmark Tasks
If no custom task set was provided, auto-generate tasks. Design tasks following eval best practices:
**Task Categories (all required):**
1. **Positive tasks** — Tasks where the skill SHOULD help (the majority):
   - Easy (2 tasks): Straightforward tasks in the skill's domain
   - Medium (2 tasks): Tasks requiring deeper application of the skill's guidance
   - Hard (1 task): Complex tasks where the skill's specialized knowledge matters most

2. **Negative control (1 task)**: A task OUTSIDE the skill's domain where the skill should NOT activate or help. This catches false positives — if the skill hurts performance on unrelated tasks, that's a red flag.
#### Task Format
Write each task to `$RESULTS_DIR/tasks/task-NN-<difficulty>.md`:

```markdown
# Task: <descriptive-name>

difficulty: easy|medium|hard
category: <domain>
type: positive|negative-control

## Prompt

<the exact prompt that will be sent to Claude via claude -p>

## Expected Outcome

<clear description of what a correct response looks like>

## Verification Checks

<deterministic checks to run BEFORE LLM grading>
- file_exists: <filename that should be created>
- file_contains: <pattern> in <filename> (or just <pattern> to search all files)
- syntax_valid: <language — run syntax checker>
- runs_without_error: <command to execute, e.g., "python3 <filename>">

## Grading Rubric

- Correctness: <specific criteria for correctness>
- Completeness: <what must be included for full marks>
- Quality: <quality expectations — best practices, clarity, etc.>

## Tags

<comma-separated tags for grouping>
```

**Task design rules:**
- Prompts must be self-contained — no prior context, since they run as fresh `claude -p` sessions
- Include `Verification Checks` with concrete, deterministic things to test (file exists, code runs, output matches)
- Two domain experts should independently reach the same pass/fail verdict — if a task is ambiguous, rewrite it
- Each task must be solvable — the expected outcome must be achievable
---

### Step 4: Run Eval Sessions
For each task, run TWO sessions using `claude -p`. Each session MUST run in its own isolated sandbox directory so they cannot interfere with each other.

#### Multi-Run Support
If `runs > 1` in config, run each task N times. Each run gets its own isolated sandbox and output directory. This accounts for non-determinism in LLM outputs.

Directory structure for multi-run (`runs: 3`):

```
sandbox/task-01/run-1/with-skill/
sandbox/task-01/run-1/baseline/
sandbox/task-01/run-2/with-skill/
sandbox/task-01/run-2/baseline/
sandbox/task-01/run-3/with-skill/
sandbox/task-01/run-3/baseline/
outputs/task-01/run-1/with-skill/
outputs/task-01/run-1/baseline/
outputs/task-01/run-2/with-skill/
outputs/task-01/run-2/baseline/
...
grades/task-01/run-1/with-skill-grade.json
grades/task-01/run-1/baseline-grade.json
...
```

Directory structure for single run (`runs: 1`, the default):

```
sandbox/task-01/with-skill/
sandbox/task-01/baseline/
outputs/task-01/with-skill/
outputs/task-01/baseline/
grades/task-01/with-skill-grade.json
grades/task-01/baseline-grade.json
```

When `runs: 1`, skip the `run-N/` subdirectory level entirely for simpler output.

**Aggregation for multi-run:** After grading all runs, compute per-task:
- avg_score: Mean of weighted_total across all runs
- best_score: Max weighted_total across runs
- worst_score: Min weighted_total across runs
- pass@k: At least 1 run scored >= 70 (task considered "passable")
- pass^k: ALL runs scored >= 70 (task consistently passes)
- std_dev: Standard deviation of scores (high = inconsistent behavior)
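The per-task aggregation above can be sketched as follows. This is an illustrative helper, not one of the skill's scripts; `scores` stands for the list of weighted_total values across a task's runs, and 70 is the pass threshold from the metric definitions.

```python
from statistics import mean, pstdev

def aggregate(scores, threshold=70):
    """Aggregate one task's weighted_total scores across N runs."""
    return {
        "avg_score": mean(scores),
        "best_score": max(scores),
        "worst_score": min(scores),
        "pass_at_k": any(s >= threshold for s in scores),   # pass@k
        "pass_hat_k": all(s >= threshold for s in scores),  # pass^k
        "std_dev": pstdev(scores),  # population std dev; high = inconsistent
    }
```

Note the pass@k / pass^k asymmetry: a task with scores [80, 60, 90] is passable (some run succeeded) but not consistently passing.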
#### Isolation Setup
Before running ANY sessions, create isolated working directories for EVERY session:

```bash
# For runs: 1 (default)
mkdir -p "$RESULTS_DIR/sandbox/task-NN/with-skill"
mkdir -p "$RESULTS_DIR/sandbox/task-NN/baseline"

# For runs: 3 (multi-run)
mkdir -p "$RESULTS_DIR/sandbox/task-NN/run-1/with-skill"
mkdir -p "$RESULTS_DIR/sandbox/task-NN/run-1/baseline"
mkdir -p "$RESULTS_DIR/sandbox/task-NN/run-2/with-skill"
# ... etc.
```

Each `claude -p` call MUST `cd` into its own sandbox directory first. This prevents:
- File collisions (both sessions writing `fibonacci.py` to the same place)
- One session reading files created by the other
- Any shared state between with-skill and baseline runs
- Any shared state between different runs of the same task
#### Nested Session Fix

CRITICAL: Claude Code blocks `claude -p` inside an existing session via the `CLAUDECODE` and `CLAUDE_CODE_ENTRYPOINT` env vars. You MUST unset these.

#### Session Commands
Every `claude -p` call MUST include `--dangerously-skip-permissions` — without it, headless sessions hang forever waiting for a human to approve tool use.

**Session A — With Skill:**

```bash
# For runs: 1 → sandbox/task-NN/with-skill, outputs/task-NN/with-skill
# For runs: 3 → sandbox/task-NN/run-R/with-skill, outputs/task-NN/run-R/with-skill
cd "$SANDBOX_DIR" && \
env -u CLAUDECODE -u CLAUDE_CODE_ENTRYPOINT \
claude -p "<task_prompt>" \
  --output-format stream-json \
  --verbose \
  --dangerously-skip-permissions \
  --allowedTools "Skill(<skill_name>),Read,Edit,Bash,Grep,Glob,Write" \
  --append-system-prompt "IMPORTANT: Before starting any work, you MUST first call the Skill tool with skill=\"<skill_name>\" to load the relevant skill instructions. Follow whatever instructions the skill provides throughout your work." \
  --model <runner_model> \
  --max-turns <max_turns> \
  > "$OUTPUT_DIR/raw_stream.jsonl" 2>&1
```

**Why `--append-system-prompt`?** Without it, the skill is merely *available* as a tool — the model must choose to call it. For straightforward tasks, the model often skips the skill entirely and writes code directly. The appended system prompt ensures the skill is always invoked, making the benchmark a fair comparison of "with skill instructions" vs "without".
**Session B — Baseline (no skill):**

```bash
cd "$SANDBOX_DIR" && \
env -u CLAUDECODE -u CLAUDE_CODE_ENTRYPOINT \
claude -p "<task_prompt>" \
  --output-format stream-json \
  --verbose \
  --dangerously-skip-permissions \
  --allowedTools "Read,Edit,Bash,Grep,Glob,Write" \
  --disallowedTools "Skill" \
  --model <runner_model> \
  --max-turns <max_turns> \
  > "$OUTPUT_DIR/raw_stream.jsonl" 2>&1
```

**Why `--disallowedTools "Skill"`?** The Skill tool is a built-in that `--allowedTools` alone does not restrict. Without explicitly disallowing it, the baseline model may still invoke the skill, contaminating the comparison.

Where `$SANDBOX_DIR` and `$OUTPUT_DIR` depend on run count:
- `runs: 1` → `$RESULTS_DIR/sandbox/task-NN/<mode>` and `$RESULTS_DIR/outputs/task-NN/<mode>`
- `runs: N` → `$RESULTS_DIR/sandbox/task-NN/run-R/<mode>` and `$RESULTS_DIR/outputs/task-NN/run-R/<mode>`

IMPORTANT:
- `--dangerously-skip-permissions` is REQUIRED — without it, `claude -p` hangs waiting for permission approval with no human to click "Allow".
- Replace `<skill_name>` with the ACTUAL skill name from Step 1 (e.g., code-commenter). Do NOT leave `Skill()` empty — that means no skill is loaded and both sessions become identical.
- Always `cd` into the sandbox BEFORE running `claude -p`. This is the isolation mechanism.
- Use absolute paths for the output redirect (`> .../raw_stream.jsonl`) since you're cd'ing.
#### Execution Strategy
- Run Session A and Session B for the SAME task+run in parallel (use background Bash commands)
- Process tasks sequentially to avoid overwhelming the system
- For multi-run: complete all runs of task-01 before starting task-02
- Within a task, you MAY run multiple runs in parallel if system resources allow
- After each session completes, parse `raw_stream.jsonl` to produce THREE files in the output directory:
  - **response.json** — The last `type: "result"` event extracted from the JSONL stream.
  - **transcript.json** — All stream events collected into a JSON array.
  - **meta.json** — Session metadata extracted from response.json. Contains: `session_id`, `model` (from `modelUsage` keys), `skill_name`, `mode`, `stop_reason`, `duration_ms`, `duration_api_ms`, `num_turns`, `total_cost_usd`, and `usage` (input/output/cache tokens). The `scripts/parse_stream.py` script handles this extraction — run it with `--help` for the full field list.

If a session fails or times out, log the error in meta.json and mark it as a failed run (score: 0).
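For intuition, the response.json extraction described above amounts to keeping the last `type: "result"` event in the stream. This is a hypothetical sketch of that step, not the actual `scripts/parse_stream.py` implementation; `jsonl_text` stands for the raw stream contents.

```python
import json

def last_result_event(jsonl_text):
    """Return the final event with type "result" from a JSONL stream, or None."""
    result = None
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("type") == "result":
            result = event  # keep overwriting; the last result event wins
    return result
```

Taking the last such event matters because a stream can contain many intermediate events before the terminal result.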
### Step 5: Grade Outputs (Layered Grading)
Use a two-layer grading approach: deterministic checks first, then LLM-as-judge. This catches clear failures fast and uses the model for nuanced assessment.
#### Layer 1: Deterministic Checks
For each session output, run the deterministic checks script:

```bash
# For runs: 1 → sandbox/task-NN/<mode>, grades/task-NN/<mode>-checks.json
# For runs: N → sandbox/task-NN/run-R/<mode>, grades/task-NN/run-R/<mode>-checks.json
python3 "scripts/run_checks.py" \
  "$RESULTS_DIR/tasks/task-NN-<difficulty>.md" \
  "$SANDBOX_DIR" \
  "$GRADES_DIR/<mode>-checks.json"
```

This script reads the `## Verification Checks` section from the task file and runs each check (file_exists, syntax_valid, runs_without_error, file_contains) in the sandbox directory.

Save results to `$RESULTS_DIR/grades/task-NN/<mode>-checks.json`:

```json
{
  "file_exists": true,
  "syntax_valid": true,
  "runs_without_error": true,
  "file_contains": {"def add": true, "def subtract": true},
  "all_passed": true
}
```

If deterministic checks fail (file missing, syntax error, runtime crash), the task gets a correctness ceiling of 50 regardless of LLM grading — the code doesn't work.
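The correctness ceiling rule above can be expressed in a few lines. This sketch is illustrative (the names `checks` and `llm_score` are assumptions, mirroring the parsed `<mode>-checks.json` and the judge's 0-100 correctness score):

```python
def capped_correctness(llm_score, checks):
    """Cap the judge's correctness score at 50 if deterministic checks failed."""
    if not checks.get("all_passed", False):
        return min(llm_score, 50)  # the code doesn't work: ceiling applies
    return llm_score
```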
#### Layer 2: LLM-as-Judge
For each task, launch a grader subagent (use the Agent tool with `subagent_type: "general-purpose"` and `model` set to the judge model).

The grader prompt MUST include:
- The original task prompt
- The expected outcome from the task file
- The grading rubric from the task file
- The actual output to grade (the `result` field from response.json)
- The deterministic check results from Layer 1
- Instructions to score each criterion on a 0-100 scale with justification

Also tell the grader to READ the actual files the session created in the sandbox directory (`$RESULTS_DIR/sandbox/task-NN/<mode>/`) to verify correctness — don't just grade the text output; verify the code actually exists and is correct.

Grade EACH output independently (do not show the grader both outputs — this prevents comparison bias).

Grading criteria and default weights:
- **Correctness (40%)**: Does the output solve the task correctly? Cap at 50 if deterministic checks failed.
- **Completeness (25%)**: Are all requirements addressed?
- **Quality (20%)**: Code quality, best practices, clarity of explanation
- **Efficiency (15%)**: Was the solution direct and efficient? (Also factor in token usage)

The grader MUST return a structured response. Instruct it to output JSON:

```json
{
  "deterministic_checks_passed": true|false,
  "correctness": { "score": 0-100, "justification": "..." },
  "completeness": { "score": 0-100, "justification": "..." },
  "quality": { "score": 0-100, "justification": "..." },
  "efficiency": { "score": 0-100, "justification": "..." },
  "weighted_total": 0-100,
  "summary": "..."
}
```

Save grades to the corresponding grades directory:
- `runs: 1` → `$RESULTS_DIR/grades/task-NN/with-skill-grade.json` and `baseline-grade.json`
- `runs: N` → `$RESULTS_DIR/grades/task-NN/run-R/with-skill-grade.json` and `baseline-grade.json`

You can run graders for different tasks/runs in parallel using background agents.
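The default weights imply a weighted_total computed roughly as below. This is a sketch for reference, not part of the grader prompt; `grade` mirrors the grader's JSON fields.

```python
# Default criterion weights: 40/25/20/15.
WEIGHTS = {"correctness": 0.40, "completeness": 0.25,
           "quality": 0.20, "efficiency": 0.15}

def weighted_total(grade):
    """Combine the four 0-100 criterion scores into one weighted 0-100 total."""
    return round(sum(grade[c]["score"] * w for c, w in WEIGHTS.items()), 1)
```

For example, scores of 80/60/100/40 combine to 32 + 15 + 20 + 6 = 73.0.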
### Step 6: Analyze Transcripts
Before generating the report, analyze the transcript.json files for behavioral signals. This is critical — scores alone don't tell the full story.

For each session, run the analyze script:

```bash
# For runs: 1 → outputs/task-NN/<mode>/
# For runs: N → outputs/task-NN/run-R/<mode>/
python3 "scripts/analyze_transcript.py" \
  "$OUTPUT_DIR/transcript.json" \
  "$OUTPUT_DIR/behavior.json"
```

This extracts from transcript.json:
- **Tool call counts**: How many times each tool was used (Read, Write, Edit, Bash, etc.)
- **Thrashing detection**: Did the session loop or retry the same action? (same tool called 3+ times consecutively)
- **Error recovery**: Did the session hit errors and recover, or fail silently?

Output (`behavior.json`):

```json
{
  "tool_calls": {"Read": 2, "Write": 1, "Bash": 3, "Edit": 0},
  "total_tool_calls": 6,
  "thrashing_detected": false,
  "errors_encountered": 0,
  "errors_recovered": 0
}
```
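The thrashing heuristic named above (same tool called 3+ times consecutively) can be sketched like this. It is an illustrative version, not the actual `scripts/analyze_transcript.py` logic; `tool_sequence` stands for the ordered tool names pulled from transcript.json.

```python
def thrashing_detected(tool_sequence, threshold=3):
    """True if any tool appears `threshold` or more times in a row."""
    streak, prev = 0, None
    for tool in tool_sequence:
        streak = streak + 1 if tool == prev else 1
        prev = tool
        if streak >= threshold:
            return True
    return False
```

Note this only flags consecutive repeats — alternating patterns like Read/Bash/Read/Bash are not counted as thrashing by this heuristic.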
### Step 7: Generate Comparison Report
After all grading and analysis is complete, generate the final report.

Read all grade files, meta files, and behavior files, then compute:
- **Per-task scores**: Weighted total for skill vs baseline
- **Per-task deltas**: skill_score - baseline_score
- **Aggregate scores**: Average across all tasks
- **Per-criterion aggregates**: Average correctness, completeness, quality, efficiency for each condition
- **Deterministic pass rate**: % of tasks where all deterministic checks passed (skill vs baseline)
- **Negative control results**: How did the skill perform on out-of-domain tasks?
- **Token usage & cost comparison**: From the `total_cost_usd` and `usage` fields in meta.json
- **Behavioral comparison**: Tool usage patterns, thrashing, turn efficiency from behavior.json
- **Verdict logic**:
  - Delta >= +10%: USE — skill significantly helps
  - Delta between +3% and +10%: LIKELY USE — skill provides moderate benefit
  - Delta between -3% and +3%: NEUTRAL — skill has negligible effect
  - Delta between -10% and -3%: LIKELY DON'T USE — skill may hurt
  - Delta <= -10%: DON'T USE — skill significantly hurts
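The verdict thresholds map to a simple ladder. This sketch is illustrative; the boundary handling at exactly +3 and -3 (treated as NEUTRAL here) is one reasonable reading of the ranges, since the original does not pin those edges down.

```python
def verdict(delta):
    """Map the average skill-minus-baseline delta (percentage points) to a verdict."""
    if delta >= 10:
        return "USE"
    if delta > 3:
        return "LIKELY USE"
    if delta >= -3:
        return "NEUTRAL"
    if delta > -10:
        return "LIKELY DON'T USE"
    return "DON'T USE"
```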
Write the report to `$RESULTS_DIR/report.md` using this format:
```markdown
# Skill Benchmark Report: <skill-name>

**Date**: <YYYY-MM-DD HH:MM>
**Runner Model**: <model> | **Judge Model**: <model> | **Tasks**: <N> | **Runs**: <R>

## Verdict: <emoji> <VERDICT>

Skill scores <X>% <higher/lower> than baseline on average.

## Summary

| Metric | With Skill | Baseline | Delta |
|---|---|---|---|
| Avg Score | X% | Y% | +/-Z% |
| Correctness | X% | Y% | +/-Z% |
| Completeness | X% | Y% | +/-Z% |
| Quality | X% | Y% | +/-Z% |
| Efficiency | X% | Y% | +/-Z% |

## Deterministic Check Pass Rate

| Condition | Pass Rate |
|---|---|
| With Skill | X/N tasks (Y%) |
| Baseline | X/N tasks (Y%) |

## Per-Task Breakdown

| # | Task | Type | Difficulty | Skill | Baseline | Delta | Winner |
|---|---|---|---|---|---|---|---|
| 1 | ... | positive | easy | X% | Y% | +/-Z% | Skill/Baseline |
| N | ... | negative | - | X% | Y% | +/-Z% | ... |

## Negative Control Results

<How did the skill perform on out-of-domain tasks? If it hurt performance, flag this.>

## Where Skill Helps

- <identified patterns where skill outperformed baseline>

## Where Skill Hurts

- <identified patterns where baseline outperformed skill>

## Behavioral Analysis

| Metric | With Skill | Baseline | Delta |
|---|---|---|---|
| Avg Tool Calls | X | Y | +/-Z |
| Avg Turns | X | Y | +/-Z |
| Thrashing Detected | X/N | Y/N | |
| Avg Duration (s) | X | Y | +/-Z |
| Avg Cost | $X | $Y | +/-$Z |
| Total Cost | $X | $Y | +/-$Z |

## Recommendations

- <actionable suggestions based on the results>
- <suggestions for improving the skill if it underperformed>
- <flag if skill hurts negative control tasks>
```

Present the report to the user and tell them where the full results are saved.
---

## References
- Output directory structure — full tree for single-run and multi-run modes
- Configuration — config format, variables, and parsing
## Available scripts
- `scripts/parse_stream.py` — Parse `raw_stream.jsonl` → `response.json`, `transcript.json`, `meta.json`
- `scripts/analyze_transcript.py` — Analyze `transcript.json` → `behavior.json` (tool counts, thrashing, errors)
- `scripts/run_checks.py` — Run deterministic verification checks from the task file against the sandbox

Run any script with `--help` for full usage details.
## Error Handling
- If a `claude -p` session fails: log the error, score as 0, continue with remaining tasks
- If a grader agent fails: retry once, then score as "UNGRADED" and exclude from averages
- If the target skill file cannot be found: list available skills and ask the user to choose
- If fewer than 2 tasks complete successfully: abort and report insufficient data
- If deterministic checks crash (e.g., python3 not available): log a warning and skip to LLM grading