# Skill Benchmark
You are a skill benchmarking system. Your job is to rigorously evaluate whether a Claude Code skill improves performance compared to baseline (no skill).
Methodology based on industry best practices (Anthropic & OpenAI eval guidance):
- Layered grading: deterministic checks first, then LLM-as-judge
- Isolated sandbox per session — clean state, no shared artifacts
- Multiple runs to account for non-determinism
- Negative control tasks to detect false positives
- Transcript analysis for behavioral signals
## Execution Flow
Follow these steps exactly:
### Step 1: Gather Input
The user can run this skill in two ways:

**Option 1: Custom config** — User creates a `config.yml`:

```
cp .claude/skills/skill-benchmark/config.example.yml .claude/skills/skill-benchmark/config.yml
edit config.yml
/skill-benchmark
```

**Option 2: Default run** — No config needed:

```
/skill-benchmark
```

What to do:

1. **Check for `config.yml`** — Look for it in order: (1) `config.yml` in the skill directory, (2) `~/.claude/skills/skill-benchmark/config.yml`, (3) a path passed as argument. If found, read and use those values. If not found, use built-in defaults:

   ```yaml
   runner_model: sonnet
   judge_model: opus
   task_count: 5
   negative_controls: 1
   difficulties: {easy: 2, medium: 2, hard: 1}
   runs: 1
   max_turns: 10
   results_dir: ./skill-bench/results
   ```

2. **Which skill to benchmark** — If `skill` is set in config.yml, use that. Otherwise ask the user via AskUserQuestion. Search common locations:
   - `.claude/skills/<name>/SKILL.md`
   - `~/.claude/skills/<name>/SKILL.md`
   - A direct file path

3. **Task set** — Ask if they have a custom task set directory, or if you should auto-generate tasks based on the skill's domain.

4. **Confirm settings** — Show the user the final config (loaded or default) and ask if they want to change anything before starting.

5. **Set `$RESULTS_DIR`** — Create the results directory with a skill name and timestamp:

   ```bash
   RESULTS_DIR="<results_dir>/<skill_name>-$(date +%Y%m%d-%H%M%S)"
   mkdir -p "$RESULTS_DIR"
   ```

   All subsequent paths (`tasks/`, `sandbox/`, `outputs/`, `grades/`, `report.md`) go under `$RESULTS_DIR`. Do NOT put files directly in the base `results_dir` — always nest under the timestamped subdirectory.
### Step 2: Read & Analyze Target Skill
Read the target skill's SKILL.md file completely. Extract:

- **Domain**: What area does this skill cover? (e.g., code review, testing, deployment)
- **Capabilities**: What specific things does this skill instruct Claude to do?
- **Trigger conditions**: When should this skill be used?
- **Tools used**: What tools does the skill rely on?

Write a brief analysis summary — you'll use this to generate relevant tasks.
### Step 3: Generate Benchmark Tasks
If no custom task set was provided, auto-generate tasks. Design tasks following eval best practices:
**Task Categories (all required):**
1. **Positive tasks** — Tasks where the skill SHOULD help (the majority):
   - Easy (2 tasks): Straightforward tasks in the skill's domain
   - Medium (2 tasks): Tasks requiring deeper application of the skill's guidance
   - Hard (1 task): Complex tasks where the skill's specialized knowledge matters most

2. **Negative control (1 task)**: A task OUTSIDE the skill's domain where the skill should NOT activate or help. This catches false positives — if the skill hurts performance on unrelated tasks, that's a red flag.
#### Task Format
Write each task to `$RESULTS_DIR/tasks/task-NN-<difficulty>.md`:

```markdown
# Task: <descriptive-name>

difficulty: easy|medium|hard
category: <domain>
type: positive|negative-control

## Prompt

<the exact prompt that will be sent to Claude via claude -p>

## Expected Outcome

<clear description of what a correct response looks like>

## Verification Checks

<deterministic checks to run BEFORE LLM grading>
- file_exists: <filename that should be created>
- file_contains: <pattern> in <filename> (or just <pattern> to search all files)
- syntax_valid: <language — run syntax checker>
- runs_without_error: <command to execute, e.g., "python3 <filename>">

## Grading Rubric

- Correctness: <specific criteria for correctness>
- Completeness: <what must be included for full marks>
- Quality: <quality expectations — best practices, clarity, etc.>

## Tags

<comma-separated tags for grouping>
```

**Task design rules:**
- Prompts must be self-contained — no prior context, since they run as fresh `claude -p` sessions
- Include `Verification Checks` with concrete, deterministic things to test (file exists, code runs, output matches)
- Two domain experts should independently reach the same pass/fail verdict — if a task is ambiguous, rewrite it
- Each task must be solvable — the expected outcome must be achievable
---

### Step 4: Run Eval Sessions
For each task, run TWO sessions using `claude -p`. Each session MUST run in its own isolated sandbox directory so they cannot interfere with each other.

#### Multi-Run Support
If `runs > 1` in config, run each task N times. Each run gets its own isolated sandbox and output directory. This accounts for non-determinism in LLM outputs.

Directory structure for multi-run (`runs: 3`):

```
sandbox/task-01/run-1/with-skill/
sandbox/task-01/run-1/baseline/
sandbox/task-01/run-2/with-skill/
sandbox/task-01/run-2/baseline/
sandbox/task-01/run-3/with-skill/
sandbox/task-01/run-3/baseline/
outputs/task-01/run-1/with-skill/
outputs/task-01/run-1/baseline/
outputs/task-01/run-2/with-skill/
outputs/task-01/run-2/baseline/
...
grades/task-01/run-1/with-skill-grade.json
grades/task-01/run-1/baseline-grade.json
...
```

Directory structure for single run (`runs: 1`, the default):

```
sandbox/task-01/with-skill/
sandbox/task-01/baseline/
outputs/task-01/with-skill/
outputs/task-01/baseline/
grades/task-01/with-skill-grade.json
grades/task-01/baseline-grade.json
```

When `runs: 1`, skip the `run-N/` subdirectory level entirely for simpler output.

**Aggregation for multi-run:** After grading all runs, compute per-task:
- avg_score: Mean of weighted_total across all runs
- best_score: Max weighted_total across runs
- worst_score: Min weighted_total across runs
- pass@k: At least 1 run scored >= 70 (task considered "passable")
- pass^k: ALL runs scored >= 70 (task consistently passes)
- std_dev: Standard deviation of scores (high = inconsistent behavior)
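The per-task aggregation above can be sketched as follows. This is an illustrative helper, not one of the skill's scripts; `scores` stands for the list of weighted_total values across a task's runs, and 70 is the pass threshold from the metric definitions.

```python
from statistics import mean, pstdev

def aggregate(scores, threshold=70):
    """Aggregate one task's weighted_total scores across N runs."""
    return {
        "avg_score": mean(scores),
        "best_score": max(scores),
        "worst_score": min(scores),
        "pass_at_k": any(s >= threshold for s in scores),   # pass@k
        "pass_hat_k": all(s >= threshold for s in scores),  # pass^k
        "std_dev": pstdev(scores),  # population std dev; high = inconsistent
    }
```

Note the pass@k / pass^k asymmetry: a task with scores [80, 60, 90] is passable (some run succeeded) but not consistently passing.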
#### Isolation Setup
Before running ANY sessions, create isolated working directories for EVERY session:

```bash
# For runs: 1 (default)
mkdir -p "$RESULTS_DIR/sandbox/task-NN/with-skill"
mkdir -p "$RESULTS_DIR/sandbox/task-NN/baseline"

# For runs: 3 (multi-run)
mkdir -p "$RESULTS_DIR/sandbox/task-NN/run-1/with-skill"
mkdir -p "$RESULTS_DIR/sandbox/task-NN/run-1/baseline"
mkdir -p "$RESULTS_DIR/sandbox/task-NN/run-2/with-skill"
# ... etc.
```

Each `claude -p` call MUST `cd` into its own sandbox directory first. This prevents:
- File collisions (both sessions writing `fibonacci.py` to the same place)
- One session reading files created by the other
- Any shared state between with-skill and baseline runs
- Any shared state between different runs of the same task
#### Nested Session Fix

CRITICAL: Claude Code blocks `claude -p` inside an existing session via the `CLAUDECODE` and `CLAUDE_CODE_ENTRYPOINT` env vars. You MUST unset these.

#### Session Commands
Every `claude -p` call MUST include `--dangerously-skip-permissions` — without it, headless sessions hang forever waiting for a human to approve tool use.

**Session A — With Skill:**

```bash
# For runs: 1 → sandbox/task-NN/with-skill, outputs/task-NN/with-skill
# For runs: 3 → sandbox/task-NN/run-R/with-skill, outputs/task-NN/run-R/with-skill
cd "$SANDBOX_DIR" && \
env -u CLAUDECODE -u CLAUDE_CODE_ENTRYPOINT \
claude -p "<task_prompt>" \
  --output-format stream-json \
  --verbose \
  --dangerously-skip-permissions \
  --allowedTools "Skill(<skill_name>),Read,Edit,Bash,Grep,Glob,Write" \
  --append-system-prompt "IMPORTANT: Before starting any work, you MUST first call the Skill tool with skill=\"<skill_name>\" to load the relevant skill instructions. Follow whatever instructions the skill provides throughout your work." \
  --model <runner_model> \
  --max-turns <max_turns> \
  > "$OUTPUT_DIR/raw_stream.jsonl" 2>&1
```

**Why `--append-system-prompt`?** Without it, the skill is merely *available* as a tool — the model must choose to call it. For straightforward tasks, the model often skips the skill entirely and writes code directly. The appended system prompt ensures the skill is always invoked, making the benchmark a fair comparison of "with skill instructions" vs "without".
**Session B — Baseline (no skill):**

```bash
cd "$SANDBOX_DIR" && \
env -u CLAUDECODE -u CLAUDE_CODE_ENTRYPOINT \
claude -p "<task_prompt>" \
  --output-format stream-json \
  --verbose \
  --dangerously-skip-permissions \
  --allowedTools "Read,Edit,Bash,Grep,Glob,Write" \
  --disallowedTools "Skill" \
  --model <runner_model> \
  --max-turns <max_turns> \
  > "$OUTPUT_DIR/raw_stream.jsonl" 2>&1
```

**Why `--disallowedTools "Skill"`?** The Skill tool is a built-in that `--allowedTools` alone does not restrict. Without explicitly disallowing it, the baseline model may still invoke the skill, contaminating the comparison.

Where `$SANDBOX_DIR` and `$OUTPUT_DIR` depend on run count:
- `runs: 1` → `$RESULTS_DIR/sandbox/task-NN/<mode>` and `$RESULTS_DIR/outputs/task-NN/<mode>`
- `runs: N` → `$RESULTS_DIR/sandbox/task-NN/run-R/<mode>` and `$RESULTS_DIR/outputs/task-NN/run-R/<mode>`

IMPORTANT:
- `--dangerously-skip-permissions` is REQUIRED — without it, `claude -p` hangs waiting for permission approval with no human to click "Allow".
- Replace `<skill_name>` with the ACTUAL skill name from Step 1 (e.g., code-commenter). Do NOT leave `Skill()` empty — that means no skill is loaded and both sessions become identical.
- Always `cd` into the sandbox BEFORE running `claude -p`. This is the isolation mechanism.
- Use absolute paths for the output redirect (`> .../raw_stream.jsonl`) since you're cd'ing.
#### Execution Strategy
- Run Session A and Session B for the SAME task+run in parallel (use background Bash commands)
- Process tasks sequentially to avoid overwhelming the system
- For multi-run: complete all runs of task-01 before starting task-02
- Within a task, you MAY run multiple runs in parallel if system resources allow
- After each session completes, parse `raw_stream.jsonl` to produce THREE files in the output directory:
  - **response.json** — The last `type: "result"` event extracted from the JSONL stream.
  - **transcript.json** — All stream events collected into a JSON array.
  - **meta.json** — Session metadata extracted from response.json. Contains: `session_id`, `model` (from `modelUsage` keys), `skill_name`, `mode`, `stop_reason`, `duration_ms`, `duration_api_ms`, `num_turns`, `total_cost_usd`, and `usage` (input/output/cache tokens). The `scripts/parse_stream.py` script handles this extraction — run it with `--help` for the full field list.

If a session fails or times out, log the error in meta.json and mark it as a failed run (score: 0).
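For intuition, the response.json extraction described above amounts to keeping the last `type: "result"` event in the stream. This is a hypothetical sketch of that step, not the actual `scripts/parse_stream.py` implementation; `jsonl_text` stands for the raw stream contents.

```python
import json

def last_result_event(jsonl_text):
    """Return the final event with type "result" from a JSONL stream, or None."""
    result = None
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("type") == "result":
            result = event  # keep overwriting; the last result event wins
    return result
```

Taking the last such event matters because a stream can contain many intermediate events before the terminal result.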
### Step 5: Grade Outputs (Layered Grading)
Use a two-layer grading approach: deterministic checks first, then LLM-as-judge. This catches clear failures fast and uses the model for nuanced assessment.
#### Layer 1: Deterministic Checks
For each session output, run the deterministic checks script:

```bash
# For runs: 1 → sandbox/task-NN/<mode>, grades/task-NN/<mode>-checks.json
# For runs: N → sandbox/task-NN/run-R/<mode>, grades/task-NN/run-R/<mode>-checks.json
python3 "scripts/run_checks.py" \
  "$RESULTS_DIR/tasks/task-NN-<difficulty>.md" \
  "$SANDBOX_DIR" \
  "$GRADES_DIR/<mode>-checks.json"
```

This script reads the `## Verification Checks` section from the task file and runs each check (file_exists, syntax_valid, runs_without_error, file_contains) in the sandbox directory.

Save results to `$RESULTS_DIR/grades/task-NN/<mode>-checks.json`:

```json
{
  "file_exists": true,
  "syntax_valid": true,
  "runs_without_error": true,
  "file_contains": {"def add": true, "def subtract": true},
  "all_passed": true
}
```

If deterministic checks fail (file missing, syntax error, runtime crash), the task gets a correctness ceiling of 50 regardless of LLM grading — the code doesn't work.
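The correctness ceiling rule above can be expressed in a few lines. This sketch is illustrative (the names `checks` and `llm_score` are assumptions, mirroring the parsed `<mode>-checks.json` and the judge's 0-100 correctness score):

```python
def capped_correctness(llm_score, checks):
    """Cap the judge's correctness score at 50 if deterministic checks failed."""
    if not checks.get("all_passed", False):
        return min(llm_score, 50)  # the code doesn't work: ceiling applies
    return llm_score
```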
#### Layer 2: LLM-as-Judge
For each task, launch a grader subagent (use the Agent tool with `subagent_type: "general-purpose"` and `model` set to the judge model).

The grader prompt MUST include:
- The original task prompt
- The expected outcome from the task file
- The grading rubric from the task file
- The actual output to grade (the `result` field from response.json)
- The deterministic check results from Layer 1
- Instructions to score each criterion on a 0-100 scale with justification

Also tell the grader to READ the actual files the session created in the sandbox directory (`$RESULTS_DIR/sandbox/task-NN/<mode>/`) to verify correctness — don't just grade the text output; verify the code actually exists and is correct.

Grade EACH output independently (do not show the grader both outputs — this prevents comparison bias).

Grading criteria and default weights:
- **Correctness (40%)**: Does the output solve the task correctly? Cap at 50 if deterministic checks failed.
- **Completeness (25%)**: Are all requirements addressed?
- **Quality (20%)**: Code quality, best practices, clarity of explanation
- **Efficiency (15%)**: Was the solution direct and efficient? (Also factor in token usage)

The grader MUST return a structured response. Instruct it to output JSON:

```json
{
  "deterministic_checks_passed": true|false,
  "correctness": { "score": 0-100, "justification": "..." },
  "completeness": { "score": 0-100, "justification": "..." },
  "quality": { "score": 0-100, "justification": "..." },
  "efficiency": { "score": 0-100, "justification": "..." },
  "weighted_total": 0-100,
  "summary": "..."
}
```

Save grades to the corresponding grades directory:
- `runs: 1` → `$RESULTS_DIR/grades/task-NN/with-skill-grade.json` and `baseline-grade.json`
- `runs: N` → `$RESULTS_DIR/grades/task-NN/run-R/with-skill-grade.json` and `baseline-grade.json`

You can run graders for different tasks/runs in parallel using background agents.
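The default weights imply a weighted_total computed roughly as below. This is a sketch for reference, not part of the grader prompt; `grade` mirrors the grader's JSON fields.

```python
# Default criterion weights: 40/25/20/15.
WEIGHTS = {"correctness": 0.40, "completeness": 0.25,
           "quality": 0.20, "efficiency": 0.15}

def weighted_total(grade):
    """Combine the four 0-100 criterion scores into one weighted 0-100 total."""
    return round(sum(grade[c]["score"] * w for c, w in WEIGHTS.items()), 1)
```

For example, scores of 80/60/100/40 combine to 32 + 15 + 20 + 6 = 73.0.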
### Step 6: Analyze Transcripts
Before generating the report, analyze the transcript.json files for behavioral signals. This is critical — scores alone don't tell the full story.

For each session, run the analyze script:

```bash
# For runs: 1 → outputs/task-NN/<mode>/
# For runs: N → outputs/task-NN/run-R/<mode>/
python3 "scripts/analyze_transcript.py" \
  "$OUTPUT_DIR/transcript.json" \
  "$OUTPUT_DIR/behavior.json"
```

This extracts from transcript.json:
- **Tool call counts**: How many times each tool was used (Read, Write, Edit, Bash, etc.)
- **Thrashing detection**: Did the session loop or retry the same action? (same tool called 3+ times consecutively)
- **Error recovery**: Did the session hit errors and recover, or fail silently?

Output (`behavior.json`):

```json
{
  "tool_calls": {"Read": 2, "Write": 1, "Bash": 3, "Edit": 0},
  "total_tool_calls": 6,
  "thrashing_detected": false,
  "errors_encountered": 0,
  "errors_recovered": 0
}
```
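The thrashing heuristic named above (same tool called 3+ times consecutively) can be sketched like this. It is an illustrative version, not the actual `scripts/analyze_transcript.py` logic; `tool_sequence` stands for the ordered tool names pulled from transcript.json.

```python
def thrashing_detected(tool_sequence, threshold=3):
    """True if any tool appears `threshold` or more times in a row."""
    streak, prev = 0, None
    for tool in tool_sequence:
        streak = streak + 1 if tool == prev else 1
        prev = tool
        if streak >= threshold:
            return True
    return False
```

Note this only flags consecutive repeats — alternating patterns like Read/Bash/Read/Bash are not counted as thrashing by this heuristic.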
### Step 7: Generate Comparison Report
After all grading and analysis is complete, generate the final report.

Read all grade files, meta files, and behavior files, then compute:
- **Per-task scores**: Weighted total for skill vs baseline
- **Per-task deltas**: skill_score - baseline_score
- **Aggregate scores**: Average across all tasks
- **Per-criterion aggregates**: Average correctness, completeness, quality, efficiency for each condition
- **Deterministic pass rate**: % of tasks where all deterministic checks passed (skill vs baseline)
- **Negative control results**: How did the skill perform on out-of-domain tasks?
- **Token usage & cost comparison**: From the `total_cost_usd` and `usage` fields in meta.json
- **Behavioral comparison**: Tool usage patterns, thrashing, turn efficiency from behavior.json
- **Verdict logic**:
  - Delta >= +10%: USE — skill significantly helps
  - Delta between +3% and +10%: LIKELY USE — skill provides moderate benefit
  - Delta between -3% and +3%: NEUTRAL — skill has negligible effect
  - Delta between -10% and -3%: LIKELY DON'T USE — skill may hurt
  - Delta <= -10%: DON'T USE — skill significantly hurts
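The verdict thresholds map to a simple ladder. This sketch is illustrative; the boundary handling at exactly +3 and -3 (treated as NEUTRAL here) is one reasonable reading of the ranges, since the original does not pin those edges down.

```python
def verdict(delta):
    """Map the average skill-minus-baseline delta (percentage points) to a verdict."""
    if delta >= 10:
        return "USE"
    if delta > 3:
        return "LIKELY USE"
    if delta >= -3:
        return "NEUTRAL"
    if delta > -10:
        return "LIKELY DON'T USE"
    return "DON'T USE"
```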
Write the report to `$RESULTS_DIR/report.md` using this format:
```markdown
# Skill Benchmark Report: <skill-name>

**Date**: <YYYY-MM-DD HH:MM>
**Runner Model**: <model> | **Judge Model**: <model> | **Tasks**: <N> | **Runs**: <R>

## Verdict: <emoji> <VERDICT>

Skill scores <X>% <higher/lower> than baseline on average.

## Summary

| Metric | With Skill | Baseline | Delta |
|---|---|---|---|
| Avg Score | X% | Y% | +/-Z% |
| Correctness | X% | Y% | +/-Z% |
| Completeness | X% | Y% | +/-Z% |
| Quality | X% | Y% | +/-Z% |
| Efficiency | X% | Y% | +/-Z% |

## Deterministic Check Pass Rate

| Condition | Pass Rate |
|---|---|
| With Skill | X/N tasks (Y%) |
| Baseline | X/N tasks (Y%) |

## Per-Task Breakdown

| # | Task | Type | Difficulty | Skill | Baseline | Delta | Winner |
|---|---|---|---|---|---|---|---|
| 1 | ... | positive | easy | X% | Y% | +/-Z% | Skill/Baseline |
| N | ... | negative | - | X% | Y% | +/-Z% | ... |

## Negative Control Results

<How did the skill perform on out-of-domain tasks? If it hurt performance, flag this.>

## Where Skill Helps

- <identified patterns where skill outperformed baseline>

## Where Skill Hurts

- <identified patterns where baseline outperformed skill>

## Behavioral Analysis

| Metric | With Skill | Baseline | Delta |
|---|---|---|---|
| Avg Tool Calls | X | Y | +/-Z |
| Avg Turns | X | Y | +/-Z |
| Thrashing Detected | X/N | Y/N | |
| Avg Duration (s) | X | Y | +/-Z |
| Avg Cost | $X | $Y | +/-$Z |
| Total Cost | $X | $Y | +/-$Z |

## Recommendations

- <actionable suggestions based on the results>
- <suggestions for improving the skill if it underperformed>
- <flag if skill hurts negative control tasks>
```

Present the report to the user and tell them where the full results are saved.
---

## References
- Output directory structure — full tree for single-run and multi-run modes
- Configuration — config format, variables, and parsing
## Available scripts
- `scripts/parse_stream.py` — Parse `raw_stream.jsonl` → `response.json`, `transcript.json`, `meta.json`
- `scripts/analyze_transcript.py` — Analyze `transcript.json` → `behavior.json` (tool counts, thrashing, errors)
- `scripts/run_checks.py` — Run deterministic verification checks from the task file against the sandbox

Run any script with `--help` for full usage details.
## Error Handling
- If a `claude -p` session fails: log the error, score as 0, continue with remaining tasks
- If a grader agent fails: retry once, then score as "UNGRADED" and exclude from averages
- If the target skill file cannot be found: list available skills and ask the user to choose
- If fewer than 2 tasks complete successfully: abort and report insufficient data
- If deterministic checks crash (e.g., python3 not available): log a warning and skip to LLM grading