Skill Creator Pro
A skill for creating new skills and iteratively improving them.
At a high level, the process of creating a skill goes like this:
- Decide what you want the skill to do and roughly how it should do it
- Write a draft of the skill
- Create a few test prompts and run claude-with-access-to-the-skill on them
- Help the user evaluate the results both qualitatively and quantitatively
- While the runs happen in the background, draft some quantitative evals if there aren't any (if some already exist, use them as-is or modify them if something needs to change). Then explain them to the user (or, if they already existed, explain the existing ones)
- Use the eval-viewer/generate_review.py script to show the user the results for them to look at, and also let them look at the quantitative metrics
- Rewrite the skill based on feedback from the user's evaluation of the results (and also if there are any glaring flaws that become apparent from the quantitative benchmarks)
- Repeat until you're satisfied
- Expand the test set and try again at larger scale
Your job when using this skill is to figure out where the user is in this process and then jump in and help them progress through these stages. So for instance, maybe they're like "I want to make a skill for X". You can help narrow down what they mean, write a draft, write the test cases, figure out how they want to evaluate, run all the prompts, and repeat.
On the other hand, maybe they already have a draft of the skill. In this case you can go straight to the eval/iterate part of the loop.
Of course, you should always be flexible and if the user is like "I don't need to run a bunch of evaluations, just vibe with me", you can do that instead.
Then after the skill is done (but again, the order is flexible), you can also run the skill description improver, which we have a whole separate script for, to optimize the triggering of the skill.
Cool? Cool.
Communicating with the user
The skill creator is liable to be used by people across a wide range of familiarity with coding jargon. If you haven't heard (and how could you, it's only very recently that it started), there's a trend now where the power of Claude is inspiring plumbers to open up their terminals, parents and grandparents to google "how to install npm". On the other hand, the bulk of users are probably fairly computer-literate.
So please pay attention to context cues to understand how to phrase your communication! In the default case, just to give you some idea:
- "evaluation" and "benchmark" are borderline, but OK
- for "JSON" and "assertion" you want to see serious cues from the user that they know what those things are before using them without explaining them
If you're unsure whether the user will understand a term, it's OK to clarify it with a short definition.
Creating a skill
Capture Intent
Start by understanding the user's intent. The current conversation might already contain a workflow the user wants to capture (e.g., they say "turn this into a skill"). If so, extract answers from the conversation history first — the tools used, the sequence of steps, corrections the user made, input/output formats observed. Ask the user to fill any remaining gaps, and confirm with them before proceeding to the next step.
- What should this skill enable Claude to do?
- When should this skill trigger? (what user phrases/contexts)
- What's the expected output format?
- Should we set up test cases to verify the skill works? Skills with objectively verifiable outputs (file transforms, data extraction, code generation, fixed workflow steps) benefit from test cases. Skills with subjective outputs (writing style, art) often don't need them. Suggest the appropriate default based on the skill type, but let the user decide.
- Which use case pattern does this skill follow? (See references/design_principles.md for the three categories: Document & Asset Creation, Workflow Automation, or MCP Enhancement. Understanding the pattern helps guide design decisions.)
Interview and Research
Proactively ask questions about edge cases, input/output formats, example files, success criteria, and dependencies. Wait to write test prompts until you've got this part ironed out.
Check available MCPs - if useful for research (searching docs, finding similar skills, looking up best practices), research in parallel via subagents if available, otherwise inline. Come prepared with context to reduce burden on the user.
Write the SKILL.md
Based on the user interview, fill in these components:
- name: Skill identifier (kebab-case, no "claude" or "anthropic" in name - see references/constraints_and_rules.md)
- description: When to trigger, what it does. This is the primary triggering mechanism - include both what the skill does AND specific contexts for when to use it. All "when to use" info goes here, not in the body. Follow the formula: [What it does] + [When to use] + [Trigger phrases]. Must be under 1024 characters, no XML angle brackets. Note: currently Claude has a tendency to "undertrigger" skills -- to not use them when they'd be useful. To combat this, please make the skill descriptions a little bit "pushy". So for instance, instead of "How to build a simple fast dashboard to display internal Anthropic data.", you might write "How to build a simple fast dashboard to display internal Anthropic data. Make sure to use this skill whenever the user mentions dashboards, data visualization, internal metrics, or wants to display any kind of company data, even if they don't explicitly ask for a 'dashboard.'" See references/constraints_and_rules.md for detailed guidance.
- compatibility: Required tools, dependencies (optional, rarely needed)
- the rest of the skill :)
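The frontmatter constraints above (kebab-case name without "claude" or "anthropic", description under 1024 characters with no XML angle brackets) are mechanical enough to verify with a short script. A minimal sketch, assuming nothing beyond the rules just listed; the function name is illustrative, not part of any existing tooling:

```python
import re

def check_frontmatter(name: str, description: str) -> list[str]:
    """Return a list of constraint violations (empty list = OK)."""
    problems = []
    # kebab-case: lowercase alphanumeric chunks separated by single hyphens
    if not re.fullmatch(r"[a-z0-9]+(-[a-z0-9]+)*", name):
        problems.append("name must be kebab-case (lowercase, hyphen-separated)")
    if "claude" in name or "anthropic" in name:
        problems.append('name must not contain "claude" or "anthropic"')
    if len(description) >= 1024:
        problems.append("description must be under 1024 characters")
    if "<" in description or ">" in description:
        problems.append("description must not contain XML angle brackets")
    return problems

print(check_frontmatter("skill-creator-pro", "Creates and iterates on skills."))  # prints []
```

Running this before publishing catches the easy-to-miss failures (an over-long description, a stray angle bracket) without a manual review pass.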
Skill Writing Guide
Before diving into the details, familiarize yourself with core concepts:
- Read references/design_principles.md for the three design principles (Progressive Disclosure, Composability, Portability) and three common use case patterns (Document Creation, Workflow Automation, MCP Enhancement)
- Read references/constraints_and_rules.md for technical constraints, naming conventions, and security requirements
- Keep references/quick_checklist.md handy for pre-publication verification
Anatomy of a Skill
skill-name/
├── SKILL.md (required)
│   ├── YAML frontmatter (name, description required)
│   └── Markdown instructions
└── Bundled Resources (optional)
    ├── scripts/ - Executable code for deterministic/repetitive tasks
    ├── references/ - Docs loaded into context as needed
    └── assets/ - Files used in output (templates, icons, fonts)
Progressive Disclosure
Skills use a three-level loading system:
- Metadata (name + description) - Always in context (~100 words)
- SKILL.md body - In context whenever skill triggers (<500 lines ideal)
- Bundled resources - As needed (unlimited, scripts can execute without loading)
These word counts are approximate and you can feel free to go longer if needed.
Key patterns:
- Keep SKILL.md under 500 lines; if you're approaching this limit, add an additional layer of hierarchy along with clear pointers about where the model using the skill should go next to follow up.
- Reference files clearly from SKILL.md with guidance on when to read them
- For large reference files (>300 lines), include a table of contents
Domain organization: When a skill supports multiple domains/frameworks, organize by variant:
cloud-deploy/
├── SKILL.md (workflow + selection)
└── references/
    ├── aws.md
    ├── gcp.md
    └── azure.md

Claude reads only the relevant reference file.
Principle of Lack of Surprise
This goes without saying, but skills must not contain malware, exploit code, or any content that could compromise system security. A skill's contents, once described, should not surprise the user about its intent. Don't go along with requests to create misleading skills or skills designed to facilitate unauthorized access, data exfiltration, or other malicious activities. Things like a "roleplay as an XYZ" are OK though.
Writing Patterns
Prefer using the imperative form in instructions.

Defining output formats - You can do it like this:

```markdown
## Report structure

ALWAYS use this exact template:

# [Title]
## Executive summary
## Key findings
## Recommendations
```
**Examples pattern** - It's useful to include examples. You can format them like this (but if "Input" and "Output" are in the examples you might want to deviate a little):

```markdown
## Commit message format

Example 1:
Input: Added user authentication with JWT tokens
Output: feat(auth): implement JWT-based authentication
```

Writing Style
Try to explain to the model why things are important in lieu of heavy-handed musty MUSTs. Use theory of mind and try to make the skill general and not super-narrow to specific examples. Start by writing a draft and then look at it with fresh eyes and improve it.
Test Cases
After writing the skill draft, come up with 2-3 realistic test prompts — the kind of thing a real user would actually say. Share them with the user: [you don't have to use this exact language] "Here are a few test cases I'd like to try. Do these look right, or do you want to add more?" Then run them.
Save test cases to evals/evals.json. Don't write assertions yet — just the prompts. You'll draft assertions in the next step while the runs are in progress.

```json
{
  "skill_name": "example-skill",
  "evals": [
    {
      "id": 1,
      "prompt": "User's task prompt",
      "expected_output": "Description of expected result",
      "files": []
    }
  ]
}
```

See references/schemas.md for the full schema (including the assertions field, which you'll add later).
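As a quick sanity check before spawning runs, the evals file can be validated against the fields shown above. A sketch under the assumption that only those four fields are required per eval; the authoritative schema lives in references/schemas.md:

```python
import json

def load_evals(path: str) -> list[dict]:
    """Load evals.json and verify each eval has the required fields."""
    with open(path) as f:
        data = json.load(f)
    required = {"id", "prompt", "expected_output", "files"}
    for ev in data["evals"]:
        missing = required - ev.keys()
        if missing:
            raise ValueError(f"eval {ev.get('id')} missing fields: {sorted(missing)}")
    return data["evals"]
```

Catching a malformed eval here is much cheaper than discovering it after the subagent runs have already burned tokens.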
Plugin Integration Check
IMPORTANT: After writing the skill draft, check if this skill is part of a Claude Code plugin. If the skill path contains .claude-plugins/ or plugins/, automatically perform a plugin integration check.

When to Check
Check plugin integration if:
- Skill path contains .claude-plugins/ or plugins/
- User mentions "plugin", "command", or "agent" in context
- You notice related commands or agents in the same directory structure
What to Check
1. Detect Plugin Context

```bash
# Look for plugin.json in parent directories
SKILL_DIR="path/to/skill"
CURRENT_DIR=$(dirname "$SKILL_DIR")
while [ "$CURRENT_DIR" != "/" ]; do
  if [ -f "$CURRENT_DIR/.claude-plugin/plugin.json" ]; then
    echo "Found plugin at: $CURRENT_DIR"
    break
  fi
  CURRENT_DIR=$(dirname "$CURRENT_DIR")
done
```

2. Check for Related Components
- Look for commands/ directory - are there commands that should use this skill?
- Look for agents/ directory - are there agents that should reference this skill?
- Search for skill name in existing commands and agents
- Look for

3. Verify Three-Layer Architecture

The plugin should follow this pattern:
Command (Orchestration) → Agent (Execution) → Skill (Knowledge)

Command Layer should:
- Check prerequisites (is service running?)
- Gather user requirements (use AskUserQuestion)
- Delegate complex work to agent
- Verify final results

Agent Layer should:
- Define clear capabilities
- Reference skill for API/implementation details
- Outline execution workflow
- Handle errors and iteration

Skill Layer should:
- Document API endpoints and usage
- Provide best practices
- Include examples
- Add troubleshooting guide
- NOT contain workflow logic (that's in commands)

4. Generate Integration Report

If this skill is part of a plugin, generate a brief report:

```markdown
## Plugin Integration Status

Plugin: {name} v{version}
Skill: {skill-name}

### Related Components
- Commands: {list or "none found"}
- Agents: {list or "none found"}

### Architecture Check
- [ ] Command orchestrates workflow
- [ ] Agent executes autonomously
- [ ] Skill documents knowledge
- [ ] Clear separation of concerns

### Recommendations
{specific suggestions if integration is incomplete}
```

5. Offer to Fix Integration Issues

If you find issues:
- Missing command that should orchestrate this skill
- Agent that doesn't reference the skill
- Command that tries to do everything (monolithic)
- Skill that contains workflow logic

Offer to create/fix these components following the three-layer pattern.
Example Integration Check
```bash
After creating skill at: plugins/my-plugin/skills/api-helper/

1. Detect plugin
Found plugin: my-plugin v1.0.0

2. Check for related components
Commands found:
- commands/api-call.md (references api-helper ✅)
Agents found:
- agents/api-executor.md (references api-helper ✅)

3. Verify architecture
✅ Command delegates to agent
✅ Agent references skill
✅ Skill documents API only
✅ Clear separation of concerns

Integration Score: 0.9 (Excellent)
```

Reference Documentation
For detailed architecture guidance, see:
- PLUGIN_ARCHITECTURE.md in project root
- tldraw-helper/ARCHITECTURE.md for reference implementation
- tldraw-helper/commands/draw.md for example command

After integration check, proceed with test cases as normal.
Running and evaluating test cases
This section is one continuous sequence — don't stop partway through. Do NOT use /skill-test or any other testing skill.

Put results in <skill-name>-workspace/ as a sibling to the skill directory. Within the workspace, organize results by iteration (iteration-1/, iteration-2/, etc.) and within that, each test case gets a directory (eval-0/, eval-1/, etc.). Don't create all of this upfront — just create directories as you go.

Step 1: Spawn all runs (with-skill AND baseline) in the same turn
For each test case, spawn two subagents in the same turn — one with the skill, one without. This is important: don't spawn the with-skill runs first and then come back for baselines later. Launch everything at once so it all finishes around the same time.
With-skill run:

Execute this task:
- Skill path: <path-to-skill>
- Task: <eval prompt>
- Input files: <eval files if any, or "none">
- Save outputs to: <workspace>/iteration-<N>/eval-<ID>/with_skill/outputs/
- Outputs to save: <what the user cares about — e.g., "the .docx file", "the final CSV">

Baseline run (same prompt, but the baseline depends on context):
- Creating a new skill: no skill at all. Same prompt, no skill path, save to without_skill/outputs/.
- Improving an existing skill: the old version. Before editing, snapshot the skill (cp -r <skill-path> <workspace>/skill-snapshot/), then point the baseline subagent at the snapshot. Save to old_skill/outputs/.

Write an eval_metadata.json for each test case (assertions can be empty for now). Give each eval a descriptive name based on what it's testing — not just "eval-0". Use this name for the directory too. If this iteration uses new or modified eval prompts, create these files for each new eval directory — don't assume they carry over from previous iterations.

```json
{
  "eval_id": 0,
  "eval_name": "descriptive-name-here",
  "prompt": "The user's task prompt",
  "assertions": []
}
```
Step 2: While runs are in progress, draft assertions
Don't just wait for the runs to finish — you can use this time productively. Draft quantitative assertions for each test case and explain them to the user. If assertions already exist in evals/evals.json, review them and explain what they check.

Good assertions are objectively verifiable and have descriptive names — they should read clearly in the benchmark viewer so someone glancing at the results immediately understands what each one checks. Subjective skills (writing style, design quality) are better evaluated qualitatively — don't force assertions onto things that need human judgment.

Update eval_metadata.json and evals/evals.json with the assertions once drafted. Also explain to the user what they'll see in the viewer — both the qualitative outputs and the quantitative benchmark.

Step 3: As runs complete, capture timing data
When each subagent task completes, you receive a notification containing total_tokens and duration_ms. Save this data immediately to timing.json in the run directory:

```json
{
  "total_tokens": 84852,
  "duration_ms": 23332,
  "total_duration_seconds": 23.3
}
```

This is the only opportunity to capture this data — it comes through the task notification and isn't persisted elsewhere. Process each notification as it arrives rather than trying to batch them.
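Persisting the notification data can be a one-liner per run. A sketch, assuming you've already extracted total_tokens and duration_ms from the task notification; the helper name is hypothetical:

```python
import json
import os

def save_timing(run_dir: str, total_tokens: int, duration_ms: int) -> None:
    """Persist token/latency data from a task notification to timing.json."""
    timing = {
        "total_tokens": total_tokens,
        "duration_ms": duration_ms,
        # derived convenience field, rounded to one decimal
        "total_duration_seconds": round(duration_ms / 1000, 1),
    }
    os.makedirs(run_dir, exist_ok=True)
    with open(os.path.join(run_dir, "timing.json"), "w") as f:
        json.dump(timing, f, indent=2)
```

Calling this inside the notification handler (rather than at the end of the batch) is what guarantees the data isn't lost.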
Step 4: Grade, aggregate, and launch the viewer
Once all runs are done:

1. Grade each run — spawn a grader subagent (or grade inline) that reads agents/grader.md and evaluates each assertion against the outputs. Save results to grading.json in each run directory. The grading.json expectations array must use the fields text, passed, and details (not evidence/name/met or other variants) — the viewer depends on these exact field names. For assertions that can be checked programmatically, write and run a script rather than eyeballing it — scripts are faster, more reliable, and can be reused across iterations.

2. Aggregate into benchmark — run the aggregation script from the skill-creator directory:

```bash
python -m scripts.aggregate_benchmark <workspace>/iteration-N --skill-name <name>
```

This produces benchmark.json and benchmark.md with pass_rate, time, and tokens for each configuration, with mean ± stddev and the delta. If generating benchmark.json manually, see references/schemas.md for the exact schema the viewer expects. Put each with_skill version before its baseline counterpart.

3. Do an analyst pass — read the benchmark data and surface patterns the aggregate stats might hide. See agents/analyzer.md (the "Analyzing Benchmark Results" section) for what to look for — things like assertions that always pass regardless of skill (non-discriminating), high-variance evals (possibly flaky), and time/token tradeoffs.

4. Launch the viewer with both qualitative outputs and quantitative data:

```bash
nohup python <skill-creator-path>/eval-viewer/generate_review.py \
  <workspace>/iteration-N \
  --skill-name "my-skill" \
  --benchmark <workspace>/iteration-N/benchmark.json \
  > /dev/null 2>&1 &
VIEWER_PID=$!
```

For iteration 2+, also pass --previous-workspace <workspace>/iteration-<N-1>.

Cowork / headless environments: If webbrowser.open() is not available or the environment has no display, use --static <output_path> to write a standalone HTML file instead of starting a server. Feedback will be downloaded as a feedback.json file when the user clicks "Submit All Reviews". After download, copy feedback.json into the workspace directory for the next iteration to pick up.

Note: please use generate_review.py to create the viewer; there's no need to write custom HTML.

5. Tell the user something like: "I've opened the results in your browser. There are two tabs — 'Outputs' lets you click through each test case and leave feedback, 'Benchmark' shows the quantitative comparison. When you're done, come back here and let me know."
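For the programmatic grading in step 1, a grader script can emit grading.json directly. A sketch; the helper functions and the CSV check are hypothetical examples, but the text/passed/details field names match what the viewer expects per step 1:

```python
import json

def check_csv_header(csv_path: str, expected: str) -> tuple[str, bool, str]:
    """One example programmatic assertion: does the CSV start with the expected header?"""
    with open(csv_path) as f:
        first = f.readline().strip()
    return (f"CSV header is '{expected}'", first == expected, f"found '{first}'")

def write_grading(path: str, results: list[tuple[str, bool, str]]) -> None:
    """Write (assertion text, passed?, evidence) triples in the viewer's expected shape."""
    grading = {
        "expectations": [
            {"text": text, "passed": passed, "details": details}
            for text, passed, details in results
        ]
    }
    with open(path, "w") as f:
        json.dump(grading, f, indent=2)
```

Because the field names are pinned down in one place, the same script can be rerun unchanged on every iteration's outputs.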
What the user sees in the viewer
The "Outputs" tab shows one test case at a time:
- Prompt: the task that was given
- Output: the files the skill produced, rendered inline where possible
- Previous Output (iteration 2+): collapsed section showing last iteration's output
- Formal Grades (if grading was run): collapsed section showing assertion pass/fail
- Feedback: a textbox that auto-saves as they type
- Previous Feedback (iteration 2+): their comments from last time, shown below the textbox
The "Benchmark" tab shows the stats summary: pass rates, timing, and token usage for each configuration, with per-eval breakdowns and analyst observations.
Navigation is via prev/next buttons or arrow keys. When done, they click "Submit All Reviews" which saves all feedback to feedback.json.

Step 5: Read the feedback
When the user tells you they're done, read feedback.json:

```json
{
  "reviews": [
    {"run_id": "eval-0-with_skill", "feedback": "the chart is missing axis labels", "timestamp": "..."},
    {"run_id": "eval-1-with_skill", "feedback": "", "timestamp": "..."},
    {"run_id": "eval-2-with_skill", "feedback": "perfect, love this", "timestamp": "..."}
  ],
  "status": "complete"
}
```

Empty feedback means the user thought it was fine. Focus your improvements on the test cases where the user had specific complaints.

Kill the viewer server when you're done with it:

```bash
kill $VIEWER_PID 2>/dev/null
```
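Reading the feedback back can be sketched like this; empty strings are treated as approval, as noted above (the function name is illustrative):

```python
import json

def actionable_feedback(path: str) -> dict[str, str]:
    """Return {run_id: feedback} for reviews with non-empty feedback."""
    with open(path) as f:
        data = json.load(f)
    return {
        r["run_id"]: r["feedback"]
        for r in data["reviews"]
        if r["feedback"].strip()  # blank feedback = user was satisfied
    }
```

The resulting dict is exactly the set of test cases worth targeting in the next revision.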
Improving the skill
This is the heart of the loop. You've run the test cases, the user has reviewed the results, and now you need to make the skill better based on their feedback.
How to think about improvements
- Generalize from the feedback. The big picture thing that's happening here is that we're trying to create skills that can be used a million times (maybe literally, maybe even more, who knows) across many different prompts. Here you and the user are iterating on only a few examples over and over again because it helps move faster. The user knows these examples in and out and it's quick for them to assess new outputs. But if the skill you and the user are codeveloping works only for those examples, it's useless. Rather than put in fiddly overfitty changes, or oppressively constrictive MUSTs, if there's some stubborn issue, you might try branching out and using different metaphors, or recommending different patterns of working. It's relatively cheap to try and maybe you'll land on something great.

- Keep the prompt lean. Remove things that aren't pulling their weight. Make sure to read the transcripts, not just the final outputs — if it looks like the skill is making the model waste a bunch of time doing things that are unproductive, you can try getting rid of the parts of the skill that are making it do that and seeing what happens.

- Explain the why. Try hard to explain the why behind everything you're asking the model to do. Today's LLMs are smart. They have good theory of mind and when given a good harness can go beyond rote instructions and really make things happen. Even if the feedback from the user is terse or frustrated, try to actually understand the task, why the user wrote what they wrote, and what they actually wrote, and then transmit this understanding into the instructions. If you find yourself writing ALWAYS or NEVER in all caps, or using super rigid structures, that's a yellow flag — if possible, reframe and explain the reasoning so that the model understands why the thing you're asking for is important. That's a more humane, powerful, and effective approach.

- Look for repeated work across test cases. Read the transcripts from the test runs and notice if the subagents all independently wrote similar helper scripts or took the same multi-step approach to something. If all 3 test cases resulted in the subagent writing a create_docx.py or a build_chart.py, that's a strong signal the skill should bundle that script. Write it once, put it in scripts/, and tell the skill to use it. This saves every future invocation from reinventing the wheel.

This task is pretty important (we are trying to create billions a year in economic value here!) and your thinking time is not the blocker; take your time and really mull things over. I'd suggest writing a draft revision and then looking at it anew and making improvements. Really do your best to get into the head of the user and understand what they want and need.
The iteration loop
After improving the skill:
- Apply your improvements to the skill
- Rerun all test cases into a new directory, iteration-<N+1>/, including baseline runs. If you're creating a new skill, the baseline is always without_skill (no skill) — that stays the same across iterations. If you're improving an existing skill, use your judgment on what makes sense as the baseline: the original version the user came in with, or the previous iteration.
- Launch the reviewer with --previous-workspace pointing at the previous iteration
- Wait for the user to review and tell you they're done
- Read the new feedback, improve again, repeat
Keep going until:
- The user says they're happy
- The feedback is all empty (everything looks good)
- You're not making meaningful progress
Advanced: Blind comparison
For situations where you want a more rigorous comparison between two versions of a skill (e.g., the user asks "is the new version actually better?"), there's a blind comparison system. Read agents/comparator.md and agents/analyzer.md for the details. The basic idea is: give two outputs to an independent agent without telling it which is which, and let it judge quality. Then analyze why the winner won.
This is optional, requires subagents, and most users won't need it. The human review loop is usually sufficient.
Description Optimization
The description field in SKILL.md frontmatter is the primary mechanism that determines whether Claude invokes a skill. After creating or improving a skill, offer to optimize the description for better triggering accuracy.
Step 1: Generate trigger eval queries
Create 20 eval queries — a mix of should-trigger and should-not-trigger. Save as JSON:

```json
[
  {"query": "the user prompt", "should_trigger": true},
  {"query": "another prompt", "should_trigger": false}
]
```

The queries must be realistic and something a Claude Code or Claude.ai user would actually type. Not abstract requests, but requests that are concrete and specific and have a good amount of detail. For instance, file paths, personal context about the user's job or situation, column names and values, company names, URLs. A little bit of backstory. Some might be in lowercase or contain abbreviations or typos or casual speech. Use a mix of different lengths, and focus on edge cases rather than making them clear-cut (the user will get a chance to sign off on them).
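Before showing the set to the user, it's cheap to sanity-check its balance and shape. A sketch under the schema above — the thresholds here are illustrative, not requirements:

```python
import json

def check_eval_set(items: list[dict]) -> dict:
    """Summarize a trigger eval set: counts per side, duplicate queries,
    and whether the split is roughly balanced (8-12 per side)."""
    pos = [q for q in items if q["should_trigger"]]
    queries = [q["query"] for q in items]
    return {
        "total": len(items),
        "should_trigger": len(pos),
        "should_not_trigger": len(items) - len(pos),
        "duplicates": len(queries) - len(set(queries)),
        "balanced": abs(2 * len(pos) - len(items)) <= 4,
    }

eval_set = json.loads("""
[
  {"query": "the user prompt", "should_trigger": true},
  {"query": "another prompt", "should_trigger": false}
]
""")
print(check_eval_set(eval_set))
# → {'total': 2, 'should_trigger': 1, 'should_not_trigger': 1, 'duplicates': 0, 'balanced': True}
```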
Bad: "Format this data", "Extract text from PDF", "Create a chart"

Good: "ok so my boss just sent me this xlsx file (its in my downloads, called something like 'Q4 sales final FINAL v2.xlsx') and she wants me to add a column that shows the profit margin as a percentage. The revenue is in column C and costs are in column D i think"

For the should-trigger queries (8-10), think about coverage. You want different phrasings of the same intent — some formal, some casual. Include cases where the user doesn't explicitly name the skill or file type but clearly needs it. Throw in some uncommon use cases and cases where this skill competes with another but should win.
For the should-not-trigger queries (8-10), the most valuable ones are the near-misses — queries that share keywords or concepts with the skill but actually need something different. Think adjacent domains, ambiguous phrasing where a naive keyword match would trigger but shouldn't, and cases where the query touches on something the skill does but in a context where another tool is more appropriate.
The key thing to avoid: don't make should-not-trigger queries obviously irrelevant. "Write a fibonacci function" as a negative test for a PDF skill is too easy — it doesn't test anything. The negative cases should be genuinely tricky.
Step 2: Review with user
Present the eval set to the user for review using the HTML template:
- Read the template from assets/eval_review.html
- Replace the placeholders:
  - __EVAL_DATA_PLACEHOLDER__ → the JSON array of eval items (no quotes around it — it's a JS variable assignment)
  - __SKILL_NAME_PLACEHOLDER__ → the skill's name
  - __SKILL_DESCRIPTION_PLACEHOLDER__ → the skill's current description
- Write to a temp file (e.g., /tmp/eval_review_<skill-name>.html) and open it: open /tmp/eval_review_<skill-name>.html
- The user can edit queries, toggle should-trigger, add/remove entries, then click "Export Eval Set"
- The file downloads to ~/Downloads/eval_set.json — check the Downloads folder for the most recent version in case there are multiple (e.g., eval_set (1).json)
This step matters — bad eval queries lead to bad descriptions.
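The placeholder substitution in step 2 is plain string replacement. A minimal sketch — the template string below is a stand-in for the real assets/eval_review.html:

```python
import json

def fill_template(template: str, eval_items: list[dict],
                  skill_name: str, description: str) -> str:
    """Fill the three placeholders. EVAL_DATA is inserted as a raw JS
    value — no surrounding quotes — since it's a variable assignment."""
    return (template
            .replace("__EVAL_DATA_PLACEHOLDER__", json.dumps(eval_items))
            .replace("__SKILL_NAME_PLACEHOLDER__", skill_name)
            .replace("__SKILL_DESCRIPTION_PLACEHOLDER__", description))

html = fill_template(
    "const evals = __EVAL_DATA_PLACEHOLDER__;  // __SKILL_NAME_PLACEHOLDER__",
    [{"query": "q", "should_trigger": True}], "my-skill", "does things")
```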
Step 3: Run the optimization loop
Tell the user: "This will take some time — I'll run the optimization loop in the background and check on it periodically."
Save the eval set to the workspace, then run in the background:

```bash
python -m scripts.run_loop \
  --eval-set <path-to-trigger-eval.json> \
  --skill-path <path-to-skill> \
  --model <model-id-powering-this-session> \
  --max-iterations 5 \
  --verbose
```

Use the model ID from your system prompt (the one powering the current session) so the triggering test matches what the user actually experiences.
While it runs, periodically tail the output to give the user updates on which iteration it's on and what the scores look like.
This handles the full optimization loop automatically. It splits the eval set into 60% train and 40% held-out test, evaluates the current description (running each query 3 times to get a reliable trigger rate), then calls Claude with extended thinking to propose improvements based on what failed. It re-evaluates each new description on both train and test, iterating up to 5 times. When it's done, it opens an HTML report in the browser showing the results per iteration and returns JSON with best_description — selected by test score rather than train score to avoid overfitting.
How skill triggering works
Understanding the triggering mechanism helps design better eval queries. Skills appear in Claude's available_skills list with their name + description, and Claude decides whether to consult a skill based on that description. The important thing to know is that Claude only consults skills for tasks it can't easily handle on its own — simple, one-step queries like "read this PDF" may not trigger a skill even if the description matches perfectly, because Claude can handle them directly with basic tools. Complex, multi-step, or specialized queries reliably trigger skills when the description matches.
This means your eval queries should be substantive enough that Claude would actually benefit from consulting a skill. Simple queries like "read file X" are poor test cases — they won't trigger skills regardless of description quality.
Step 4: Apply the result
Take best_description from the JSON output and update the skill's SKILL.md frontmatter. Show the user before/after and report the scores.
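Applying best_description is a small frontmatter edit. A hedged sketch — it assumes a single-line description: field; a real YAML parser is safer in general:

```python
import re

def apply_description(skill_md: str, new_description: str) -> str:
    """Rewrite the description: line inside SKILL.md's frontmatter.
    Splits on the first two --- markers, so the body is left untouched."""
    leading, front, body = skill_md.split("---", 2)
    front = re.sub(r"(?m)^description:.*$",
                   lambda _: f"description: {new_description}", front)
    return f"{leading}---{front}---{body}"
```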
Final Quality Check
Before packaging, run through references/quick_checklist.md to verify:
- All technical constraints met (naming, character limits, forbidden terms)
- Description follows the formula: [What it does] + [When to use] + [Trigger phrases]
- File structure correct (SKILL.md capitalization, kebab-case folders)
- Security requirements satisfied (no malware, no misleading functionality)
- Quantitative success criteria achieved (90%+ trigger rate, efficient tool usage)
- Design principles applied (Progressive Disclosure, Composability, Portability)
This checklist helps catch common issues before publication.
Package and Present (only if present_files tool is available)
Check whether you have access to the present_files tool. If you don't, skip this step. If you do, package the skill and present the .skill file to the user:
```bash
python -m scripts.package_skill <path/to/skill-folder>
```

After packaging, direct the user to the resulting .skill file path so they can install it.
Claude.ai-specific instructions
In Claude.ai, the core workflow is the same (draft → test → review → improve → repeat), but because Claude.ai doesn't have subagents, some mechanics change. Here's what to adapt:
Running test cases: No subagents means no parallel execution. For each test case, read the skill's SKILL.md, then follow its instructions to accomplish the test prompt yourself. Do them one at a time. This is less rigorous than independent subagents (you wrote the skill and you're also running it, so you have full context), but it's a useful sanity check — and the human review step compensates. Skip the baseline runs — just use the skill to complete the task as requested.
Reviewing results: If you can't open a browser (e.g., Claude.ai's VM has no display, or you're on a remote server), skip the browser reviewer entirely. Instead, present results directly in the conversation. For each test case, show the prompt and the output. If the output is a file the user needs to see (like a .docx or .xlsx), save it to the filesystem and tell them where it is so they can download and inspect it. Ask for feedback inline: "How does this look? Anything you'd change?"
Benchmarking: Skip the quantitative benchmarking — it relies on baseline comparisons which aren't meaningful without subagents. Focus on qualitative feedback from the user.
The iteration loop: Same as before — improve the skill, rerun the test cases, ask for feedback — just without the browser reviewer in the middle. You can still organize results into iteration directories on the filesystem if you have one.
Description optimization: This section requires the claude CLI tool (specifically claude -p) which is only available in Claude Code. Skip it if you're on Claude.ai.
Blind comparison: Requires subagents. Skip it.
Packaging: The package_skill.py script works anywhere with Python and a filesystem. On Claude.ai, you can run it and the user can download the resulting .skill file.
Cowork-Specific Instructions
If you're in Cowork, the main things to know are:
- You have subagents, so the main workflow (spawn test cases in parallel, run baselines, grade, etc.) all works. (However, if you run into severe problems with timeouts, it's OK to run the test prompts in series rather than parallel.)
- You don't have a browser or display, so when generating the eval viewer, use --static <output_path> to write a standalone HTML file instead of starting a server. Then proffer a link that the user can click to open the HTML in their browser.
- For whatever reason, the Cowork setup seems to disincline Claude from generating the eval viewer after running the tests, so just to reiterate: whether you're in Cowork or in Claude Code, after running tests, you should always generate the eval viewer for the human to look at examples before revising the skill yourself and trying to make corrections, using generate_review.py (not writing your own boutique html code). Sorry in advance but I'm gonna go all caps here: GENERATE THE EVAL VIEWER BEFORE evaluating inputs yourself. You want to get them in front of the human ASAP!
- Feedback works differently: since there's no running server, the viewer's "Submit All Reviews" button will download feedback.json as a file. You can then read it from there (you may have to request access first).
- Packaging works — package_skill.py just needs Python and a filesystem.
- Description optimization (run_loop.py / run_eval.py) should work in Cowork just fine since it uses claude -p via subprocess, not a browser, but please save it until you've fully finished making the skill and the user agrees it's in good shape.
Reference files
The agents/ directory contains instructions for specialized subagents. Read them when you need to spawn the relevant subagent.
- agents/grader.md — How to evaluate assertions against outputs
- agents/comparator.md — How to do blind A/B comparison between two outputs
- agents/analyzer.md — How to analyze why one version beat another
The references/ directory has additional documentation:
- references/design_principles.md — Core design principles (Progressive Disclosure, Composability, Portability) and three common use case patterns (Document Creation, Workflow Automation, MCP Enhancement)
- references/constraints_and_rules.md — Technical constraints, naming conventions, security requirements, and quantitative success criteria
- references/quick_checklist.md — Comprehensive pre-publication checklist covering file structure, frontmatter, testing, and quality tiers
- references/schemas.md — JSON structures for evals.json, grading.json, etc.
Repeating one more time the core loop here for emphasis:
- Figure out what the skill is about
- Draft or edit the skill
- Run claude-with-access-to-the-skill on test prompts
- With the user, evaluate the outputs:
- Create benchmark.json and run eval-viewer/generate_review.py to help the user review them
- Run quantitative evals
- Repeat until you and the user are satisfied
- Package the final skill and return it to the user.
Please add steps to your TodoList, if you have such a thing, to make sure you don't forget. If you're in Cowork, please specifically put "Create evals JSON and run eval-viewer/generate_review.py so human can review test cases" in your TodoList to make sure it happens.
Good luck!