Skill Creator
A skill for creating new skills and iteratively improving them.
At a high level, the process of creating a skill goes like this:
  • Decide what you want the skill to do and roughly how it should do it
  • Write a draft of the skill
  • Create a few test prompts and run claude-with-access-to-the-skill on them
  • Help the user evaluate the results both qualitatively and quantitatively
    • While the runs happen in the background, draft quantitative evals if there aren't any (if some exist, use them as-is or modify them if something needs to change). Then explain them to the user (or, if they already existed, explain the existing ones)
    • Use the
      eval-viewer/generate_review.py
      script to show the user the results for them to look at, and also let them look at the quantitative metrics
  • Rewrite the skill based on feedback from the user's evaluation of the results (and also if there are any glaring flaws that become apparent from the quantitative benchmarks)
  • Repeat until you're satisfied
  • Expand the test set and try again at larger scale
Your job when using this skill is to figure out where the user is in this process and then jump in and help them progress through these stages. So for instance, maybe they're like "I want to make a skill for X". You can help narrow down what they mean, write a draft, write the test cases, figure out how they want to evaluate, run all the prompts, and repeat.
On the other hand, maybe they already have a draft of the skill. In this case you can go straight to the eval/iterate part of the loop.
Of course, you should always be flexible and if the user is like "I don't need to run a bunch of evaluations, just vibe with me", you can do that instead.
Then after the skill is done (but again, the order is flexible), you can also run the skill description improver, which we have a whole separate script for, to optimize the triggering of the skill.
Cool? Cool.

Communicating with the user


The skill creator is liable to be used by people across a wide range of familiarity with coding jargon. If you haven't heard (and how could you, it started only very recently), there's a trend where the power of Claude is inspiring plumbers to open up their terminals, and parents and grandparents to google "how to install npm". On the other hand, the bulk of users are probably fairly computer-literate.
So please pay attention to context cues to understand how to phrase your communication! In the default case, just to give you some idea:
  • "evaluation" and "benchmark" are borderline, but OK
  • for "JSON" and "assertion" you want to see serious cues from the user that they know what those things are before using them without explaining them
If you're unsure whether the user will get a term, it's fine to clarify it briefly with a short definition.


Creating a skill


Capture Intent


Start by understanding the user's intent. The current conversation might already contain a workflow the user wants to capture (e.g., they say "turn this into a skill"). If so, extract answers from the conversation history first — the tools used, the sequence of steps, corrections the user made, input/output formats observed. Ask the user to fill any gaps, and get their confirmation before proceeding to the next step.
  1. What should this skill enable Claude to do?
  2. When should this skill trigger? (what user phrases/contexts)
  3. What's the expected output format?
  4. Should we set up test cases to verify the skill works? Skills with objectively verifiable outputs (file transforms, data extraction, code generation, fixed workflow steps) benefit from test cases. Skills with subjective outputs (writing style, art) often don't need them. Suggest the appropriate default based on the skill type, but let the user decide.

Interview and Research


Proactively ask questions about edge cases, input/output formats, example files, success criteria, and dependencies. Wait to write test prompts until you've got this part ironed out.
Check available MCPs. If any are useful for research (searching docs, finding similar skills, looking up best practices), research in parallel via subagents if available, otherwise inline. Come prepared with context to reduce the burden on the user.

Write the SKILL.md


Based on the user interview, fill in these components:
  • name: Skill identifier
  • description: When to trigger, what it does. This is the primary triggering mechanism - include both what the skill does AND specific contexts for when to use it. All "when to use" info goes here, not in the body. Note: currently Claude has a tendency to "undertrigger" skills -- to not use them when they'd be useful. To combat this, please make the skill descriptions a little bit "pushy". So for instance, instead of "How to build a simple fast dashboard to display internal Anthropic data.", you might write "How to build a simple fast dashboard to display internal Anthropic data. Make sure to use this skill whenever the user mentions dashboards, data visualization, internal metrics, or wants to display any kind of company data, even if they don't explicitly ask for a 'dashboard.'"
  • compatibility: Required tools, dependencies (optional, rarely needed)
  • the rest of the skill :)

Skill Writing Guide


Anatomy of a Skill


skill-name/
├── SKILL.md (required)
│   ├── YAML frontmatter (name, description required)
│   └── Markdown instructions
└── Bundled Resources (optional)
    ├── scripts/    - Executable code for deterministic/repetitive tasks
    ├── references/ - Docs loaded into context as needed
    └── assets/     - Files used in output (templates, icons, fonts)

Progressive Disclosure


Skills use a three-level loading system:
  1. Metadata (name + description) - Always in context (~100 words)
  2. SKILL.md body - In context whenever skill triggers (<500 lines ideal)
  3. Bundled resources - As needed (unlimited, scripts can execute without loading)
These word counts are approximate and you can feel free to go longer if needed.
Key patterns:
  • Keep SKILL.md under 500 lines; if you're approaching this limit, add an additional layer of hierarchy along with clear pointers about where the model using the skill should go next to follow up.
  • Reference files clearly from SKILL.md with guidance on when to read them
  • For large reference files (>300 lines), include a table of contents
Domain organization: When a skill supports multiple domains/frameworks, organize by variant:
cloud-deploy/
├── SKILL.md (workflow + selection)
└── references/
    ├── aws.md
    ├── gcp.md
    └── azure.md
Claude reads only the relevant reference file.

Principle of Lack of Surprise


This goes without saying, but skills must not contain malware, exploit code, or any content that could compromise system security. A skill's contents should not surprise the user in their intent if described. Don't go along with requests to create misleading skills or skills designed to facilitate unauthorized access, data exfiltration, or other malicious activities. Things like a "roleplay as an XYZ" are OK though.

Writing Patterns


Prefer using the imperative form in instructions.
**Defining output formats** - You can do it like this:

```markdown
## Report structure

ALWAYS use this exact template:

# [Title]
## Executive summary
## Key findings
## Recommendations
```

**Examples pattern** - It's useful to include examples. You can format them like this (but if "Input" and "Output" are in the examples you might want to deviate a little):

```markdown
## Commit message format

Example 1:
Input: Added user authentication with JWT tokens
Output: feat(auth): implement JWT-based authentication
```

Writing Style


Try to explain to the model why things are important in lieu of heavy-handed musty MUSTs. Use theory of mind and try to make the skill general and not super-narrow to specific examples. Start by writing a draft and then look at it with fresh eyes and improve it.

Test Cases


After writing the skill draft, come up with 2-3 realistic test prompts — the kind of thing a real user would actually say. Share them with the user: [you don't have to use this exact language] "Here are a few test cases I'd like to try. Do these look right, or do you want to add more?" Then run them.
Save test cases to evals/evals.json. Don't write assertions yet — just the prompts. You'll draft assertions in the next step while the runs are in progress.

```json
{
  "skill_name": "example-skill",
  "evals": [
    {
      "id": 1,
      "prompt": "User's task prompt",
      "expected_output": "Description of expected result",
      "files": []
    }
  ]
}
```

See references/schemas.md for the full schema (including the assertions field, which you'll add later).
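The scaffold step can be sketched as a small helper — this is a sketch, not part of the skill-creator scripts; the function name and the idea of generating placeholder entries are assumptions, while the file layout and field names follow the schema above:

```python
import json
from pathlib import Path

def write_eval_scaffold(skill_name, prompts, out_dir="evals"):
    """Write an evals.json containing prompts only; assertions come later."""
    evals = [
        {"id": i, "prompt": p, "expected_output": "", "files": []}
        for i, p in enumerate(prompts, start=1)
    ]
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    path = out / "evals.json"
    path.write_text(json.dumps({"skill_name": skill_name, "evals": evals}, indent=2))
    return path
```

Leaving expected_output empty keeps the draft honest: you fill it in while discussing each test case with the user.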

Running and evaluating test cases


This section is one continuous sequence — don't stop partway through. Do NOT use /skill-test or any other testing skill.
Put results in <skill-name>-workspace/ as a sibling to the skill directory. Within the workspace, organize results by iteration (iteration-1/, iteration-2/, etc.), and within that, each test case gets a directory (eval-0/, eval-1/, etc.). Don't create all of this upfront — just create directories as you go.

Step 1: Spawn all runs (with-skill AND baseline) in the same turn


For each test case, spawn two subagents in the same turn — one with the skill, one without. This is important: don't spawn the with-skill runs first and then come back for baselines later. Launch everything at once so it all finishes around the same time.
With-skill run:

```
Execute this task:
- Skill path: <path-to-skill>
- Task: <eval prompt>
- Input files: <eval files if any, or "none">
- Save outputs to: <workspace>/iteration-<N>/eval-<ID>/with_skill/outputs/
- Outputs to save: <what the user cares about — e.g., "the .docx file", "the final CSV">
```

Baseline run (same prompt, but the baseline depends on context):
  • Creating a new skill: no skill at all. Same prompt, no skill path, save to without_skill/outputs/.
  • Improving an existing skill: the old version. Before editing, snapshot the skill (cp -r <skill-path> <workspace>/skill-snapshot/), then point the baseline subagent at the snapshot. Save to old_skill/outputs/.
Write an eval_metadata.json for each test case (assertions can be empty for now). Give each eval a descriptive name based on what it's testing — not just "eval-0". Use this name for the directory too. If this iteration uses new or modified eval prompts, create these files for each new eval directory — don't assume they carry over from previous iterations.

```json
{
  "eval_id": 0,
  "eval_name": "descriptive-name-here",
  "prompt": "The user's task prompt",
  "assertions": []
}
```
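A sketch of the metadata step, assuming the workspace layout described above (the helper itself is hypothetical; the field names match the schema):

```python
import json
from pathlib import Path

def write_eval_metadata(workspace, iteration, eval_id, eval_name, prompt):
    """Create iteration-<N>/<eval_name>/eval_metadata.json with empty assertions."""
    eval_dir = Path(workspace) / f"iteration-{iteration}" / eval_name
    eval_dir.mkdir(parents=True, exist_ok=True)
    meta = {
        "eval_id": eval_id,
        "eval_name": eval_name,  # descriptive, doubles as the directory name
        "prompt": prompt,
        "assertions": [],  # filled in during Step 2 while runs are in progress
    }
    path = eval_dir / "eval_metadata.json"
    path.write_text(json.dumps(meta, indent=2))
    return path
```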

Step 2: While runs are in progress, draft assertions


Don't just wait for the runs to finish — you can use this time productively. Draft quantitative assertions for each test case and explain them to the user. If assertions already exist in evals/evals.json, review them and explain what they check.
Good assertions are objectively verifiable and have descriptive names — they should read clearly in the benchmark viewer so someone glancing at the results immediately understands what each one checks. Subjective skills (writing style, design quality) are better evaluated qualitatively — don't force assertions onto things that need human judgment.
Update the eval_metadata.json files and evals/evals.json with the assertions once drafted. Also explain to the user what they'll see in the viewer — both the qualitative outputs and the quantitative benchmark.

Step 3: As runs complete, capture timing data


When each subagent task completes, you receive a notification containing total_tokens and duration_ms. Save this data immediately to timing.json in the run directory:

```json
{
  "total_tokens": 84852,
  "duration_ms": 23332,
  "total_duration_seconds": 23.3
}
```
This is the only opportunity to capture this data — it comes through the task notification and isn't persisted elsewhere. Process each notification as it arrives rather than trying to batch them.
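Persisting a notification can be sketched like this (the notification dict shape is an assumption based on the fields named above; total_duration_seconds is derived for readability):

```python
import json
from pathlib import Path

def save_timing(run_dir, notification):
    """Write total_tokens/duration_ms from a task notification to timing.json."""
    timing = {
        "total_tokens": notification["total_tokens"],
        "duration_ms": notification["duration_ms"],
        # Derived convenience field matching the example above
        "total_duration_seconds": round(notification["duration_ms"] / 1000, 1),
    }
    run_path = Path(run_dir)
    run_path.mkdir(parents=True, exist_ok=True)
    (run_path / "timing.json").write_text(json.dumps(timing, indent=2))
    return timing
```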

Step 4: Grade, aggregate, and launch the viewer


Once all runs are done:
  1. Grade each run — spawn a grader subagent (or grade inline) that reads agents/grader.md and evaluates each assertion against the outputs. Save results to grading.json in each run directory. The grading.json expectations array must use the fields text, passed, and evidence (not name/met/details or other variants) — the viewer depends on these exact field names. For assertions that can be checked programmatically, write and run a script rather than eyeballing it — scripts are faster, more reliable, and can be reused across iterations.
  2. Aggregate into benchmark — run the aggregation script from the skill-creator directory:

```bash
python -m scripts.aggregate_benchmark <workspace>/iteration-N --skill-name <name>
```

    This produces benchmark.json and benchmark.md with pass_rate, time, and tokens for each configuration, with mean ± stddev and the delta. If generating benchmark.json manually, see references/schemas.md for the exact schema the viewer expects. Put each with_skill version before its baseline counterpart.
  3. Do an analyst pass — read the benchmark data and surface patterns the aggregate stats might hide. See agents/analyzer.md (the "Analyzing Benchmark Results" section) for what to look for — things like assertions that always pass regardless of skill (non-discriminating), high-variance evals (possibly flaky), and time/token tradeoffs.
  4. Launch the viewer with both qualitative outputs and quantitative data:

```bash
nohup python <skill-creator-path>/eval-viewer/generate_review.py \
  <workspace>/iteration-N \
  --skill-name "my-skill" \
  --benchmark <workspace>/iteration-N/benchmark.json \
  > /dev/null 2>&1 &
VIEWER_PID=$!
```

    For iteration 2+, also pass --previous-workspace <workspace>/iteration-<N-1>.
    Cowork / headless environments: If webbrowser.open() is not available or the environment has no display, use --static <output_path> to write a standalone HTML file instead of starting a server. Feedback will be downloaded as a feedback.json file when the user clicks "Submit All Reviews". After download, copy feedback.json into the workspace directory for the next iteration to pick up.
Note: please use generate_review.py to create the viewer; there's no need to write custom HTML.
  5. Tell the user something like: "I've opened the results in your browser. There are two tabs — 'Outputs' lets you click through each test case and leave feedback, 'Benchmark' shows the quantitative comparison. When you're done, come back here and let me know."
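To make the grading field-name requirement concrete, here's a sketch of a programmatic grader's output step (the helper and its tuple input are hypothetical; only the text/passed/evidence keys are fixed by the viewer):

```python
import json
from pathlib import Path

def write_grading(run_dir, results):
    """results: list of (assertion_text, passed_bool, evidence_str) tuples.

    Emits the exact field names the viewer expects: text, passed, evidence.
    """
    grading = {
        "expectations": [
            {"text": t, "passed": p, "evidence": e} for t, p, e in results
        ]
    }
    run_path = Path(run_dir)
    run_path.mkdir(parents=True, exist_ok=True)
    (run_path / "grading.json").write_text(json.dumps(grading, indent=2))
    return grading
```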

What the user sees in the viewer


The "Outputs" tab shows one test case at a time:
  • Prompt: the task that was given
  • Output: the files the skill produced, rendered inline where possible
  • Previous Output (iteration 2+): collapsed section showing last iteration's output
  • Formal Grades (if grading was run): collapsed section showing assertion pass/fail
  • Feedback: a textbox that auto-saves as they type
  • Previous Feedback (iteration 2+): their comments from last time, shown below the textbox
The "Benchmark" tab shows the stats summary: pass rates, timing, and token usage for each configuration, with per-eval breakdowns and analyst observations.
Navigation is via prev/next buttons or arrow keys. When done, they click "Submit All Reviews", which saves all feedback to feedback.json.
Step 5: Read the feedback


When the user tells you they're done, read feedback.json:

```json
{
  "reviews": [
    {"run_id": "eval-0-with_skill", "feedback": "the chart is missing axis labels", "timestamp": "..."},
    {"run_id": "eval-1-with_skill", "feedback": "", "timestamp": "..."},
    {"run_id": "eval-2-with_skill", "feedback": "perfect, love this", "timestamp": "..."}
  ],
  "status": "complete"
}
```

Empty feedback means the user thought it was fine. Focus your improvements on the test cases where the user had specific complaints.
Kill the viewer server when you're done with it:

```bash
kill $VIEWER_PID 2>/dev/null
```
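Filtering the feedback down to actionable items can be sketched as (file shape as above; the helper name is an assumption):

```python
import json
from pathlib import Path

def actionable_feedback(workspace):
    """Return only the reviews where the user left a specific complaint."""
    data = json.loads((Path(workspace) / "feedback.json").read_text())
    return [r for r in data["reviews"] if r["feedback"].strip()]
```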


Improving the skill


This is the heart of the loop. You've run the test cases, the user has reviewed the results, and now you need to make the skill better based on their feedback.

How to think about improvements


  1. Generalize from the feedback. The big picture thing that's happening here is that we're trying to create skills that can be used a million times (maybe literally, maybe even more who knows) across many different prompts. Here you and the user are iterating on only a few examples over and over again because it helps move faster. The user knows these examples in and out and it's quick for them to assess new outputs. But if the skill you and the user are codeveloping works only for those examples, it's useless. Rather than put in fiddly overfitty changes, or oppressively constrictive MUSTs, if there's some stubborn issue, you might try branching out and using different metaphors, or recommending different patterns of working. It's relatively cheap to try and maybe you'll land on something great.
  2. Keep the prompt lean. Remove things that aren't pulling their weight. Make sure to read the transcripts, not just the final outputs — if it looks like the skill is making the model waste a bunch of time doing things that are unproductive, you can try getting rid of the parts of the skill that are making it do that and seeing what happens.
  3. Explain the why. Try hard to explain the why behind everything you're asking the model to do. Today's LLMs are smart. They have good theory of mind and when given a good harness can go beyond rote instructions and really make things happen. Even if the feedback from the user is terse or frustrated, try to actually understand the task, why the user wrote what they wrote, and what they actually wrote, and then transmit this understanding into the instructions. If you find yourself writing ALWAYS or NEVER in all caps, or using super rigid structures, that's a yellow flag — if possible, reframe and explain the reasoning so that the model understands why the thing you're asking for is important. That's a more humane, powerful, and effective approach.
  4. Look for repeated work across test cases. Read the transcripts from the test runs and notice if the subagents all independently wrote similar helper scripts or took the same multi-step approach to something. If all 3 test cases resulted in the subagent writing a create_docx.py or a build_chart.py, that's a strong signal the skill should bundle that script. Write it once, put it in scripts/, and tell the skill to use it. This saves every future invocation from reinventing the wheel.
This task is pretty important (we are trying to create billions a year in economic value here!) and your thinking time is not the blocker; take your time and really mull things over. I'd suggest writing a draft revision and then looking at it anew and making improvements. Really do your best to get into the head of the user and understand what they want and need.

The iteration loop


After improving the skill:
  1. Apply your improvements to the skill
  2. Rerun all test cases into a new iteration-<N+1>/ directory, including baseline runs. If you're creating a new skill, the baseline is always without_skill (no skill) — that stays the same across iterations. If you're improving an existing skill, use your judgment on what makes sense as the baseline: the original version the user came in with, or the previous iteration.
  3. Launch the reviewer with --previous-workspace pointing at the previous iteration
  4. Wait for the user to review and tell you they're done
  5. Read the new feedback, improve again, repeat
Keep going until:
  • The user says they're happy
  • The feedback is all empty (everything looks good)
  • You're not making meaningful progress


Advanced: Blind comparison


For situations where you want a more rigorous comparison between two versions of a skill (e.g., the user asks "is the new version actually better?"), there's a blind comparison system. Read agents/comparator.md and agents/analyzer.md for the details. The basic idea is: give two outputs to an independent agent without telling it which is which, and let it judge quality. Then analyze why the winner won.
This is optional, requires subagents, and most users won't need it. The human review loop is usually sufficient.


Description Optimization


The description field in SKILL.md frontmatter is the primary mechanism that determines whether Claude invokes a skill. After creating or improving a skill, offer to optimize the description for better triggering accuracy.

Step 1: Generate trigger eval queries


Create 20 eval queries — a mix of should-trigger and should-not-trigger. Save as JSON:

```json
[
  {"query": "the user prompt", "should_trigger": true},
  {"query": "another prompt", "should_trigger": false}
]
```
The queries must be realistic and something a Claude Code or Claude.ai user would actually type. Not abstract requests, but requests that are concrete and specific and have a good amount of detail. For instance, file paths, personal context about the user's job or situation, column names and values, company names, URLs. A little bit of backstory. Some might be in lowercase or contain abbreviations or typos or casual speech. Use a mix of different lengths, and focus on edge cases rather than making them clear-cut (the user will get a chance to sign off on them).
Bad: "Format this data", "Extract text from PDF", "Create a chart"
Good: "ok so my boss just sent me this xlsx file (its in my downloads, called something like 'Q4 sales final FINAL v2.xlsx') and she wants me to add a column that shows the profit margin as a percentage. The revenue is in column C and costs are in column D i think"
For the should-trigger queries (8-10), think about coverage. You want different phrasings of the same intent — some formal, some casual. Include cases where the user doesn't explicitly name the skill or file type but clearly needs it. Throw in some uncommon use cases and cases where this skill competes with another but should win.
For the should-not-trigger queries (8-10), the most valuable ones are the near-misses — queries that share keywords or concepts with the skill but actually need something different. Think adjacent domains, ambiguous phrasing where a naive keyword match would trigger but shouldn't, and cases where the query touches on something the skill does but in a context where another tool is more appropriate.
The key thing to avoid: don't make should-not-trigger queries obviously irrelevant. "Write a fibonacci function" as a negative test for a PDF skill is too easy — it doesn't test anything. The negative cases should be genuinely tricky.
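Before handing the set to the user, a quick schema check catches malformed entries. A minimal sketch — the helper name is ours, not part of the skill-creator scripts:

```python
import json

def validate_eval_set(path):
    """Sanity-check a trigger eval set before showing it to the user."""
    with open(path) as f:
        items = json.load(f)
    assert isinstance(items, list), "eval set must be a JSON array"
    for item in items:
        assert set(item) == {"query", "should_trigger"}, f"unexpected keys: {sorted(item)}"
        assert isinstance(item["query"], str) and item["query"].strip(), "empty query"
        assert isinstance(item["should_trigger"], bool), "should_trigger must be a bool"
    positives = sum(item["should_trigger"] for item in items)
    # Aim for roughly 8-10 of each, ~20 total.
    print(f"{len(items)} queries: {positives} should-trigger, "
          f"{len(items) - positives} should-not")
    return items
```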

Step 2: Review with user


Present the eval set to the user for review using the HTML template:
  1. Read the template from `assets/eval_review.html`
  2. Replace the placeholders:
    • `__EVAL_DATA_PLACEHOLDER__` → the JSON array of eval items (no quotes around it — it's a JS variable assignment)
    • `__SKILL_NAME_PLACEHOLDER__` → the skill's name
    • `__SKILL_DESCRIPTION_PLACEHOLDER__` → the skill's current description
  3. Write to a temp file (e.g., `/tmp/eval_review_<skill-name>.html`) and open it: `open /tmp/eval_review_<skill-name>.html`
  4. The user can edit queries, toggle should-trigger, add/remove entries, then click "Export Eval Set"
  5. The file downloads to `~/Downloads/eval_set.json` — check the Downloads folder for the most recent version in case there are multiple (e.g., `eval_set (1).json`)
This step matters — bad eval queries lead to bad descriptions.
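The placeholder substitution above is plain string replacement. A minimal sketch, with an illustrative function name:

```python
import json

def fill_template(template: str, eval_items, skill_name: str, description: str) -> str:
    """Substitute the three placeholders in assets/eval_review.html.
    The eval data is inserted as a bare JSON array (it becomes a JS
    variable assignment in the template), not as a quoted string."""
    return (template
            .replace("__EVAL_DATA_PLACEHOLDER__", json.dumps(eval_items, ensure_ascii=False))
            .replace("__SKILL_NAME_PLACEHOLDER__", skill_name)
            .replace("__SKILL_DESCRIPTION_PLACEHOLDER__", description))
```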

Step 3: Run the optimization loop


Tell the user: "This will take some time — I'll run the optimization loop in the background and check on it periodically."
Save the eval set to the workspace, then run in the background:
```bash
python -m scripts.run_loop \
  --eval-set <path-to-trigger-eval.json> \
  --skill-path <path-to-skill> \
  --model <model-id-powering-this-session> \
  --max-iterations 5 \
  --verbose
```
Use the model ID from your system prompt (the one powering the current session) so the triggering test matches what the user actually experiences.
While it runs, periodically tail the output to give the user updates on which iteration it's on and what the scores look like.
This handles the full optimization loop automatically. It splits the eval set into 60% train and 40% held-out test, evaluates the current description (running each query 3 times to get a reliable trigger rate), then calls Claude with extended thinking to propose improvements based on what failed. It re-evaluates each new description on both train and test, iterating up to 5 times. When it's done, it opens an HTML report in the browser showing the results per iteration and returns JSON with `best_description` — selected by test score rather than train score to avoid overfitting.
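A rough sketch of the split-and-score logic described above (the actual `scripts.run_loop` implementation may differ in detail):

```python
import random

def split_eval_set(items, train_frac=0.6, seed=0):
    """Shuffle and split eval queries into train / held-out test (60/40)."""
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

def score(results):
    """results: list of (should_trigger, [run1, run2, run3]) where each run
    is 1 if the skill triggered. A query passes if the majority outcome of
    its 3 runs matches its label; the score is the fraction that pass."""
    correct = 0
    for should, runs in results:
        triggered = sum(runs) >= 2  # majority of 3 runs
        correct += (triggered == should)
    return correct / len(results)
```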

How skill triggering works


Understanding the triggering mechanism helps design better eval queries. Skills appear in Claude's `available_skills` list with their name + description, and Claude decides whether to consult a skill based on that description. The important thing to know is that Claude only consults skills for tasks it can't easily handle on its own — simple, one-step queries like "read this PDF" may not trigger a skill even if the description matches perfectly, because Claude can handle them directly with basic tools. Complex, multi-step, or specialized queries reliably trigger skills when the description matches.
This means your eval queries should be substantive enough that Claude would actually benefit from consulting a skill. Simple queries like "read file X" are poor test cases — they won't trigger skills regardless of description quality.
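To illustrate (both queries are invented for this example), a one-step query makes a weak positive test even when the description matches, while a multi-step one actually exercises the trigger:

```json
[
  {"query": "read this PDF", "should_trigger": false},
  {"query": "pull the tables out of invoices_2024.pdf, merge them into one spreadsheet, and flag any rows where the total doesn't match qty times unit price", "should_trigger": true}
]
```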

Step 4: Apply the result


Take `best_description` from the JSON output and update the skill's SKILL.md frontmatter. Show the user before/after and report the scores.
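Updating the frontmatter can be as simple as a one-line substitution, assuming a single-line `description:` field (use a proper YAML parser for multi-line values). The helper name is ours:

```python
import re

def update_description(skill_md: str, new_description: str) -> str:
    """Replace the description: field in SKILL.md YAML frontmatter.
    Uses a lambda replacement so backslashes in the new text are literal."""
    return re.sub(r"(?m)^description:.*$",
                  lambda m: f"description: {new_description}",
                  skill_md, count=1)
```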


Package and Present (only if `present_files` tool is available)


Check whether you have access to the `present_files` tool. If you don't, skip this step. If you do, package the skill and present the .skill file to the user:

```bash
python -m scripts.package_skill <path/to/skill-folder>
```
After packaging, direct the user to the resulting `.skill` file path so they can install it.


Claude.ai-specific instructions


In Claude.ai, the core workflow is the same (draft → test → review → improve → repeat), but because Claude.ai doesn't have subagents, some mechanics change. Here's what to adapt:
Running test cases: No subagents means no parallel execution. For each test case, read the skill's SKILL.md, then follow its instructions to accomplish the test prompt yourself. Do them one at a time. This is less rigorous than independent subagents (you wrote the skill and you're also running it, so you have full context), but it's a useful sanity check — and the human review step compensates. Skip the baseline runs — just use the skill to complete the task as requested.
Reviewing results: If you can't open a browser (e.g., Claude.ai's VM has no display, or you're on a remote server), skip the browser reviewer entirely. Instead, present results directly in the conversation. For each test case, show the prompt and the output. If the output is a file the user needs to see (like a .docx or .xlsx), save it to the filesystem and tell them where it is so they can download and inspect it. Ask for feedback inline: "How does this look? Anything you'd change?"
Benchmarking: Skip the quantitative benchmarking — it relies on baseline comparisons which aren't meaningful without subagents. Focus on qualitative feedback from the user.
The iteration loop: Same as before — improve the skill, rerun the test cases, ask for feedback — just without the browser reviewer in the middle. You can still organize results into iteration directories on the filesystem if you have one.
Description optimization: This section requires the `claude` CLI tool (specifically `claude -p`), which is only available in Claude Code. Skip it if you're on Claude.ai.
Blind comparison: Requires subagents. Skip it.
Packaging: The `package_skill.py` script works anywhere with Python and a filesystem. On Claude.ai, you can run it and the user can download the resulting `.skill` file.


Cowork-Specific Instructions


If you're in Cowork, the main things to know are:
  • You have subagents, so the main workflow (spawn test cases in parallel, run baselines, grade, etc.) all works. (However, if you run into severe problems with timeouts, it's OK to run the test prompts in series rather than parallel.)
  • You don't have a browser or display, so when generating the eval viewer, use `--static <output_path>` to write a standalone HTML file instead of starting a server. Then offer the user a link they can click to open the HTML in their browser.
  • For whatever reason, the Cowork setup seems to disincline Claude from generating the eval viewer after running the tests, so just to reiterate: whether you're in Cowork or in Claude Code, after running tests, always generate the eval viewer with `generate_review.py` (not your own boutique HTML code) so the human can look at examples before you revise the skill and attempt corrections yourself. Sorry in advance, but I'm gonna go all caps here: GENERATE THE EVAL VIEWER BEFORE evaluating outputs yourself. You want to get them in front of the human ASAP!
  • Feedback works differently: since there's no running server, the viewer's "Submit All Reviews" button downloads `feedback.json` as a file. You can then read it from there (you may have to request access first).
  • Packaging works — `package_skill.py` just needs Python and a filesystem.
  • Description optimization (`run_loop.py` / `run_eval.py`) should work in Cowork just fine since it uses `claude -p` via subprocess, not a browser, but save it until you've fully finished making the skill and the user agrees it's in good shape.
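Since the feedback lands in the Downloads folder rather than coming back over HTTP, a small helper (the name is ours) can grab the newest copy, accounting for browser dedup suffixes like `feedback (1).json`:

```python
import json
from pathlib import Path

def latest_feedback(downloads: Path = Path.home() / "Downloads"):
    """Return the parsed contents of the most recently modified
    feedback*.json in the downloads folder, or None if none exists."""
    candidates = sorted(downloads.glob("feedback*.json"),
                        key=lambda p: p.stat().st_mtime, reverse=True)
    if not candidates:
        return None
    return json.loads(candidates[0].read_text())
```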


Reference files


The agents/ directory contains instructions for specialized subagents. Read them when you need to spawn the relevant subagent.
  • `agents/grader.md` — How to evaluate assertions against outputs
  • `agents/comparator.md` — How to do blind A/B comparison between two outputs
  • `agents/analyzer.md` — How to analyze why one version beat another
The references/ directory has additional documentation:
  • `references/schemas.md` — JSON structures for evals.json, grading.json, etc.

Repeating one more time the core loop here for emphasis:
  • Figure out what the skill is about
  • Draft or edit the skill
  • Run claude-with-access-to-the-skill on test prompts
  • With the user, evaluate the outputs:
    • Create benchmark.json and run `eval-viewer/generate_review.py` to help the user review them
    • Run quantitative evals
  • Repeat until you and the user are satisfied
  • Package the final skill and return it to the user.
Please add steps to your TodoList, if you have such a thing, to make sure you don't forget. If you're in Cowork, please specifically put "Create evals JSON and run `eval-viewer/generate_review.py` so human can review test cases" in your TodoList to make sure it happens.
Good luck!