skill-creator

Skill Creator

A skill for creating new skills and iteratively improving them.
At a high level, the process of creating a skill goes like this:
  • Decide what you want the skill to do and roughly how it should do it
  • Write a draft of the skill
  • Create a few test prompts and run claude-with-access-to-the-skill on them
  • Help the user evaluate the results both qualitatively and quantitatively
    • While the runs happen in the background, draft quantitative evals if there aren't any (if some already exist, use them as-is or modify them if something needs to change). Then explain them to the user (or, if they already existed, explain the existing ones)
    • Use the eval-viewer/generate_review.py script to show the user the results, and also let them look at the quantitative metrics
  • Rewrite the skill based on feedback from the user's evaluation of the results (and also if there are any glaring flaws that become apparent from the quantitative benchmarks)
  • Repeat until you're satisfied
  • Expand the test set and try again at larger scale
Your job when using this skill is to figure out where the user is in this process and then jump in and help them progress through these stages. So for instance, maybe they're like "I want to make a skill for X". You can help narrow down what they mean, write a draft, write the test cases, figure out how they want to evaluate, run all the prompts, and repeat.
On the other hand, maybe they already have a draft of the skill. In this case you can go straight to the eval/iterate part of the loop.
Of course, you should always be flexible and if the user is like "I don't need to run a bunch of evaluations, just vibe with me", you can do that instead.
Then after the skill is done (but again, the order is flexible), you can also run the skill description improver, which we have a whole separate script for, to optimize the triggering of the skill.
Cool? Cool.

Communicating with the user


The skill creator is liable to be used by people across a wide range of familiarity with coding jargon. If you haven't heard (and how could you, since this started only very recently), there's a trend now where the power of Claude is inspiring plumbers to open up their terminals, and parents and grandparents to google "how to install npm". On the other hand, the bulk of users are probably fairly computer-literate.
So please pay attention to context cues to understand how to phrase your communication! In the default case, just to give you some idea:
  • "evaluation" and "benchmark" are borderline, but OK
  • for "JSON" and "assertion" you want to see serious cues from the user that they know what those things are before using them without explaining them
If you're in doubt about whether the user will get a term, it's fine to clarify it briefly with a short definition.


Creating a skill


Capture Intent


Start by understanding the user's intent. The current conversation might already contain a workflow the user wants to capture (e.g., they say "turn this into a skill"). If so, extract answers from the conversation history first — the tools used, the sequence of steps, corrections the user made, input/output formats observed. Ask the user to fill any gaps, and confirm with them before proceeding to the next step.
  1. What should this skill enable Claude to do?
  2. When should this skill trigger? (what user phrases/contexts)
  3. What's the expected output format?
  4. Should we set up test cases to verify the skill works? Skills with objectively verifiable outputs (file transforms, data extraction, code generation, fixed workflow steps) benefit from test cases. Skills with subjective outputs (writing style, art) often don't need them. Suggest the appropriate default based on the skill type, but let the user decide.

Interview and Research


Proactively ask questions about edge cases, input/output formats, example files, success criteria, and dependencies. Wait to write test prompts until you've got this part ironed out.
Check the available MCPs. If any are useful for research (searching docs, finding similar skills, looking up best practices), do the research in parallel via subagents if available, otherwise inline. Come prepared with context to reduce the burden on the user.

Write the SKILL.md


Based on the user interview, fill in these components:
  • name: Skill identifier
  • description: When to trigger, what it does. This is the primary triggering mechanism - include both what the skill does AND specific contexts for when to use it. All "when to use" info goes here, not in the body. Note: currently Claude has a tendency to "undertrigger" skills -- to not use them when they'd be useful. To combat this, please make the skill descriptions a little bit "pushy". So for instance, instead of "How to build a simple fast dashboard to display internal Anthropic data.", you might write "How to build a simple fast dashboard to display internal Anthropic data. Make sure to use this skill whenever the user mentions dashboards, data visualization, internal metrics, or wants to display any kind of company data, even if they don't explicitly ask for a 'dashboard.'"
  • compatibility: Required tools, dependencies (optional, rarely needed)
  • the rest of the skill :)

Skill Writing Guide


Anatomy of a Skill


skill-name/
├── SKILL.md (required)
│   ├── YAML frontmatter (name, description required)
│   └── Markdown instructions
└── Bundled Resources (optional)
    ├── scripts/    - Executable code for deterministic/repetitive tasks
    ├── references/ - Docs loaded into context as needed
    └── assets/     - Files used in output (templates, icons, fonts)
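
To make this anatomy concrete, here's a minimal sketch of a SKILL.md (the skill name, description wording, and referenced file are illustrative examples, not from a real skill):

```markdown
---
name: dashboard-builder
description: Build a simple fast dashboard to display internal data. Use this skill whenever the user mentions dashboards, data visualization, or internal metrics.
---

# Dashboard Builder

Plain Markdown instructions for Claude go here.
When chart-type guidance is needed, read references/charts.md.
```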

Progressive Disclosure


Skills use a three-level loading system:
  1. Metadata (name + description) - Always in context (~100 words)
  2. SKILL.md body - In context whenever skill triggers (<500 lines ideal)
  3. Bundled resources - As needed (unlimited, scripts can execute without loading)
These word counts are approximate; feel free to go longer if needed.
Key patterns:
  • Keep SKILL.md under 500 lines; if you're approaching this limit, add an additional layer of hierarchy along with clear pointers about where the model using the skill should go next to follow up.
  • Reference files clearly from SKILL.md with guidance on when to read them
  • For large reference files (>300 lines), include a table of contents
Domain organization: When a skill supports multiple domains/frameworks, organize by variant:
cloud-deploy/
├── SKILL.md (workflow + selection)
└── references/
    ├── aws.md
    ├── gcp.md
    └── azure.md
Claude reads only the relevant reference file.

Principle of Lack of Surprise


This goes without saying, but skills must not contain malware, exploit code, or any content that could compromise system security. If a skill's contents were described to the user, its intent should not surprise them. Don't go along with requests to create misleading skills or skills designed to facilitate unauthorized access, data exfiltration, or other malicious activities. Things like a "roleplay as an XYZ" skill are OK, though.

Writing Patterns


Prefer using the imperative form in instructions.

**Defining output formats** - You can do it like this:

```markdown
## Report structure
ALWAYS use this exact template:
# [Title]
## Executive summary
## Key findings
## Recommendations
```

**Examples pattern** - It's useful to include examples. You can format them like this (but if "Input" and "Output" are in the examples you might want to deviate a little):

```markdown
## Commit message format
Example 1:
Input: Added user authentication with JWT tokens
Output: feat(auth): implement JWT-based authentication
```

Writing Style


Try to explain to the model why things are important in lieu of heavy-handed musty MUSTs. Use theory of mind and try to make the skill general and not super-narrow to specific examples. Start by writing a draft and then look at it with fresh eyes and improve it.

Test Cases


After writing the skill draft, come up with 2-3 realistic test prompts — the kind of thing a real user would actually say. Share them with the user: [you don't have to use this exact language] "Here are a few test cases I'd like to try. Do these look right, or do you want to add more?" Then run them.
Save test cases to evals/evals.json. Don't write assertions yet — just the prompts. You'll draft assertions in the next step while the runs are in progress.

```json
{
  "skill_name": "example-skill",
  "evals": [
    {
      "id": 1,
      "prompt": "User's task prompt",
      "expected_output": "Description of expected result",
      "files": []
    }
  ]
}
```

See references/schemas.md for the full schema (including the assertions field, which you'll add later).
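
Before running anything, a quick mechanical sanity check on the file can catch missing fields. This is a sketch, not part of the skill-creator tooling — `validate_evals` and its field set (mirroring the example above) are hypothetical:

```python
import json

# Fields each eval entry should carry, per the example above
REQUIRED_FIELDS = {"id", "prompt", "expected_output", "files"}

def validate_evals(data):
    """Return a list of problems found in an evals.json-style structure."""
    problems = []
    if "skill_name" not in data:
        problems.append("missing skill_name")
    for i, ev in enumerate(data.get("evals", [])):
        missing = REQUIRED_FIELDS - ev.keys()
        if missing:
            problems.append(f"eval {i}: missing {sorted(missing)}")
    return problems

sample = json.loads("""
{
  "skill_name": "example-skill",
  "evals": [
    {"id": 1, "prompt": "User's task prompt",
     "expected_output": "Description of expected result", "files": []}
  ]
}
""")
print(validate_evals(sample))  # → []
```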

Running and evaluating test cases


This section is one continuous sequence — don't stop partway through. Do NOT use /skill-test or any other testing skill.
Put results in <skill-name>-workspace/ as a sibling to the skill directory. Within the workspace, organize results by iteration (iteration-1/, iteration-2/, etc.) and within that, each test case gets a directory (eval-0/, eval-1/, etc.). Don't create all of this upfront — just create directories as you go.

Step 1: Spawn all runs (with-skill AND baseline) in the same turn


For each test case, spawn two subagents in the same turn — one with the skill, one without. This is important: don't spawn the with-skill runs first and then come back for baselines later. Launch everything at once so it all finishes around the same time.
With-skill run:
Execute this task:
- Skill path: <path-to-skill>
- Task: <eval prompt>
- Input files: <eval files if any, or "none">
- Save outputs to: <workspace>/iteration-<N>/eval-<ID>/with_skill/outputs/
- Outputs to save: <what the user cares about — e.g., "the .docx file", "the final CSV">
Baseline run (same prompt, but the baseline depends on context):
  • Creating a new skill: no skill at all. Same prompt, no skill path, save to without_skill/outputs/.
  • Improving an existing skill: the old version. Before editing, snapshot the skill (cp -r <skill-path> <workspace>/skill-snapshot/), then point the baseline subagent at the snapshot. Save to old_skill/outputs/.
Write an eval_metadata.json for each test case (assertions can be empty for now). Give each eval a descriptive name based on what it's testing — not just "eval-0". Use this name for the directory too. If this iteration uses new or modified eval prompts, create these files for each new eval directory — don't assume they carry over from previous iterations.

```json
{
  "eval_id": 0,
  "eval_name": "descriptive-name-here",
  "prompt": "The user's task prompt",
  "assertions": []
}
```

Step 2: While runs are in progress, draft assertions


Don't just wait for the runs to finish — you can use this time productively. Draft quantitative assertions for each test case and explain them to the user. If assertions already exist in evals/evals.json, review them and explain what they check.
Good assertions are objectively verifiable and have descriptive names — they should read clearly in the benchmark viewer so someone glancing at the results immediately understands what each one checks. Subjective skills (writing style, design quality) are better evaluated qualitatively — don't force assertions onto things that need human judgment.
Update the eval_metadata.json files and evals/evals.json with the assertions once drafted. Also explain to the user what they'll see in the viewer — both the qualitative outputs and the quantitative benchmark.

Step 3: As runs complete, capture timing data


When each subagent task completes, you receive a notification containing total_tokens and duration_ms. Save this data immediately to timing.json in the run directory:

```json
{
  "total_tokens": 84852,
  "duration_ms": 23332,
  "total_duration_seconds": 23.3
}
```
This is the only opportunity to capture this data — it comes through the task notification and isn't persisted elsewhere. Process each notification as it arrives rather than trying to batch them.
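
The payload is trivial to derive from the notification fields; a sketch (`timing_record` is a hypothetical helper, and the derived `total_duration_seconds` simply rounds `duration_ms` to one decimal of seconds, matching the example above):

```python
import json

def timing_record(total_tokens, duration_ms):
    """Build the timing.json payload from a task-completion notification."""
    return {
        "total_tokens": total_tokens,
        "duration_ms": duration_ms,
        # seconds, rounded to one decimal place
        "total_duration_seconds": round(duration_ms / 1000, 1),
    }

print(json.dumps(timing_record(84852, 23332), indent=2))
```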

Step 4: Grade, aggregate, and launch the viewer


Once all runs are done:
  1. Grade each run — spawn a grader subagent (or grade inline) that reads agents/grader.md and evaluates each assertion against the outputs. Save results to grading.json in each run directory. The grading.json expectations array must use the fields text, passed, and evidence (not name/met/details or other variants) — the viewer depends on these exact field names. For assertions that can be checked programmatically, write and run a script rather than eyeballing it — scripts are faster, more reliable, and can be reused across iterations.
  2. Aggregate into benchmark — run the aggregation script from the skill-creator directory:

    ```bash
    python -m scripts.aggregate_benchmark <workspace>/iteration-N --skill-name <name>
    ```

    This produces benchmark.json and benchmark.md with pass_rate, time, and tokens for each configuration, with mean ± stddev and the delta. If generating benchmark.json manually, see references/schemas.md for the exact schema the viewer expects. Put each with_skill version before its baseline counterpart.
  3. Do an analyst pass — read the benchmark data and surface patterns the aggregate stats might hide. See agents/analyzer.md (the "Analyzing Benchmark Results" section) for what to look for — things like assertions that always pass regardless of skill (non-discriminating), high-variance evals (possibly flaky), and time/token tradeoffs.
  4. Launch the viewer with both qualitative outputs and quantitative data:

    ```bash
    nohup python <skill-creator-path>/eval-viewer/generate_review.py \
      <workspace>/iteration-N \
      --skill-name "my-skill" \
      --benchmark <workspace>/iteration-N/benchmark.json \
      > /dev/null 2>&1 &
    VIEWER_PID=$!
    ```

    For iteration 2+, also pass --previous-workspace <workspace>/iteration-<N-1>.
    Cowork / headless environments: If webbrowser.open() is not available or the environment has no display, use --static <output_path> to write a standalone HTML file instead of starting a server. Feedback will be downloaded as a feedback.json file when the user clicks "Submit All Reviews". After download, copy feedback.json into the workspace directory for the next iteration to pick up.
Note: please use generate_review.py to create the viewer; there's no need to write custom HTML.
  5. Tell the user something like: "I've opened the results in your browser. There are two tabs — 'Outputs' lets you click through each test case and leave feedback, 'Benchmark' shows the quantitative comparison. When you're done, come back here and let me know."
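
The field-name requirement from step 1 can be checked mechanically before launching the viewer. A sketch, assuming a grading.json shape like the one described above (`bad_expectations` is a hypothetical helper, not part of the tooling):

```python
VIEWER_FIELDS = {"text", "passed", "evidence"}  # exact names the viewer reads

def bad_expectations(grading):
    """Return expectation entries missing any of the required field names."""
    return [e for e in grading.get("expectations", [])
            if not VIEWER_FIELDS <= e.keys()]

good = {"expectations": [{"text": "chart has axis labels", "passed": True,
                          "evidence": "x and y labels present in output"}]}
# Common mistake: name/met/details instead of text/passed/evidence
wrong = {"expectations": [{"name": "axis labels", "met": True, "details": "ok"}]}
print(len(bad_expectations(good)), len(bad_expectations(wrong)))  # → 0 1
```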

What the user sees in the viewer


The "Outputs" tab shows one test case at a time:
  • Prompt: the task that was given
  • Output: the files the skill produced, rendered inline where possible
  • Previous Output (iteration 2+): collapsed section showing last iteration's output
  • Formal Grades (if grading was run): collapsed section showing assertion pass/fail
  • Feedback: a textbox that auto-saves as they type
  • Previous Feedback (iteration 2+): their comments from last time, shown below the textbox
The "Benchmark" tab shows the stats summary: pass rates, timing, and token usage for each configuration, with per-eval breakdowns and analyst observations.
Navigation is via prev/next buttons or arrow keys. When done, they click "Submit All Reviews" which saves all feedback to feedback.json.

Step 5: Read the feedback


When the user tells you they're done, read feedback.json:

```json
{
  "reviews": [
    {"run_id": "eval-0-with_skill", "feedback": "the chart is missing axis labels", "timestamp": "..."},
    {"run_id": "eval-1-with_skill", "feedback": "", "timestamp": "..."},
    {"run_id": "eval-2-with_skill", "feedback": "perfect, love this", "timestamp": "..."}
  ],
  "status": "complete"
}
```

Empty feedback means the user thought it was fine. Focus your improvements on the test cases where the user had specific complaints.
Kill the viewer server when you're done with it:

```bash
kill $VIEWER_PID 2>/dev/null
```
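
Filtering the file down to the reviews that actually need action is a one-liner; a sketch using the feedback.json shape above (`actionable_feedback` is a hypothetical helper):

```python
import json

def actionable_feedback(data):
    """Return (run_id, feedback) pairs where the user actually wrote something."""
    return [(r["run_id"], r["feedback"])
            for r in data.get("reviews", []) if r["feedback"].strip()]

feedback = json.loads("""
{
  "reviews": [
    {"run_id": "eval-0-with_skill", "feedback": "the chart is missing axis labels", "timestamp": "..."},
    {"run_id": "eval-1-with_skill", "feedback": "", "timestamp": "..."}
  ],
  "status": "complete"
}
""")
print(actionable_feedback(feedback))
# → [('eval-0-with_skill', 'the chart is missing axis labels')]
```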


Improving the skill


This is the heart of the loop. You've run the test cases, the user has reviewed the results, and now you need to make the skill better based on their feedback.

How to think about improvements


  1. Generalize from the feedback. The big picture thing that's happening here is that we're trying to create skills that can be used a million times (maybe literally, maybe even more who knows) across many different prompts. Here you and the user are iterating on only a few examples over and over again because it helps move faster. The user knows these examples in and out and it's quick for them to assess new outputs. But if the skill you and the user are codeveloping works only for those examples, it's useless. Rather than put in fiddly overfitty changes, or oppressively constrictive MUSTs, if there's some stubborn issue, you might try branching out and using different metaphors, or recommending different patterns of working. It's relatively cheap to try and maybe you'll land on something great.
  2. Keep the prompt lean. Remove things that aren't pulling their weight. Make sure to read the transcripts, not just the final outputs — if it looks like the skill is making the model waste a bunch of time doing things that are unproductive, you can try getting rid of the parts of the skill that are making it do that and seeing what happens.
  3. Explain the why. Try hard to explain the why behind everything you're asking the model to do. Today's LLMs are smart. They have good theory of mind and when given a good harness can go beyond rote instructions and really make things happen. Even if the feedback from the user is terse or frustrated, try to actually understand the task and why the user is writing what they wrote, and what they actually wrote, and then transmit this understanding into the instructions. If you find yourself writing ALWAYS or NEVER in all caps, or using super rigid structures, that's a yellow flag — if possible, reframe and explain the reasoning so that the model understands why the thing you're asking for is important. That's a more humane, powerful, and effective approach.
  4. Look for repeated work across test cases. Read the transcripts from the test runs and notice if the subagents all independently wrote similar helper scripts or took the same multi-step approach to something. If all 3 test cases resulted in the subagent writing a create_docx.py or a build_chart.py, that's a strong signal the skill should bundle that script. Write it once, put it in scripts/, and tell the skill to use it. This saves every future invocation from reinventing the wheel.
This task is pretty important (we are trying to create billions a year in economic value here!) and your thinking time is not the blocker; take your time and really mull things over. I'd suggest writing a draft revision and then looking at it anew and making improvements. Really do your best to get into the head of the user and understand what they want and need.

The iteration loop


After improving the skill:
  1. Apply your improvements to the skill
  2. Rerun all test cases into a new iteration-<N+1>/ directory, including baseline runs. If you're creating a new skill, the baseline is always without_skill (no skill) — that stays the same across iterations. If you're improving an existing skill, use your judgment on what makes sense as the baseline: the original version the user came in with, or the previous iteration.
  3. Launch the reviewer with --previous-workspace pointing at the previous iteration
  4. Wait for the user to review and tell you they're done
  5. Read the new feedback, improve again, repeat
Keep going until:
  • The user says they're happy
  • The feedback is all empty (everything looks good)
  • You're not making meaningful progress


Advanced: Blind comparison


For situations where you want a more rigorous comparison between two versions of a skill (e.g., the user asks "is the new version actually better?"), there's a blind comparison system. Read agents/comparator.md and agents/analyzer.md for the details. The basic idea is: give two outputs to an independent agent without telling it which is which, and let it judge quality. Then analyze why the winner won.
This is optional, requires subagents, and most users won't need it. The human review loop is usually sufficient.


Description Optimization


The description field in SKILL.md frontmatter is the primary mechanism that determines whether Claude invokes a skill. After creating or improving a skill, offer to optimize the description for better triggering accuracy.

Step 1: Generate trigger eval queries


Create 20 eval queries — a mix of should-trigger and should-not-trigger. Save as JSON:

```json
[
  {"query": "the user prompt", "should_trigger": true},
  {"query": "another prompt", "should_trigger": false}
]
```
The queries must be realistic and something a Claude Code or Claude.ai user would actually type. Not abstract requests, but requests that are concrete and specific and have a good amount of detail. For instance, file paths, personal context about the user's job or situation, column names and values, company names, URLs. A little bit of backstory. Some might be in lowercase or contain abbreviations or typos or casual speech. Use a mix of different lengths, and focus on edge cases rather than making them clear-cut (the user will get a chance to sign off on them).
Bad: "Format this data", "Extract text from PDF", "Create a chart"
Good: "ok so my boss just sent me this xlsx file (its in my downloads, called something like 'Q4 sales final FINAL v2.xlsx') and she wants me to add a column that shows the profit margin as a percentage. The revenue is in column C and costs are in column D i think"
For the should-trigger queries (8-10), think about coverage. You want different phrasings of the same intent — some formal, some casual. Include cases where the user doesn't explicitly name the skill or file type but clearly needs it. Throw in some uncommon use cases and cases where this skill competes with another but should win.
For the should-not-trigger queries (8-10), the most valuable ones are the near-misses — queries that share keywords or concepts with the skill but actually need something different. Think adjacent domains, ambiguous phrasing where a naive keyword match would trigger but shouldn't, and cases where the query touches on something the skill does but in a context where another tool is more appropriate.
The key thing to avoid: don't make should-not-trigger queries obviously irrelevant. "Write a fibonacci function" as a negative test for a PDF skill is too easy — it doesn't test anything. The negative cases should be genuinely tricky.
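
A quick way to check the 8-10 / 8-10 balance before showing the set to the user — a sketch over the JSON shape above (`eval_set_stats` is a hypothetical helper):

```python
def eval_set_stats(evals):
    """Count positive/negative queries to keep the should-trigger balance."""
    pos = sum(1 for e in evals if e["should_trigger"])
    return {"should_trigger": pos, "should_not_trigger": len(evals) - pos}

sample = [
    {"query": "the user prompt", "should_trigger": True},
    {"query": "another prompt", "should_trigger": False},
]
print(eval_set_stats(sample))  # → {'should_trigger': 1, 'should_not_trigger': 1}
```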

Step 2: Review with user

步骤2:与用户一起审核

Present the eval set to the user for review using the HTML template:
  1. Read the template from
    assets/eval_review.html
  2. Replace the placeholders:
    • __EVAL_DATA_PLACEHOLDER__
      → the JSON array of eval items (no quotes around it — it's a JS variable assignment)
    • __SKILL_NAME_PLACEHOLDER__
      → the skill's name
    • __SKILL_DESCRIPTION_PLACEHOLDER__
      → the skill's current description
  3. Write to a temp file (e.g.,
    /tmp/eval_review_<skill-name>.html
    ) and open it:
    open /tmp/eval_review_<skill-name>.html
  4. The user can edit queries, toggle should-trigger, add/remove entries, then click "Export Eval Set"
  5. The file downloads to
    ~/Downloads/eval_set.json
    — check the Downloads folder for the most recent version in case there are multiple (e.g.,
    eval_set (1).json
    )
This step matters — bad eval queries lead to bad descriptions.
使用HTML模板向用户展示评估集供其审核:
  1. assets/eval_review.html
    读取模板
  2. 替换占位符:
    • __EVAL_DATA_PLACEHOLDER__
      → 评估项的JSON数组(不要加引号——这是一个JS变量赋值)
    • __SKILL_NAME_PLACEHOLDER__
      → 技能的名称
    • __SKILL_DESCRIPTION_PLACEHOLDER__
      → 技能当前的描述
  3. 写入临时文件(例如,
    /tmp/eval_review_<skill-name>.html
    )并打开:
    open /tmp/eval_review_<skill-name>.html
  4. 用户可以编辑查询、切换是否应该触发、添加/删除条目,然后点击“导出评估集”
  5. 文件会下载到
    ~/Downloads/eval_set.json
    ——如果有多个版本(例如,
    eval_set (1).json
    ),请检查下载文件夹中的最新版本
这一步很重要——不好的评估查询会导致不好的描述。

Step 3: Run the optimization loop

步骤3:运行优化循环

Tell the user: "This will take some time — I'll run the optimization loop in the background and check on it periodically."
Save the eval set to the workspace, then run in the background:
bash
python -m scripts.run_loop \
  --eval-set <path-to-trigger-eval.json> \
  --skill-path <path-to-skill> \
  --model <model-id-powering-this-session> \
  --max-iterations 5 \
  --verbose
Use the model ID from your system prompt (the one powering the current session) so the triggering test matches what the user actually experiences.
While it runs, periodically tail the output to give the user updates on which iteration it's on and what the scores look like.
This handles the full optimization loop automatically. It splits the eval set into 60% train and 40% held-out test, evaluates the current description (running each query 3 times to get a reliable trigger rate), then calls Claude to propose improvements based on what failed. It re-evaluates each new description on both train and test, iterating up to 5 times. When it's done, it opens an HTML report in the browser showing the results per iteration and returns JSON with
best_description
— selected by test score rather than train score to avoid overfitting.
告诉用户:“这需要一些时间——我会在后台运行优化循环,并定期检查进度。”
将评估集保存到工作区,然后在后台运行:
bash
python -m scripts.run_loop \
  --eval-set <path-to-trigger-eval.json> \
  --skill-path <path-to-skill> \
  --model <model-id-powering-this-session> \
  --max-iterations 5 \
  --verbose
使用当前会话的模型ID(系统提示中的模型ID),这样触发测试可以匹配用户实际体验到的情况。
在运行过程中,定期查看输出,向用户更新当前迭代次数和分数情况。
这个脚本会自动处理完整的优化循环。它将评估集分为60%的训练集和40%的保留测试集,评估当前描述(每个查询运行3次以获得可靠的触发率),然后调用Claude根据失败情况提出改进建议。它会在训练集和测试集上重新评估每个新描述,最多迭代5次。完成后,它会在浏览器中打开一个HTML报告,显示每次迭代的结果,并返回包含
best_description
的JSON——根据测试分数选择,而不是训练分数,以避免过拟合。

How skill triggering works

技能触发机制的工作原理

Understanding the triggering mechanism helps design better eval queries. Skills appear in Claude's
available_skills
list with their name + description, and Claude decides whether to consult a skill based on that description. The important thing to know is that Claude only consults skills for tasks it can't easily handle on its own — simple, one-step queries like "read this PDF" may not trigger a skill even if the description matches perfectly, because Claude can handle them directly with basic tools. Complex, multi-step, or specialized queries reliably trigger skills when the description matches.
This means your eval queries should be substantive enough that Claude would actually benefit from consulting a skill. Simple queries like "read file X" are poor test cases — they won't trigger skills regardless of description quality.
理解触发机制有助于设计更好的评估查询。技能以名称+描述的形式出现在Claude的
available_skills
列表中,Claude根据描述决定是否调用该技能。需要知道的重要一点是,Claude只会在自己无法轻松处理的任务中调用技能——简单的单步查询,如“读取这个PDF”,即使描述完全匹配,也可能不会触发技能,因为Claude可以直接使用基本工具处理。复杂的、多步骤的或专业的查询,当描述匹配时,会可靠地触发技能。
这意味着你的评估查询应该足够复杂,Claude确实能从调用技能中受益。像“读取文件X”这样的简单查询是不好的测试用例——无论描述质量如何,它们都不会触发技能。

Step 4: Apply the result

步骤4:应用结果

Take
best_description
from the JSON output and update the skill's SKILL.md frontmatter. Show the user before/after and report the scores.

从JSON输出中获取
best_description
,更新技能的SKILL.md前置元数据。向用户展示前后对比,并报告分数。

Package and Present (only if
present_files
tool is available)

打包与呈现(仅当
present_files
工具可用时)

Check whether you have access to the
present_files
tool. If you don't, skip this step. If you do, package the skill and present the .skill file to the user:
bash
python -m scripts.package_skill <path/to/skill-folder>
After packaging, direct the user to the resulting
.skill
file path so they can install it.

检查你是否有权访问
present_files
工具。如果没有,跳过这一步。如果有,打包技能并向用户呈现.skill文件:
bash
python -m scripts.package_skill <path/to/skill-folder>
打包完成后,告诉用户生成的.skill文件路径,以便他们安装。

Claude.ai-specific instructions

Claude.ai特定说明

In Claude.ai, the core workflow is the same (draft → test → review → improve → repeat), but because Claude.ai doesn't have subagents, some mechanics change. Here's what to adapt:
Running test cases: No subagents means no parallel execution. For each test case, read the skill's SKILL.md, then follow its instructions to accomplish the test prompt yourself. Do them one at a time. This is less rigorous than independent subagents (you wrote the skill and you're also running it, so you have full context), but it's a useful sanity check — and the human review step compensates. Skip the baseline runs — just use the skill to complete the task as requested.
Reviewing results: If you can't open a browser (e.g., Claude.ai's VM has no display, or you're on a remote server), skip the browser reviewer entirely. Instead, present results directly in the conversation. For each test case, show the prompt and the output. If the output is a file the user needs to see (like a .docx or .xlsx), save it to the filesystem and tell them where it is so they can download and inspect it. Ask for feedback inline: "How does this look? Anything you'd change?"
Benchmarking: Skip the quantitative benchmarking — it relies on baseline comparisons which aren't meaningful without subagents. Focus on qualitative feedback from the user.
The iteration loop: Same as before — improve the skill, rerun the test cases, ask for feedback — just without the browser reviewer in the middle. You can still organize results into iteration directories on the filesystem if you have one.
Description optimization: This section requires the
claude
CLI tool (specifically
claude -p
) which is only available in Claude Code. Skip it if you're on Claude.ai.
Blind comparison: Requires subagents. Skip it.
Packaging: The
package_skill.py
script works anywhere with Python and a filesystem. On Claude.ai, you can run it and the user can download the resulting
.skill
file.
Updating an existing skill: The user might be asking you to update an existing skill, not create a new one. In this case:
  • Preserve the original name. Note the skill's directory name and
    name
    frontmatter field -- use them unchanged. E.g., if the installed skill is
    research-helper
    , output
    research-helper.skill
    (not
    research-helper-v2
    ).
  • Copy to a writeable location before editing. The installed skill path may be read-only. Copy to
    /tmp/skill-name/
    , edit there, and package from the copy.
  • If packaging manually, stage in
    /tmp/
    first
    , then copy to the output directory -- direct writes may fail due to permissions.

在Claude.ai中,核心工作流是相同的(初稿→测试→审核→改进→重复),但由于Claude.ai没有子代理,一些机制会有所不同。以下是需要调整的地方:
运行测试用例:没有子代理意味着无法并行执行。对于每个测试用例,读取技能的SKILL.md,然后按照其说明完成测试提示词。逐个执行。这不如独立子代理严格(你编写了技能,又自己运行它,所以有完整的上下文),但这是一个有用的 sanity check——人工审核步骤可以弥补这一点。跳过基准测试——只使用技能完成用户要求的任务。
审核结果:如果你无法打开浏览器(例如,Claude.ai的VM没有显示界面,或者你在远程服务器上),完全跳过浏览器查看器。相反,直接在对话中呈现结果。对于每个测试用例,显示提示词和输出。如果输出是用户需要查看的文件(如.docx或.xlsx),将其保存到文件系统,并告诉用户位置,以便他们下载和检查。在线询问反馈:“这个看起来怎么样?有什么需要修改的吗?”
基准测试:跳过定量基准测试——它依赖于基准对比,没有子代理的话没有意义。专注于用户的定性反馈。
迭代循环:与之前相同——改进技能,重新运行测试用例,询问反馈——只是中间没有浏览器查看器。如果你有文件系统,仍然可以在文件系统上按迭代版本组织结果。
描述优化:这部分需要
claude
CLI工具(特别是
claude -p
),只有Claude Code才有。如果是在Claude.ai上,跳过这一步。
盲法对比:需要子代理支持。跳过。
打包
package_skill.py
脚本在任何有Python和文件系统的环境中都可以运行。在Claude.ai上,你可以运行它,用户可以下载生成的.skill文件。
更新现有技能:用户可能要求你更新现有技能,而不是创建新技能。在这种情况下:
  • 保留原始名称。注意技能的目录名称和
    name
    前置元数据字段——不要修改。例如,如果已安装的技能是
    research-helper
    ,输出
    research-helper.skill
    (不是
    research-helper-v2
    )。
  • 在编辑前复制到可写位置。已安装的技能路径可能是只读的。复制到
    /tmp/skill-name/
    ,在那里编辑,然后从副本打包。
  • 如果手动打包,先在
    /tmp/
    中准备
    ,然后复制到输出目录——直接写入可能会因权限问题失败。

Cowork-Specific Instructions

Cowork特定说明

If you're in Cowork, the main things to know are:
  • You have subagents, so the main workflow (spawn test cases in parallel, run baselines, grade, etc.) all works. (However, if you run into severe problems with timeouts, it's OK to run the test prompts in series rather than parallel.)
  • You don't have a browser or display, so when generating the eval viewer, use
    --static <output_path>
    to write a standalone HTML file instead of starting a server. Then proffer a link that the user can click to open the HTML in their browser.
  • For whatever reason, the Cowork setup seems to disincline Claude from generating the eval viewer after running the tests, so just to reiterate: whether you're in Cowork or in Claude Code, after running tests, you should always generate the eval viewer for the human to look at examples before revising the skill yourself and trying to make corrections, using
    generate_review.py
    (not writing your own boutique html code). Sorry in advance but I'm gonna go all caps here: GENERATE THE EVAL VIEWER BEFORE evaluating inputs yourself. You want to get them in front of the human ASAP!
  • Feedback works differently: since there's no running server, the viewer's "Submit All Reviews" button will download
    feedback.json
    as a file. You can then read it from there (you may have to request access first).
  • Packaging works —
    package_skill.py
    just needs Python and a filesystem.
  • Description optimization (
    run_loop.py
    /
    run_eval.py
    ) should work in Cowork just fine since it uses
    claude -p
    via subprocess, not a browser, but please save it until you've fully finished making the skill and the user agrees it's in good shape.
  • Updating an existing skill: The user might be asking you to update an existing skill, not create a new one. Follow the update guidance in the claude.ai section above.

如果你在Cowork环境中,需要了解的主要事项:
  • 你有子代理,所以主要工作流(并行启动测试用例、运行基准测试、评分等)都可以正常工作。(但是,如果遇到严重的超时问题,可以按顺序运行测试提示词,而不是并行。)
  • 你没有浏览器或显示界面,所以生成评估查看器时,使用
    --static <output_path>
    参数生成独立的HTML文件,而不是启动服务器。然后提供一个链接,用户可以点击在自己的浏览器中打开HTML文件。
  • 出于某种原因,Cowork环境似乎不太容易让Claude在测试后生成评估查看器,所以这里再次强调:无论你是在Cowork还是Claude Code中,在运行测试后,都应该始终生成评估查看器,让人工先查看示例,然后再自己修改技能并尝试纠正,使用
    generate_review.py
    (不要编写自己的定制HTML代码)。提前抱歉,但我要大写强调:在你自己评估输入并尝试纠正之前,一定要生成评估查看器。你要尽快让人工看到结果!
  • 反馈机制不同:由于没有运行中的服务器,查看器的“提交所有反馈”按钮会将
    feedback.json
    下载为文件。你可以从那里读取它(可能需要先请求访问权限)。
  • 打包工作正常——
    package_skill.py
    只需要Python和文件系统。
  • 描述优化(
    run_loop.py
    /
    run_eval.py
    )在Cowork中应该可以正常工作,因为它通过子进程使用
    claude -p
    ,而不是浏览器,但请等到技能完全完成且用户确认状态良好后再进行。
  • 更新现有技能:用户可能要求你更新现有技能,而不是创建新技能。遵循上面Claude.ai部分中的更新指南。

Reference files

参考文件

The agents/ directory contains instructions for specialized subagents. Read them when you need to spawn the relevant subagent.
  • agents/grader.md
    — How to evaluate assertions against outputs
  • agents/comparator.md
    — How to do blind A/B comparison between two outputs
  • agents/analyzer.md
    — How to analyze why one version beat another
The references/ directory has additional documentation:
  • references/schemas.md
    — JSON structures for evals.json, grading.json, etc.

Repeating one more time the core loop here for emphasis:
  • Figure out what the skill is about
  • Draft or edit the skill
  • Run claude-with-access-to-the-skill on test prompts
  • With the user, evaluate the outputs:
    • Create benchmark.json and run
      eval-viewer/generate_review.py
      to help the user review them
    • Run quantitative evals
  • Repeat until you and the user are satisfied
  • Package the final skill and return it to the user.
Please add steps to your TodoList, if you have such a thing, to make sure you don't forget. If you're in Cowork, please specifically put "Create evals JSON and run
eval-viewer/generate_review.py
so human can review test cases" in your TodoList to make sure it happens.
Good luck!
agents/目录包含专门子代理的说明。当你需要启动相关子代理时,请阅读这些文件。
  • agents/grader.md
    — 如何根据输出评估断言
  • agents/comparator.md
    — 如何进行盲法A/B对比
  • agents/analyzer.md
    — 如何分析某个版本胜出的原因
references/目录包含额外的文档:
  • references/schemas.md
    — evals.json、grading.json等文件的JSON结构

最后再强调一遍核心循环:
  • 确定技能的用途
  • 起草或编辑技能
  • 让具备该技能访问权限的Claude运行测试提示词
  • 与用户一起评估输出:
    • 创建benchmark.json并运行
      eval-viewer/generate_review.py
      帮助用户审核
    • 运行定量评估
  • 重复直到你和用户都满意
  • 打包最终技能并返回给用户
如果有任务列表,请将这些步骤添加进去,确保不会忘记。如果你在Cowork环境中,请特别将“创建evals JSON并运行
eval-viewer/generate_review.py
让人工审核测试用例”添加到任务列表中,确保这一步会执行。
祝你好运!