# create-skill-test

This skill helps you scaffold evaluation tests (`eval.yaml`) for agent skills, ensuring they conform to the dotnet/skills repository conventions, pass the skill-validator checks, and avoid common overfitting pitfalls.

## When to Use
- Creating a new `eval.yaml` test file for a skill
- Adding scenarios to an existing eval file
- Setting up test fixture files alongside eval definitions
- Reviewing whether rubric items and assertions risk overfitting

## When Not to Use

- Running or debugging existing tests (use the skill-validator directly)
- Modifying the skill-validator tool itself
- Creating or editing SKILL.md files (use the `create-skill` skill)
## Inputs

| Input | Required | Description |
|---|---|---|
| Skill name | Yes | The skill being tested (must match a skill under `plugins/<plugin>/skills/`) |
| Plugin name | Yes | The plugin the skill belongs to |
| Skill content | Recommended | The SKILL.md content to understand what the skill teaches |
| Scenario descriptions | Recommended | What situations the agent should be tested on |
## Workflow

### Step 1: Locate the target and determine the test directory

Tests live at:

For skills:

```
tests/<plugin>/<skill-name>/eval.yaml
```

For agents (`agent.` prefix convention):

```
tests/<plugin>/agent.<agent-name>/eval.yaml
```

For skills, verify the skill exists at `plugins/<plugin>/skills/<skill-name>/SKILL.md`. For agents, verify the agent exists at `plugins/<plugin>/agents/<agent-name>.agent.md`. Read the target content to understand what it does — this is critical for writing non-overfitted rubric items.

### Step 2: Create the test directory and eval.yaml

Create the directory and file:

For skills:

```
tests/<plugin>/<skill-name>/
└── eval.yaml
```

For agents:

```
tests/<plugin>/agent.<agent-name>/
└── eval.yaml
```

The `agent.` prefix disambiguates agent test directories from skill test directories that might share the same name.

### Step 3: Write scenarios
Each scenario needs a `name`, a `prompt`, at least one `assertion`, and a `rubric`. Use this structure:

```yaml
scenarios:
  - name: "Descriptive scenario name"
    prompt: "Natural language task description as a developer would phrase it"
    setup:
      copy_test_files: true  # OR use inline files
    assertions:
      - type: "output_contains"
        value: "expected text"
    rubric:
      - "The agent correctly identified the root cause"
      - "The agent suggested a concrete, actionable fix"
    timeout: 120
```

#### Scenario guidelines

- **Name**: Describe what is being tested, not how (e.g., "Diagnose missing package reference" not "Test binlog replay and error extraction").
- **Prompt**: Write as a natural developer request. Never mention the skill name or instruct the agent to "use a skill." Neutral prompts prevent prompt overfitting.
- **Timeout**: Default is 120 seconds. Use 300-600 for scenarios requiring builds, benchmarks, or multi-step operations.
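To make the prompt guideline concrete, here is a sketch contrasting an overfitted prompt (shown only as a comment) with a neutral one; the scenario name and the `build.binlog` file are hypothetical:

```yaml
# Overfitted — names the skill and dictates the method (avoid):
#   prompt: "Use the binlog-failure-analysis skill to replay build.binlog"

# Neutral — describes the problem the way a developer would:
- name: "Diagnose failing multi-project build"
  prompt: "My solution fails to build with confusing errors. Can you work out what is actually wrong? There is a build.binlog in the repo root."
  timeout: 300  # build-heavy scenario, per the timeout guideline
```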
### Step 4: Configure setup

Choose one of three setup strategies:

#### Option A: Copy test files (recommended for complex fixtures)

Place fixture files alongside `eval.yaml` and enable auto-copy:

```yaml
setup:
  copy_test_files: true
```

All files in the directory (except `eval.yaml`) are copied into the agent's working directory.
#### Option B: Inline files (good for small, self-contained scenarios)

```yaml
setup:
  files:
    - path: "MyProject/MyProject.csproj"
      content: |
        <Project Sdk="Microsoft.NET.Sdk">
          <PropertyGroup>
            <TargetFramework>net10.0</TargetFramework>
          </PropertyGroup>
        </Project>
    - path: "MyProject/Program.cs"
      content: |
        Console.WriteLine("Hello");
```

#### Option C: Reference fixture files from a subdirectory
```yaml
setup:
  files:
    - path: "TestProject.csproj"
      source: "fixtures/scenario-a/TestProject.csproj"
```

Use this when multiple scenarios share a `fixtures/` directory with separate subdirectories.

#### Setup commands (optional)
Run shell commands before the agent starts (e.g., to build a project and generate artifacts):

```yaml
setup:
  copy_test_files: true
  commands:
    - "dotnet build -bl:build.binlog"
```

#### Scenario dependencies (optional)
Some agents route to specific skills, and some skills depend on sibling agents. In the isolated run, only the target is loaded — so the scenario must declare its dependencies using `additional_required_skills` and/or `additional_required_agents`:

```yaml
setup:
  copy_test_files: true
  additional_required_skills:
    - binlog-failure-analysis  # loaded in isolated run alongside the target
  additional_required_agents:
    - build-perf  # registered in isolated run alongside the target
```

- Names are resolved from the same plugin's `skills/` or `agents/` directory.
- These only affect the isolated run. The plugin run already loads everything; the baseline loads nothing.
- Different scenarios of the same target can declare different dependencies (per-scenario granularity).
- If a declared name cannot be resolved, the validator fails with an error.
### Step 5: Write assertions

Assertions are hard pass/fail checks. Use them for objective, binary-verifiable criteria.

| Type | Required fields | Description |
|---|---|---|
| `output_contains` | `value` | Agent output contains text (case-insensitive) |
| `output_not_contains` | `value` | Agent output must NOT contain text |
| `output_matches` | `pattern` | Agent output matches regex |
| `output_not_matches` | `pattern` | Agent output does NOT match regex |
| `file_exists` | `path` | File matching glob exists in work dir |
| `file_not_exists` | `path` | No file matching glob exists |
| `file_contains` | `path`, `value` | File at glob path contains text |
| `file_not_contains` | `path`, `value` | File at glob path does NOT contain text |
| `exit_success` | — | Agent produced non-empty output |
#### Assertion guidelines

- Prefer broad assertions that multiple valid approaches would satisfy.
- Avoid narrow assertions that gate on a specific syntax or flag the LLM already knows.
- Use `output_matches` with regex alternation for flexible matching: `"(root cause|primary error|underlying issue)"`.
- Use `file_contains` / `file_not_contains` to verify the agent modified files correctly.
- Use `output_not_contains` and `file_not_exists` to verify the agent avoided incorrect actions.
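Alternation patterns are easy to get subtly wrong, so it can help to sanity-check one locally before committing it. The sketch below uses `grep -Ei` to approximate case-insensitive regex matching — an assumption, since the validator's exact regex engine is not specified here:

```shell
# Sanity-check an output_matches alternation against sample outputs.
pattern='(root cause|primary error|underlying issue)'

# A phrasing a correct response might use — should match despite the casing.
echo "The Root Cause is a missing PackageReference." | grep -Eiq "$pattern" \
  && echo "positive sample matches"

# An unrelated response — should be rejected.
echo "Everything builds fine." | grep -Eiq "$pattern" \
  || echo "negative sample rejected"
```

If either echo is missing, the pattern is too narrow (or too broad) for the phrasings you intend to accept.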
### Step 6: Write rubric items

Rubric items are evaluated by an LLM judge using pairwise comparison (baseline vs. skill-enhanced). Quality metrics (rubric-based at 40% weight plus overall judgment at 30%) together dominate the composite improvement score.

#### The three rubric classifications (and how to stay in "outcome")

The overfitting judge classifies each rubric item:

| Classification | Description | Goal |
|---|---|---|
| outcome | Tests whether the agent reached a correct result. Describes WHAT, not HOW. | Target this |
| technique | Tests whether the agent used a skill-specific procedure. | Minimize |
| vocabulary | Tests whether the agent used specific terminology from the skill. | Avoid |
#### Rubric writing rules

- Test outcomes, not methods. Write "Identified the root cause of the build failure" — not "Replayed the binlog using `dotnet build /flp`."
- Allow alternative approaches. If multiple valid solutions exist, the rubric item should accept any of them.
- Never reference the skill by name or use phrasing copied directly from the SKILL.md.
- Don't test pre-existing LLM knowledge. If the LLM already knows something (common APIs, standard syntax, basic escaping), testing for it adds no signal.
- Test findings, not diagnostic steps. Write "Correctly determined that the root cause is a missing PackageReference" — not "Used `dotnet restore` to check package resolution."
- Each item should be independently evaluable. Avoid compound items that test multiple things.
#### Examples

Well-designed (outcome-focused):

```yaml
rubric:
  - "Correctly identified the missing NuGet package as the root cause of the build failure"
  - "Recognized that downstream project failures were cascading from the root cause, not independent errors"
  - "Suggested a concrete fix that would resolve the root cause"
```

Overfitted (vocabulary/technique):

```yaml
rubric:
  - "Replayed the binary log using 'dotnet build /flp:v=diag'"  # technique: gates on specific command
  - "Measured cold, warm, and no-op build scenarios"  # vocabulary: uses skill's labels
  - "Used the --clreventlevel flag with dotnet trace collect"  # vocabulary: gates on specific flag
```

### Step 7: Add optional constraints
```yaml
expect_tools: ["bash"]         # Agent must use these tools
reject_tools: ["create_file"]  # Agent must NOT use these tools
max_turns: 10                  # Maximum agent iterations
max_tokens: 5000               # Maximum token budget
```

Use constraints sparingly — only when the scenario specifically requires or forbids certain agent behaviors.
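For orientation, these constraint keys would presumably sit at the scenario level alongside `timeout` — that placement is an assumption, since the snippet above shows them without surrounding context, and the scenario name and prompt below are hypothetical:

```yaml
scenarios:
  - name: "Inspect build output without creating files"
    prompt: "Check whether the last build produced warnings."
    expect_tools: ["bash"]        # the scenario must exercise the shell
    reject_tools: ["create_file"] # fail if the agent writes new files
    max_turns: 10
    assertions:
      - type: "output_contains"
        value: "warning"
    rubric:
      - "Reported whether the build produced warnings"
    timeout: 120
```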
### Step 8: Add non-activation scenarios with `expect_activation: false`

Many skills have clear boundaries — situations where the skill should recognize it does not apply and decline gracefully. Test these boundaries using `expect_activation: false`.

#### How `expect_activation: false` works

When a scenario has `expect_activation: false`:

- All three runs still execute (baseline, skilled-isolated, skilled-plugin) and assertions are evaluated on each. The flag does not change which runs are performed.
- Activation verdict is inverted — if the skill is not activated for this prompt, the evaluator reports it as "ℹ️ not activated (expected)" instead of treating it as a failure.
- The scenario is excluded from the noise test — the multi-skill activation test only runs positive (`expect_activation: true`) scenarios.
#### When to use non-activation scenarios

Add `expect_activation: false` scenarios when the skill has explicit "When Not to Use" boundaries. Common patterns:

| Pattern | Example |
|---|---|
| Wrong input format | Skill handles Android tombstones; scenario provides an iOS crash log |
| Out-of-scope request | Skill collects dumps; scenario asks to analyze a dump |
| Incompatible project type | Skill converts PackageReference to CPM; scenario has packages.config |
| Wrong framework version | Skill migrates .NET 8→9; scenario provides a .NET 8 app and asks for .NET 10 migration |
| Prerequisite not met | Skill requires a specific file format that isn't present |
#### Example: Wrong input format

```yaml
- name: "Reject iOS crash log as wrong format"
  prompt: "I have a crash log file at crashlog_ios.txt from a crashed app. Please symbolicate the .NET runtime frames."
  expect_activation: false
  setup:
    copy_test_files: true
  assertions:
    - type: "output_matches"
      pattern: "(iOS|Apple|not.*(Android|tombstone)|wrong.*(format|type))"
  rubric:
    - "Recognized that this is an iOS crash log, not an Android tombstone"
    - "Did NOT attempt to apply the Android tombstone symbolication workflow"
    - "Explained that iOS crash logs require a different symbolication process"
```

#### Example: Out-of-scope request
```yaml
- name: "Decline dump analysis request"
  prompt: |
    I already have a .dmp crash dump file from my .NET app. Can you help
    me analyze it to find the root cause of the crash?
  expect_activation: false
  assertions:
    - type: "output_matches"
      pattern: "(out of scope|not cover|does not|cannot|only.*collect)"
  rubric:
    - "Clearly states that dump analysis is out of scope for this skill"
    - "Does not attempt to open or analyze the dump file"
    - "Does not install analysis tools like dotnet-dump analyze, lldb, or windbg"
  timeout: 30
```

#### Example: Incompatible project type
```yaml
- name: "Decline CPM conversion for packages.config project"
  prompt: "Convert my simple-packages-config/LegacyApp project to Central Package Management."
  expect_activation: false
  setup:
    copy_test_files: true
  assertions:
    - type: "output_contains"
      value: "packages.config"
    - type: "file_not_exists"
      path: "simple-packages-config/Directory.Packages.props"
  rubric:
    - "Detected the project uses packages.config instead of PackageReference format"
    - "Informed the user that CPM requires PackageReference and cannot be applied to packages.config projects"
    - "Suggested migrating from packages.config to PackageReference first"
    - "Did not attempt to create Directory.Packages.props or modify any project files"
```

#### Rubric guidelines for non-activation scenarios
Non-activation rubric items typically verify three things:

- Recognition — The agent identified why the skill doesn't apply.
- Restraint — The agent did NOT attempt the skill's workflow (no file modifications, no tool installs).
- Redirection — The agent suggested the correct alternative approach or next step.
### Step 9: Validate the eval.yaml

Run the static validator:

```bash
dotnet run --project eng/skill-validator/src/SkillValidator.csproj -- check --plugin ./plugins/<plugin>
```

Then run evaluation (at least 3 runs for reliable results):

For skills:

```bash
dotnet run --project eng/skill-validator/src/SkillValidator.csproj -- evaluate \
  --runs 3 \
  --tests-dir tests/<plugin> \
  plugins/<plugin>/skills/<skill-name>
```

For agents:

```bash
dotnet run --project eng/skill-validator/src/SkillValidator.csproj -- evaluate \
  --runs 3 \
  --tests-dir tests/<plugin> \
  plugins/<plugin>/agents/<agent-name>.agent.md
```
undefinedeval.yaml Template
eval.yaml模板
yaml
scenarios:
- name: "<Describe what the agent should accomplish>"
prompt: "<Natural developer request — do not mention the skill>"
setup:
copy_test_files: true
assertions:
- type: "output_contains"
value: "<key term that a correct response must include>"
- type: "exit_success"
rubric:
- "<Outcome: what the agent should have identified or produced>"
- "<Outcome: what fix or recommendation the agent should have given>"
- "<Outcome: what incorrect approach the agent should have avoided>"
timeout: 120
- name: "<Describe situation where the skill should NOT apply>"
prompt: "<Request that superficially matches the skill but falls outside its scope>"
expect_activation: false
setup:
copy_test_files: true
assertions:
- type: "output_matches"
pattern: "<pattern matching the agent's explanation of why it cannot help>"
- type: "file_not_exists"
path: "<file the skill would create if it incorrectly activated>"
rubric:
- "<Recognition: agent identified why the skill does not apply>"
- "<Restraint: agent did not attempt the skill's workflow>"
- "<Redirection: agent suggested the correct alternative>"
timeout: 120yaml
scenarios:
- name: "<描述Agent应完成的任务>"
prompt: "<开发者的自然语言请求——请勿提及Skill>"
setup:
copy_test_files: true
assertions:
- type: "output_contains"
value: "<正确响应必须包含的关键词>"
- type: "exit_success"
rubric:
- "<结果:Agent应识别或生成的内容>"
- "<结果:Agent应给出的修复方案或建议>"
- "<结果:Agent应避免的错误方法>"
timeout: 120
- name: "<描述Skill不应适用的场景>"
prompt: "<表面上匹配Skill但超出其范围的请求>"
expect_activation: false
setup:
copy_test_files: true
assertions:
- type: "output_matches"
pattern: "<匹配Agent解释无法提供帮助的模式>"
- type: "file_not_exists"
path: "<Skill错误激活时会创建的文件>"
rubric:
- "<识别:Agent识别出Skill不适用的原因>"
- "<克制:Agent未尝试执行Skill的流程>"
- "<引导:Agent提出了正确的替代方案>"
timeout: 120Validation Checklist
After creating a test, verify:

- Test directory matches `tests/<plugin>/<skill-name>/` for skills or `tests/<plugin>/agent.<agent-name>/` for agents
- Target exists at `plugins/<plugin>/skills/<skill-name>/SKILL.md` (skill) or `plugins/<plugin>/agents/<agent-name>.agent.md` (agent)
- Every scenario has `name`, `prompt`, at least one assertion, and rubric items
- Prompts are written as natural developer requests (no skill/agent name references)
- Assertions are broad enough that multiple valid approaches pass
- Rubric items test outcomes, not specific techniques or vocabulary
- Fixture files are present when `copy_test_files: true` is used
- `source` paths in setup files point to existing fixture files
- `additional_required_skills` / `additional_required_agents` names exist in the same plugin
- Timeouts are reasonable for the scenario complexity
- Non-activation scenarios use `expect_activation: false` and verify recognition, restraint, and redirection
- `dotnet run --project eng/skill-validator/src/SkillValidator.csproj -- check` passes
## Common Pitfalls

| Pitfall | Solution |
|---|---|
| Prompt mentions the skill by name | Rewrite as a natural developer request describing the problem |
| Prompt mentions the agent by name | Same as above — agent name in prompts biases the baseline |
| Rubric tests a specific diagnostic command | Rewrite to test the finding or outcome that command produces |
| Assertion gates on syntax the LLM already knows | Use a broader pattern or test the result instead |
| All rubric items test the same aspect | Diversify: test identification, fix quality, and error avoidance |
| Missing fixture files for `copy_test_files: true` | Add the required project/source files alongside eval.yaml |
| Timeout too short for builds | Use 300-600s for scenarios that compile or run benchmarks |
| Single scenario covers the entire skill | Break into focused scenarios testing different aspects |
| Compound rubric items testing multiple things | Split into separate, independently-evaluable items |
| No non-activation scenarios for skill with clear boundaries | Add `expect_activation: false` scenarios for each "When Not to Use" boundary |
| Agent test missing `additional_required_skills` | If the agent routes to specific skills, declare them so the isolated run loads them |