create-skill-test

Create Skill Test

This skill helps you scaffold evaluation tests (`eval.yaml`) for agent skills, ensuring they conform to the dotnet/skills repository conventions, pass the skill-validator checks, and avoid common overfitting pitfalls.

When to Use

  • Creating a new `eval.yaml` test file for a skill
  • Adding scenarios to an existing eval file
  • Setting up test fixture files alongside eval definitions
  • Reviewing whether rubric items and assertions risk overfitting

When Not to Use

  • Running or debugging existing tests (use the skill-validator directly)
  • Modifying the skill-validator tool itself
  • Creating or editing SKILL.md files (use the `create-skill` skill)

Inputs

| Input | Required | Description |
| --- | --- | --- |
| Skill name | Yes | The skill being tested (must match a skill under `plugins/<plugin>/skills/`) |
| Plugin name | Yes | The plugin the skill belongs to (e.g., `dotnet-msbuild`) |
| Skill content | Recommended | The SKILL.md content to understand what the skill teaches |
| Scenario descriptions | Recommended | What situations the agent should be tested on |

Workflow

Step 1: Locate the target and determine the test directory

Tests live at:

For skills:

tests/<plugin>/<skill-name>/eval.yaml

For agents (`agent.` prefix convention):

tests/<plugin>/agent.<agent-name>/eval.yaml

For skills, verify the skill exists at `plugins/<plugin>/skills/<skill-name>/SKILL.md`. For agents, verify the agent exists at `plugins/<plugin>/agents/<agent-name>.agent.md`. Read the target content to understand what it does — this is critical for writing non-overfitted rubric items.

Step 2: Create the test directory and eval.yaml

Create the directory and file:

For skills:

tests/<plugin>/<skill-name>/
└── eval.yaml

For agents:

tests/<plugin>/agent.<agent-name>/
└── eval.yaml

The `agent.` prefix disambiguates agent test directories from skill test directories that might share the same name.

Step 3: Write scenarios

Each scenario needs a `name`, `prompt`, at least one `assertion`, and a `rubric`. Use this structure:
yaml
scenarios:
  - name: "Descriptive scenario name"
    prompt: "Natural language task description as a developer would phrase it"
    setup:
      copy_test_files: true          # OR use inline files
    assertions:
      - type: "output_contains"
        value: "expected text"
    rubric:
      - "The agent correctly identified the root cause"
      - "The agent suggested a concrete, actionable fix"
    timeout: 120

Scenario guidelines

  • Name: Describe what is being tested, not how (e.g., "Diagnose missing package reference" not "Test binlog replay and error extraction").
  • Prompt: Write as a natural developer request. Never mention the skill name or instruct the agent to "use a skill." Neutral prompts prevent prompt overfitting.
  • Timeout: Default is 120 seconds. Use 300-600 for scenarios requiring builds, benchmarks, or multi-step operations.
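
A minimal sketch of these guidelines applied to a single scenario. The scenario content (a build that fails on an unresolved type) is hypothetical; only the field names come from the structure shown in this step.

```yaml
scenarios:
  # The name says WHAT is tested, not how the agent should go about it.
  - name: "Diagnose missing package reference"
    # Phrased as a developer would ask; no skill name, no "use a skill".
    prompt: "My build fails with an error about a type that cannot be found. Can you figure out what is wrong and fix it?"
    setup:
      copy_test_files: true
    assertions:
      - type: "output_matches"
        pattern: "(package|PackageReference|NuGet)"
    rubric:
      - "Identified the missing package reference as the cause of the build failure"
    # Raised above the 120-second default because the scenario involves a build.
    timeout: 300
```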

Step 4: Configure setup

Choose one of three setup strategies:

Option A: Copy test files (recommended for complex fixtures)

Place fixture files alongside `eval.yaml` and enable auto-copy:
yaml
setup:
  copy_test_files: true
All files in the directory (except `eval.yaml`) are copied into the agent's working directory.

Option B: Inline files (good for small, self-contained scenarios)

yaml
setup:
  files:
    - path: "MyProject/MyProject.csproj"
      content: |
        <Project Sdk="Microsoft.NET.Sdk">
          <PropertyGroup>
            <TargetFramework>net10.0</TargetFramework>
          </PropertyGroup>
        </Project>
    - path: "MyProject/Program.cs"
      content: |
        Console.WriteLine("Hello");

Option C: Reference fixture files from a subdirectory

yaml
setup:
  files:
    - path: "TestProject.csproj"
      source: "fixtures/scenario-a/TestProject.csproj"
Use this when multiple scenarios share a `fixtures/` directory with separate subdirectories.
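
As an illustration, two scenarios might each pull their project file from their own subdirectory of the shared `fixtures/` directory. This is a sketch: the `scenario-b` subdirectory and the placeholder names are hypothetical, and the other required scenario fields are left out for brevity.

```yaml
scenarios:
  - name: "<scenario A>"
    # prompt, assertions, and rubric omitted for brevity
    setup:
      files:
        - path: "TestProject.csproj"
          source: "fixtures/scenario-a/TestProject.csproj"
  - name: "<scenario B>"
    # Same destination path, different source subdirectory (hypothetical).
    setup:
      files:
        - path: "TestProject.csproj"
          source: "fixtures/scenario-b/TestProject.csproj"
```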

Setup commands (optional)

Run shell commands before the agent starts (e.g., to build a project and generate artifacts):
yaml
setup:
  copy_test_files: true
  commands:
    - "dotnet build -bl:build.binlog"

Scenario dependencies (optional)

Some agents route to specific skills, or some skills depend on sibling agents. In the isolated run, only the target is loaded, so the scenario must declare its dependencies using `additional_required_skills` and/or `additional_required_agents`:
yaml
setup:
  copy_test_files: true
  additional_required_skills:
    - binlog-failure-analysis    # loaded in isolated run alongside the target
  additional_required_agents:
    - build-perf                 # registered in isolated run alongside the target
  • Names are resolved from the same plugin's `skills/` or `agents/` directory.
  • These only affect the isolated run. The plugin run already loads everything; the baseline loads nothing.
  • Different scenarios of the same target can declare different dependencies (per-scenario granularity).
  • If a declared name cannot be resolved, the validator fails with an error.
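
The per-scenario granularity could look like the sketch below, where each scenario of the same target declares only what it needs. The dependency names are reused from the example above; the scenario names are placeholders, and the other required fields are omitted.

```yaml
scenarios:
  - name: "<scenario that needs a sibling skill>"
    setup:
      copy_test_files: true
      additional_required_skills:
        - binlog-failure-analysis
  - name: "<scenario that needs a sibling agent>"
    setup:
      copy_test_files: true
      additional_required_agents:
        - build-perf
  # No extra dependencies: the isolated run loads only the target.
  - name: "<scenario with no dependencies>"
    setup:
      copy_test_files: true
```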

Step 5: Write assertions

Assertions are hard pass/fail checks. Use them for objective, binary-verifiable criteria.
| Type | Required fields | Description |
| --- | --- | --- |
| `output_contains` | `value` | Agent output contains text (case-insensitive) |
| `output_not_contains` | `value` | Agent output must NOT contain text |
| `output_matches` | `pattern` | Agent output matches regex |
| `output_not_matches` | `pattern` | Agent output does NOT match regex |
| `file_exists` | `path` | File matching glob exists in work dir |
| `file_not_exists` | `path` | No file matching glob exists |
| `file_contains` | `path`, `value` | File at glob path contains text |
| `file_not_contains` | `path`, `value` | File at glob path does NOT contain text |
| `exit_success` | (none) | Agent produced non-empty output |
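
To show the field shapes for the less common types, here is a sketch of an `assertions` list. All values, paths, and patterns are hypothetical and chosen only for illustration.

```yaml
assertions:
  # Passes when the text is absent from the agent output.
  - type: "output_not_contains"
    value: "packages.config"
  # Passes when the regex does not match the agent output.
  - type: "output_not_matches"
    pattern: "(unable to|cannot) (build|restore)"
  # Passes when a file matching the glob exists in the working directory.
  - type: "file_exists"
    path: "*.binlog"
  # Passes when the file at the glob path lacks the text.
  - type: "file_not_contains"
    path: "MyProject/MyProject.csproj"
    value: "net8.0"
  # Takes no fields; passes when the agent produced non-empty output.
  - type: "exit_success"
```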

Assertion guidelines

  • Prefer broad assertions that multiple valid approaches would satisfy.
  • Avoid narrow assertions that gate on a specific syntax or flag the LLM already knows.
  • Use `output_matches` with regex alternation for flexible matching: `"(root cause|primary error|underlying issue)"`.
  • Use `file_contains` / `file_not_contains` to verify the agent modified files correctly.
  • Use `output_not_contains` and `file_not_exists` to verify the agent avoided incorrect actions.
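
The contrast between narrow and broad assertions might be sketched as follows. The file paths and values are hypothetical; the regex is the one quoted in the guidelines.

```yaml
assertions:
  # Narrow (avoid): passes only if the agent prints one exact phrase.
  # - type: "output_contains"
  #   value: "root cause"

  # Broad (prefer): any of several valid phrasings passes.
  - type: "output_matches"
    pattern: "(root cause|primary error|underlying issue)"
  # Checks that the fix landed in a project file (hypothetical path and value).
  - type: "file_contains"
    path: "*.csproj"
    value: "PackageReference"
  # Checks that the agent avoided an incorrect action (hypothetical file).
  - type: "file_not_exists"
    path: "Directory.Packages.props"
```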

Step 6: Write rubric items

Rubric items are evaluated by an LLM judge using pairwise comparison (baseline vs. skill-enhanced). Quality metrics (rubric-based at 40% weight plus overall judgment at 30%) together dominate the composite improvement score.
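
As a quick arithmetic check on the two weights quoted above (the symbol names are illustrative and not taken from the validator):

```latex
\[
w_{\text{quality}} = w_{\text{rubric}} + w_{\text{overall}} = 0.40 + 0.30 = 0.70
\]
```

How the remaining weight is distributed is not described here.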

The three rubric classifications (and how to stay in "outcome")

The overfitting judge classifies each rubric item:
| Classification | Description | Goal |
| --- | --- | --- |
| outcome | Tests whether the agent reached a correct result. Describes WHAT, not HOW. | Target this |
| technique | Tests whether the agent used a skill-specific procedure. | Minimize |
| vocabulary | Tests whether the agent used specific terminology from the skill. | Avoid |

Rubric writing rules

  1. Test outcomes, not methods. Write "Identified the root cause of the build failure", not "Replayed the binlog using `dotnet build /flp`."
  2. Allow alternative approaches. If multiple valid solutions exist, the rubric item should accept any of them.
  3. Never reference the skill by name or use phrasing copied directly from the SKILL.md.
  4. Don't test pre-existing LLM knowledge. If the LLM already knows something (common APIs, standard syntax, basic escaping), testing for it adds no signal.
  5. Test findings, not diagnostic steps. Write "Correctly determined that the root cause is a missing PackageReference", not "Used `dotnet restore` to check package resolution."
  6. Each item should be independently evaluable. Avoid compound items that test multiple things.
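
Rule 6 in particular can be illustrated with a before-and-after sketch. The wording is hypothetical and not taken from any existing eval file.

```yaml
rubric:
  # Compound (avoid): a detection and a recommendation bundled into one item,
  # so the judge cannot credit one without the other.
  # - "Detected that the project uses packages.config and suggested migrating to PackageReference first"

  # Split: each item can be judged on its own.
  - "Detected that the project uses packages.config instead of PackageReference"
  - "Suggested migrating to PackageReference as a first step"
```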

Examples

Well-designed (outcome-focused):
yaml
rubric:
  - "Correctly identified the missing NuGet package as the root cause of the build failure"
  - "Recognized that downstream project failures were cascading from the root cause, not independent errors"
  - "Suggested a concrete fix that would resolve the root cause"
Overfitted (vocabulary/technique):
yaml
rubric:
  - "Replayed the binary log using 'dotnet build /flp:v=diag'"      # technique: gates on specific command
  - "Measured cold, warm, and no-op build scenarios"                  # vocabulary: uses skill's labels
  - "Used the --clreventlevel flag with dotnet trace collect"         # vocabulary: gates on specific flag

Step 7: Add optional constraints

yaml
expect_tools: ["bash"]           # Agent must use these tools
reject_tools: ["create_file"]    # Agent must NOT use these tools
max_turns: 10                    # Maximum agent iterations
max_tokens: 5000                 # Maximum token budget
Use constraints sparingly — only when the scenario specifically requires or forbids certain agent behaviors.
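
For context, a constrained scenario might look like the sketch below. It assumes the constraint keys sit at the scenario level alongside `timeout`, which the fragment above does not state explicitly; the scenario itself is a placeholder.

```yaml
scenarios:
  - name: "<scenario where the agent must not create files>"
    prompt: "<natural developer request>"
    assertions:
      - type: "exit_success"
    rubric:
      - "<outcome the agent should reach>"
    # Assumption: constraints are scenario-level keys, like timeout.
    reject_tools: ["create_file"]
    max_turns: 10
```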

Step 8: Add non-activation scenarios with `expect_activation: false`


Many skills have clear boundaries: situations where the skill should recognize it does not apply and decline gracefully. Test these boundaries using `expect_activation: false`.

How `expect_activation: false` works

When a scenario has `expect_activation: false`:
  1. All three runs still execute (baseline, skilled-isolated, skilled-plugin) and assertions are evaluated on each. The flag does not change which runs are performed.
  2. Activation verdict is inverted: if the skill is not activated for this prompt, the evaluator reports it as `ℹ️ not activated (expected)` instead of treating it as a failure.
  3. The scenario is excluded from the noise test: the multi-skill activation test only runs positive (`expect_activation: true`) scenarios.

When to use non-activation scenarios

Add `expect_activation: false` scenarios when the skill has explicit "When Not to Use" boundaries. Common patterns:

| Pattern | Example |
| --- | --- |
| Wrong input format | Skill handles Android tombstones; scenario provides an iOS crash log |
| Out-of-scope request | Skill collects dumps; scenario asks to analyze a dump |
| Incompatible project type | Skill converts PackageReference to CPM; scenario has packages.config |
| Wrong framework version | Skill migrates .NET 8→9; scenario provides a .NET 8 app and asks for .NET 10 migration |
| Prerequisite not met | Skill requires a specific file format that isn't present |

Example: Wrong input format

yaml
- name: "Reject iOS crash log as wrong format"
  prompt: "I have a crash log file at crashlog_ios.txt from a crashed app. Please symbolicate the .NET runtime frames."
  expect_activation: false
  setup:
    copy_test_files: true
  assertions:
    - type: "output_matches"
      pattern: "(iOS|Apple|not.*(Android|tombstone)|wrong.*(format|type))"
  rubric:
    - "Recognized that this is an iOS crash log, not an Android tombstone"
    - "Did NOT attempt to apply the Android tombstone symbolication workflow"
    - "Explained that iOS crash logs require a different symbolication process"

Example: Out-of-scope request

yaml
- name: "Decline dump analysis request"
  prompt: |
    I already have a .dmp crash dump file from my .NET app. Can you help
    me analyze it to find the root cause of the crash?
  expect_activation: false
  assertions:
    - type: "output_matches"
      pattern: "(out of scope|not cover|does not|cannot|only.*collect)"
  rubric:
    - "Clearly states that dump analysis is out of scope for this skill"
    - "Does not attempt to open or analyze the dump file"
    - "Does not install analysis tools like dotnet-dump analyze, lldb, or windbg"
  timeout: 30

Example: Incompatible project type

yaml
- name: "Decline CPM conversion for packages.config project"
  prompt: "Convert my simple-packages-config/LegacyApp project to Central Package Management."
  expect_activation: false
  setup:
    copy_test_files: true
  assertions:
    - type: "output_contains"
      value: "packages.config"
    - type: "file_not_exists"
      path: "simple-packages-config/Directory.Packages.props"
  rubric:
    - "Detected the project uses packages.config instead of PackageReference format"
    - "Informed the user that CPM requires PackageReference and cannot be applied to packages.config projects"
    - "Suggested migrating from packages.config to PackageReference first"
    - "Did not attempt to create Directory.Packages.props or modify any project files"

Rubric guidelines for non-activation scenarios

Non-activation rubric items typically verify three things:
  1. Recognition — The agent identified why the skill doesn't apply.
  2. Restraint — The agent did NOT attempt the skill's workflow (no file modifications, no tool installs).
  3. Redirection — The agent suggested the correct alternative approach or next step.
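
A sketch of how the three checks could map onto rubric items, using the "wrong framework version" pattern (a .NET 8 app with a request for .NET 10 migration). The scenario name, prompt, regex, and rubric wording are all hypothetical.

```yaml
- name: "Decline migration to an unsupported target version"
  prompt: "Migrate my .NET 8 app to .NET 10."
  expect_activation: false
  setup:
    copy_test_files: true
  assertions:
    - type: "output_matches"
      pattern: "(out of scope|not (supported|covered|applicable)|different (version|target))"
  rubric:
    # Recognition
    - "Recognized that the requested target version is outside the supported migration path"
    # Restraint
    - "Did not modify project files to retarget the app"
    # Redirection
    - "Pointed the user to an appropriate next step for the requested migration"
```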

Step 9: Validate the eval.yaml

Run the static validator:
bash
dotnet run --project eng/skill-validator/src/SkillValidator.csproj -- check --plugin ./plugins/<plugin>
Then run evaluation (at least 3 runs for reliable results):

For skills:

bash
dotnet run --project eng/skill-validator/src/SkillValidator.csproj -- evaluate \
  --runs 3 \
  --tests-dir tests/<plugin> \
  plugins/<plugin>/skills/<skill-name>

For agents:

bash
dotnet run --project eng/skill-validator/src/SkillValidator.csproj -- evaluate \
  --runs 3 \
  --tests-dir tests/<plugin> \
  plugins/<plugin>/agents/<agent-name>.agent.md

eval.yaml Template

yaml
scenarios:
  - name: "<Describe what the agent should accomplish>"
    prompt: "<Natural developer request — do not mention the skill>"
    setup:
      copy_test_files: true
    assertions:
      - type: "output_contains"
        value: "<key term that a correct response must include>"
      - type: "exit_success"
    rubric:
      - "<Outcome: what the agent should have identified or produced>"
      - "<Outcome: what fix or recommendation the agent should have given>"
      - "<Outcome: what incorrect approach the agent should have avoided>"
    timeout: 120

  - name: "<Describe situation where the skill should NOT apply>"
    prompt: "<Request that superficially matches the skill but falls outside its scope>"
    expect_activation: false
    setup:
      copy_test_files: true
    assertions:
      - type: "output_matches"
        pattern: "<pattern matching the agent's explanation of why it cannot help>"
      - type: "file_not_exists"
        path: "<file the skill would create if it incorrectly activated>"
    rubric:
      - "<Recognition: agent identified why the skill does not apply>"
      - "<Restraint: agent did not attempt the skill's workflow>"
      - "<Redirection: agent suggested the correct alternative>"
    timeout: 120

Validation Checklist

After creating a test, verify:
  • Test directory matches `tests/<plugin>/<skill-name>/` for skills or `tests/<plugin>/agent.<agent-name>/` for agents
  • Target exists at `plugins/<plugin>/skills/<skill-name>/SKILL.md` (skill) or `plugins/<plugin>/agents/<agent-name>.agent.md` (agent)
  • Every scenario has `name`, `prompt`, at least one assertion, and rubric items
  • Prompts are written as natural developer requests (no skill/agent name references)
  • Assertions are broad enough that multiple valid approaches pass
  • Rubric items test outcomes, not specific techniques or vocabulary
  • Fixture files are present when `copy_test_files: true` is used
  • `source` paths in setup files point to existing fixture files
  • `additional_required_skills` / `additional_required_agents` names exist in the same plugin
  • Timeouts are reasonable for the scenario complexity
  • Non-activation scenarios use `expect_activation: false` and verify recognition, restraint, and redirection
  • `dotnet run --project eng/skill-validator/src/SkillValidator.csproj -- check` passes

Common Pitfalls

| Pitfall | Solution |
| --- | --- |
| Prompt mentions the skill by name | Rewrite as a natural developer request describing the problem |
| Prompt mentions the agent by name | Same as above; an agent name in the prompt biases the baseline |
| Rubric tests a specific diagnostic command | Rewrite to test the finding or outcome that command produces |
| Assertion gates on syntax the LLM already knows | Use a broader pattern or test the result instead |
| All rubric items test the same aspect | Diversify: test identification, fix quality, and error avoidance |
| Missing fixture files for `copy_test_files` | Add the required project/source files alongside eval.yaml |
| Timeout too short for builds | Use 300-600s for scenarios that compile or run benchmarks |
| Single scenario covers the entire skill | Break into focused scenarios testing different aspects |
| Compound rubric items testing multiple things | Split into separate, independently-evaluable items |
| No non-activation scenarios for skill with clear boundaries | Add `expect_activation: false` scenarios for each "When Not to Use" case |
| Agent test missing `additional_required_skills` | If the agent routes to specific skills, declare them so the isolated run loads them |
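
As one worked case, the "all rubric items test the same aspect" pitfall might be fixed along these lines. The rubric wording is hypothetical.

```yaml
rubric:
  # Before (avoid): three items that all test identification.
  # - "Identified the failing project"
  # - "Identified the error message"
  # - "Identified the root cause"

  # After: one item each for identification, fix quality, and error avoidance.
  - "Identified the root cause of the build failure"
  - "Suggested a concrete fix that would resolve the root cause"
  - "Did not apply changes unrelated to the root cause"
```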