# create-skill-test

This skill helps you scaffold evaluation tests (`eval.yaml`) for agent skills, ensuring they conform to the dotnet/skills repository conventions, pass the skill-validator checks, and avoid common overfitting pitfalls.

## When to Use
- Creating a new `eval.yaml` test file for a skill
- Adding scenarios to an existing eval file
- Setting up test fixture files alongside eval definitions
- Reviewing whether rubric items and assertions risk overfitting

## When Not to Use

- Running or debugging existing tests (use the skill-validator directly)
- Modifying the skill-validator tool itself
- Creating or editing SKILL.md files (use the `create-skill` skill)
## Inputs

| Input | Required | Description |
|---|---|---|
| Skill name | Yes | The skill being tested (must match a skill under `plugins/<plugin>/skills/`) |
| Plugin name | Yes | The plugin the skill belongs to |
| Skill content | Recommended | The SKILL.md content to understand what the skill teaches |
| Scenario descriptions | Recommended | What situations the agent should be tested on |
## Workflow

### Step 1: Locate the target and determine the test directory

Tests live at:

For skills:

```
tests/<plugin>/<skill-name>/eval.yaml
```

For agents (`agent.` prefix convention):

```
tests/<plugin>/agent.<agent-name>/eval.yaml
```

For skills, verify the skill exists at `plugins/<plugin>/skills/<skill-name>/SKILL.md`. For agents, verify the agent exists at `plugins/<plugin>/agents/<agent-name>.agent.md`. Read the target content to understand what it does — this is critical for writing non-overfitted rubric items.

### Step 2: Create the test directory and eval.yaml

Create the directory and file:

For skills:

```
tests/<plugin>/<skill-name>/
└── eval.yaml
```

For agents:

```
tests/<plugin>/agent.<agent-name>/
└── eval.yaml
```

The `agent.` prefix disambiguates agent test directories from skill test directories that might share the same name.

### Step 3: Write scenarios
Each scenario needs a `name`, a `prompt`, at least one `assertion`, and a `rubric`. Use this structure:

```yaml
scenarios:
  - name: "Descriptive scenario name"
    prompt: "Natural language task description as a developer would phrase it"
    setup:
      copy_test_files: true  # OR use inline files
    assertions:
      - type: "output_contains"
        value: "expected text"
    rubric:
      - "The agent correctly identified the root cause"
      - "The agent suggested a concrete, actionable fix"
    timeout: 120
```

#### Scenario guidelines

- **Name**: Describe what is being tested, not how (e.g., "Diagnose missing package reference" not "Test binlog replay and error extraction").
- **Prompt**: Write as a natural developer request. Never mention the skill name or instruct the agent to "use a skill." Neutral prompts prevent prompt overfitting.
- **Timeout**: Default is 120 seconds. Use 300-600 for scenarios requiring builds, benchmarks, or multi-step operations.
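To make the prompt guideline concrete, here is a sketch contrasting an overfitted prompt (shown only as a comment) with a neutral one; the scenario name and the `build.binlog` file are hypothetical:

```yaml
# Overfitted — names the skill and dictates the method (avoid):
#   prompt: "Use the binlog-failure-analysis skill to replay build.binlog"

# Neutral — describes the problem the way a developer would:
- name: "Diagnose failing multi-project build"
  prompt: "My solution fails to build with confusing errors. Can you work out what is actually wrong? There is a build.binlog in the repo root."
  timeout: 300  # build-heavy scenario, per the timeout guideline
```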
### Step 4: Configure setup

Choose one of three setup strategies:

#### Option A: Copy test files (recommended for complex fixtures)

Place fixture files alongside `eval.yaml` and enable auto-copy:

```yaml
setup:
  copy_test_files: true
```

All files in the directory (except `eval.yaml`) are copied into the agent's working directory.
#### Option B: Inline files (good for small, self-contained scenarios)

```yaml
setup:
  files:
    - path: "MyProject/MyProject.csproj"
      content: |
        <Project Sdk="Microsoft.NET.Sdk">
          <PropertyGroup>
            <TargetFramework>net10.0</TargetFramework>
          </PropertyGroup>
        </Project>
    - path: "MyProject/Program.cs"
      content: |
        Console.WriteLine("Hello");
```

#### Option C: Reference fixture files from a subdirectory
```yaml
setup:
  files:
    - path: "TestProject.csproj"
      source: "fixtures/scenario-a/TestProject.csproj"
```

Use this when multiple scenarios share a `fixtures/` directory with separate subdirectories.

#### Setup commands (optional)
Run shell commands before the agent starts (e.g., to build a project and generate artifacts):

```yaml
setup:
  copy_test_files: true
  commands:
    - "dotnet build -bl:build.binlog"
```

#### Scenario dependencies (optional)
Some agents route to specific skills, and some skills depend on sibling agents. In the isolated run, only the target is loaded — so the scenario must declare its dependencies using `additional_required_skills` and/or `additional_required_agents`:

```yaml
setup:
  copy_test_files: true
  additional_required_skills:
    - binlog-failure-analysis  # loaded in isolated run alongside the target
  additional_required_agents:
    - build-perf  # registered in isolated run alongside the target
```

- Names are resolved from the same plugin's `skills/` or `agents/` directory.
- These only affect the isolated run. The plugin run already loads everything; the baseline loads nothing.
- Different scenarios of the same target can declare different dependencies (per-scenario granularity).
- If a declared name cannot be resolved, the validator fails with an error.
### Step 5: Write assertions

Assertions are hard pass/fail checks. Use them for objective, binary-verifiable criteria.

| Type | Required fields | Description |
|---|---|---|
| `output_contains` | `value` | Agent output contains text (case-insensitive) |
| `output_not_contains` | `value` | Agent output must NOT contain text |
| `output_matches` | `pattern` | Agent output matches regex |
| `output_not_matches` | `pattern` | Agent output does NOT match regex |
| `file_exists` | `path` | File matching glob exists in work dir |
| `file_not_exists` | `path` | No file matching glob exists |
| `file_contains` | `path`, `value` | File at glob path contains text |
| `file_not_contains` | `path`, `value` | File at glob path does NOT contain text |
| `exit_success` | — | Agent produced non-empty output |
#### Assertion guidelines

- Prefer broad assertions that multiple valid approaches would satisfy.
- Avoid narrow assertions that gate on a specific syntax or flag the LLM already knows.
- Use `output_matches` with regex alternation for flexible matching: `"(root cause|primary error|underlying issue)"`.
- Use `file_contains` / `file_not_contains` to verify the agent modified files correctly.
- Use `output_not_contains` and `file_not_exists` to verify the agent avoided incorrect actions.
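Alternation patterns are easy to get subtly wrong, so it can help to sanity-check one locally before committing it. The sketch below uses `grep -Ei` to approximate case-insensitive regex matching — an assumption, since the validator's exact regex engine is not specified here:

```shell
# Sanity-check an output_matches alternation against sample outputs.
pattern='(root cause|primary error|underlying issue)'

# A phrasing a correct response might use — should match despite the casing.
echo "The Root Cause is a missing PackageReference." | grep -Eiq "$pattern" \
  && echo "positive sample matches"

# An unrelated response — should be rejected.
echo "Everything builds fine." | grep -Eiq "$pattern" \
  || echo "negative sample rejected"
```

If either echo is missing, the pattern is too narrow (or too broad) for the phrasings you intend to accept.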
### Step 6: Write rubric items

Rubric items are evaluated by an LLM judge using pairwise comparison (baseline vs. skill-enhanced). Quality metrics (rubric-based at 40% weight plus overall judgment at 30%) together dominate the composite improvement score.

#### The three rubric classifications (and how to stay in "outcome")

The overfitting judge classifies each rubric item:

| Classification | Description | Goal |
|---|---|---|
| outcome | Tests whether the agent reached a correct result. Describes WHAT, not HOW. | Target this |
| technique | Tests whether the agent used a skill-specific procedure. | Minimize |
| vocabulary | Tests whether the agent used specific terminology from the skill. | Avoid |
#### Rubric writing rules

- Test outcomes, not methods. Write "Identified the root cause of the build failure" — not "Replayed the binlog using `dotnet build /flp`."
- Allow alternative approaches. If multiple valid solutions exist, the rubric item should accept any of them.
- Never reference the skill by name or use phrasing copied directly from the SKILL.md.
- Don't test pre-existing LLM knowledge. If the LLM already knows something (common APIs, standard syntax, basic escaping), testing for it adds no signal.
- Test findings, not diagnostic steps. Write "Correctly determined that the root cause is a missing PackageReference" — not "Used `dotnet restore` to check package resolution."
- Each item should be independently evaluable. Avoid compound items that test multiple things.
#### Examples

Well-designed (outcome-focused):

```yaml
rubric:
  - "Correctly identified the missing NuGet package as the root cause of the build failure"
  - "Recognized that downstream project failures were cascading from the root cause, not independent errors"
  - "Suggested a concrete fix that would resolve the root cause"
```

Overfitted (vocabulary/technique):

```yaml
rubric:
  - "Replayed the binary log using 'dotnet build /flp:v=diag'"  # technique: gates on specific command
  - "Measured cold, warm, and no-op build scenarios"  # vocabulary: uses skill's labels
  - "Used the --clreventlevel flag with dotnet trace collect"  # vocabulary: gates on specific flag
```

### Step 7: Add optional constraints
```yaml
expect_tools: ["bash"]         # Agent must use these tools
reject_tools: ["create_file"]  # Agent must NOT use these tools
max_turns: 10                  # Maximum agent iterations
max_tokens: 5000               # Maximum token budget
```

Use constraints sparingly — only when the scenario specifically requires or forbids certain agent behaviors.
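For orientation, these constraint keys would presumably sit at the scenario level alongside `timeout` — that placement is an assumption, since the snippet above shows them without surrounding context, and the scenario name and prompt below are hypothetical:

```yaml
scenarios:
  - name: "Inspect build output without creating files"
    prompt: "Check whether the last build produced warnings."
    expect_tools: ["bash"]        # the scenario must exercise the shell
    reject_tools: ["create_file"] # fail if the agent writes new files
    max_turns: 10
    assertions:
      - type: "output_contains"
        value: "warning"
    rubric:
      - "Reported whether the build produced warnings"
    timeout: 120
```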
### Step 8: Add non-activation scenarios with `expect_activation: false`

Many skills have clear boundaries — situations where the skill should recognize it does not apply and decline gracefully. Test these boundaries using `expect_activation: false`.

#### How `expect_activation: false` works

When a scenario has `expect_activation: false`:

- All three runs still execute (baseline, skilled-isolated, skilled-plugin) and assertions are evaluated on each. The flag does not change which runs are performed.
- Activation verdict is inverted — if the skill is not activated for this prompt, the evaluator reports it as "ℹ️ not activated (expected)" instead of treating it as a failure.
- The scenario is excluded from the noise test — the multi-skill activation test only runs positive (`expect_activation: true`) scenarios.
#### When to use non-activation scenarios

Add `expect_activation: false` scenarios when the skill has explicit "When Not to Use" boundaries. Common patterns:

| Pattern | Example |
|---|---|
| Wrong input format | Skill handles Android tombstones; scenario provides an iOS crash log |
| Out-of-scope request | Skill collects dumps; scenario asks to analyze a dump |
| Incompatible project type | Skill converts PackageReference to CPM; scenario has packages.config |
| Wrong framework version | Skill migrates .NET 8→9; scenario provides a .NET 8 app and asks for .NET 10 migration |
| Prerequisite not met | Skill requires a specific file format that isn't present |
#### Example: Wrong input format

```yaml
- name: "Reject iOS crash log as wrong format"
  prompt: "I have a crash log file at crashlog_ios.txt from a crashed app. Please symbolicate the .NET runtime frames."
  expect_activation: false
  setup:
    copy_test_files: true
  assertions:
    - type: "output_matches"
      pattern: "(iOS|Apple|not.*(Android|tombstone)|wrong.*(format|type))"
  rubric:
    - "Recognized that this is an iOS crash log, not an Android tombstone"
    - "Did NOT attempt to apply the Android tombstone symbolication workflow"
    - "Explained that iOS crash logs require a different symbolication process"
```

#### Example: Out-of-scope request
```yaml
- name: "Decline dump analysis request"
  prompt: |
    I already have a .dmp crash dump file from my .NET app. Can you help
    me analyze it to find the root cause of the crash?
  expect_activation: false
  assertions:
    - type: "output_matches"
      pattern: "(out of scope|not cover|does not|cannot|only.*collect)"
  rubric:
    - "Clearly states that dump analysis is out of scope for this skill"
    - "Does not attempt to open or analyze the dump file"
    - "Does not install analysis tools like dotnet-dump analyze, lldb, or windbg"
  timeout: 30
```

#### Example: Incompatible project type
```yaml
- name: "Decline CPM conversion for packages.config project"
  prompt: "Convert my simple-packages-config/LegacyApp project to Central Package Management."
  expect_activation: false
  setup:
    copy_test_files: true
  assertions:
    - type: "output_contains"
      value: "packages.config"
    - type: "file_not_exists"
      path: "simple-packages-config/Directory.Packages.props"
  rubric:
    - "Detected the project uses packages.config instead of PackageReference format"
    - "Informed the user that CPM requires PackageReference and cannot be applied to packages.config projects"
    - "Suggested migrating from packages.config to PackageReference first"
    - "Did not attempt to create Directory.Packages.props or modify any project files"
```

#### Rubric guidelines for non-activation scenarios
Non-activation rubric items typically verify three things:

- Recognition — The agent identified why the skill doesn't apply.
- Restraint — The agent did NOT attempt the skill's workflow (no file modifications, no tool installs).
- Redirection — The agent suggested the correct alternative approach or next step.
### Step 9: Validate the eval.yaml

Run the static validator:

```bash
dotnet run --project eng/skill-validator/src/SkillValidator.csproj -- check --plugin ./plugins/<plugin>
```

Then run evaluation (at least 3 runs for reliable results):

For skills:

```bash
dotnet run --project eng/skill-validator/src/SkillValidator.csproj -- evaluate \
  --runs 3 \
  --tests-dir tests/<plugin> \
  plugins/<plugin>/skills/<skill-name>
```

For agents:

```bash
dotnet run --project eng/skill-validator/src/SkillValidator.csproj -- evaluate \
  --runs 3 \
  --tests-dir tests/<plugin> \
  plugins/<plugin>/agents/<agent-name>.agent.md
```
undefinedeval.yaml Template
eval.yaml模板
yaml
scenarios:
- name: "<Describe what the agent should accomplish>"
prompt: "<Natural developer request — do not mention the skill>"
setup:
copy_test_files: true
assertions:
- type: "output_contains"
value: "<key term that a correct response must include>"
- type: "exit_success"
rubric:
- "<Outcome: what the agent should have identified or produced>"
- "<Outcome: what fix or recommendation the agent should have given>"
- "<Outcome: what incorrect approach the agent should have avoided>"
timeout: 120
- name: "<Describe situation where the skill should NOT apply>"
prompt: "<Request that superficially matches the skill but falls outside its scope>"
expect_activation: false
setup:
copy_test_files: true
assertions:
- type: "output_matches"
pattern: "<pattern matching the agent's explanation of why it cannot help>"
- type: "file_not_exists"
path: "<file the skill would create if it incorrectly activated>"
rubric:
- "<Recognition: agent identified why the skill does not apply>"
- "<Restraint: agent did not attempt the skill's workflow>"
- "<Redirection: agent suggested the correct alternative>"
timeout: 120yaml
scenarios:
- name: "<描述Agent应完成的任务>"
prompt: "<开发者的自然语言请求——请勿提及Skill>"
setup:
copy_test_files: true
assertions:
- type: "output_contains"
value: "<正确响应必须包含的关键词>"
- type: "exit_success"
rubric:
- "<结果:Agent应识别或生成的内容>"
- "<结果:Agent应给出的修复方案或建议>"
- "<结果:Agent应避免的错误方法>"
timeout: 120
- name: "<描述Skill不应适用的场景>"
prompt: "<表面上匹配Skill但超出其范围的请求>"
expect_activation: false
setup:
copy_test_files: true
assertions:
- type: "output_matches"
pattern: "<匹配Agent解释无法提供帮助的模式>"
- type: "file_not_exists"
path: "<Skill错误激活时会创建的文件>"
rubric:
- "<识别:Agent识别出Skill不适用的原因>"
- "<克制:Agent未尝试执行Skill的流程>"
- "<引导:Agent提出了正确的替代方案>"
timeout: 120Validation Checklist
After creating a test, verify:

- Test directory matches `tests/<plugin>/<skill-name>/` for skills or `tests/<plugin>/agent.<agent-name>/` for agents
- Target exists at `plugins/<plugin>/skills/<skill-name>/SKILL.md` (skill) or `plugins/<plugin>/agents/<agent-name>.agent.md` (agent)
- Every scenario has `name`, `prompt`, at least one assertion, and rubric items
- Prompts are written as natural developer requests (no skill/agent name references)
- Assertions are broad enough that multiple valid approaches pass
- Rubric items test outcomes, not specific techniques or vocabulary
- Fixture files are present when `copy_test_files: true` is used
- `source` paths in setup files point to existing fixture files
- `additional_required_skills` / `additional_required_agents` names exist in the same plugin
- Timeouts are reasonable for the scenario complexity
- Non-activation scenarios use `expect_activation: false` and verify recognition, restraint, and redirection
- `dotnet run --project eng/skill-validator/src/SkillValidator.csproj -- check` passes
## Common Pitfalls

| Pitfall | Solution |
|---|---|
| Prompt mentions the skill by name | Rewrite as a natural developer request describing the problem |
| Prompt mentions the agent by name | Same as above — agent name in prompts biases the baseline |
| Rubric tests a specific diagnostic command | Rewrite to test the finding or outcome that command produces |
| Assertion gates on syntax the LLM already knows | Use a broader pattern or test the result instead |
| All rubric items test the same aspect | Diversify: test identification, fix quality, and error avoidance |
| Missing fixture files for `copy_test_files: true` | Add the required project/source files alongside eval.yaml |
| Timeout too short for builds | Use 300-600s for scenarios that compile or run benchmarks |
| Single scenario covers the entire skill | Break into focused scenarios testing different aspects |
| Compound rubric items testing multiple things | Split into separate, independently-evaluable items |
| No non-activation scenarios for skill with clear boundaries | Add `expect_activation: false` scenarios for each "When Not to Use" boundary |
| Agent test missing `additional_required_skills` | If the agent routes to specific skills, declare them so the isolated run loads them |