skill-tester

@rules/test-matrix.md @rules/scenario-design.md @rules/evidence-reporting.md @references/prompt-pack-template.md

Skill Tester

Prove a skill works as intended before trusting it.
<purpose>
  • Test whether a skill triggers on the right user requests and stays inactive on the wrong ones.
  • Verify the skill's workflow, support-file routing, scripts/assets, and validation instructions against realistic usage.
  • Expand coverage around edge cases, boundary prompts, ambiguity, missing inputs, malformed resources, and regression risks.
</purpose>
<routing_rule>
Use `skill-tester` when the user wants to test, validate, QA, regression-test, or edge-case-test an existing skill or skill folder.
Use `skill-maker` when the main job is creating or structurally refactoring a skill. Use `autoresearch-skill` when the main job is repeated measured optimization across experiments. Use `qa` or project-specific QA skills when the target is an application feature rather than a skill.
Do not use `skill-tester` when:
  • there is no skill or skill draft to evaluate
  • the user wants only generic documentation review
  • the task is app/browser QA unrelated to skill behavior
  • the user has already requested a full experiment loop with scoring and mutations
</routing_rule>
<trigger_conditions>
Positive examples:
  • "Test
    skills/git-maker/
    and tell me whether it triggers correctly."
  • "Verify whether this skill works as intended, including edge cases." (Korean-language requests with the same meaning should also trigger.)
  • "Create a regression test pack for this skill's trigger and workflow behavior."
  • "Validate the
    SKILL.md
    , rules, references, and scripts before I ship this skill."
Negative examples:
  • "Create a new Codex skill for browser QA." Route to
    skill-maker
    .
  • "Run QA on my web app checkout flow." Route to app QA, not this skill.
  • "Optimize this skill through repeated benchmark experiments." Route to
    autoresearch-skill
    .
Boundary example:
  • "Review this skill and fix any issues you find." Start with
    skill-tester
    if the emphasis is evidence and failures; switch to
    skill-maker
    only for structural edits after the test findings are clear.
</trigger_conditions>
<supported_targets>
  • Skill folders containing `SKILL.md` and optional localized variants such as `SKILL.ko.md`.
  • Skill metadata, trigger descriptions, routing rules, and examples.
  • Directly linked `rules/`, `references/`, `scripts/`, and `assets/`.
  • Trigger prompt packs, workflow simulations, validation checklists, and regression reports.
  • Edge cases around ambiguity, missing inputs, conflicting instructions, unsupported targets, and resource drift.
</supported_targets>
<required_inputs>
Minimum input:
  1. Target skill path or pasted skill content.
  2. Intended job of the skill, if not obvious from metadata.
If either is missing, inspect local context first. Ask only when the target skill or intended behavior cannot be inferred safely.
Optional but useful:
  • Known prompts that should trigger.
  • Known prompts that should not trigger.
  • Expected outputs or workflow checkpoints.
  • Recent failures, regressions, or edge cases to reproduce.
</required_inputs>
<skill_architecture>
Load support files deliberately:
  • Use `rules/test-matrix.md` to choose what dimensions to test.
  • Use `rules/scenario-design.md` to write positive, negative, boundary, adversarial, and localization scenarios.
  • Use `rules/evidence-reporting.md` to report pass/fail evidence and next fixes.
  • Use `scripts/validate-skill.mjs` for deterministic static checks when a filesystem skill folder is available.
  • Use `references/prompt-pack-template.md` when the user asks for reusable regression tests or a prompt pack artifact.
Keep test evidence close to the target skill when the user asks for reusable artifacts; otherwise report findings inline.
</skill_architecture>
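For orientation, the sketch below shows the kind of deterministic check such a script could run. It is an illustration only: the real `scripts/validate-skill.mjs` may behave differently, and the `staticCheck` function and its finding shape are assumptions rather than the script's actual API.

```js
// Hypothetical sketch only; not the actual scripts/validate-skill.mjs.
// Checks that support files linked from SKILL.md exist on disk.
import { readFile, access } from "node:fs/promises";
import path from "node:path";

export async function staticCheck(skillDir) {
  const findings = [];
  const text = await readFile(path.join(skillDir, "SKILL.md"), "utf8");

  // Collect rules/, references/, scripts/, and assets/ paths mentioned in SKILL.md.
  const links = text.match(/(?:rules|references|scripts|assets)\/[\w./-]+/g) ?? [];
  for (const raw of new Set(links)) {
    const link = raw.replace(/\.+$/, ""); // trim sentence-final periods
    try {
      await access(path.join(skillDir, link));
    } catch {
      // A linked-but-missing file is resource-drift in the failure taxonomy below.
      findings.push({ taxonomy: "resource-drift", file: link, note: "linked but missing" });
    }
  }
  return findings;
}
```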
<workflow>
| Phase | Task | Output |
| --- | --- | --- |
| 0 | Identify target skill, intended behavior, and neighboring skills that might conflict | Test scope |
| 1 | Read `SKILL.md` and directly linked support files needed for the test | Baseline behavior map |
| 2 | Build a scenario matrix covering positive, negative, boundary, edge, and regression cases | Test matrix |
| 3 | Run static anatomy checks and inspect support-file references | Static findings |
| 4 | Simulate skill routing and workflow execution for each scenario | Pass/fail table |
| 5 | Classify failures by trigger, scope, resource placement, workflow, validation, or safety | Ranked defects |
| 6 | Recommend minimal fixes or hand off to `skill-maker`/`autoresearch-skill` when edits are needed | Evidence-backed report |
</workflow>
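Phase 4 is usually manual reasoning, but when a harness is available the pass/fail table can be produced mechanically. A minimal sketch, assuming scenario records shaped like the matrix sketched after `<test_requirements>` below; `wouldTrigger` is a placeholder for however routing is judged, not a real API:

```js
// Sketch of phase 4: turn routing judgments into pass/fail rows.
// `wouldTrigger(prompt)` stands in for the actual routing judgment,
// whether that is manual review or an automated harness.
function simulateRouting(scenarios, wouldTrigger) {
  return scenarios.map((s) => {
    const observed = wouldTrigger(s.prompt) ? "trigger" : "no-trigger";
    return { ...s, observed, result: observed === s.expect ? "pass" : "fail" };
  });
}
```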
<test_requirements>
Every meaningful skill test should include at least:
  • 3 positive trigger scenarios.
  • 2 negative trigger scenarios.
  • 2 boundary or ambiguous scenarios.
  • 2 edge-case scenarios, such as missing inputs, malformed paths, unsupported language, conflicting instructions, or absent support files.
  • 1 regression scenario for a known or likely failure.
For localized skills, include at least one scenario in each supported language when trigger behavior depends on language. In this repository, include at least one Korean positive or boundary request when testing skills that ship `SKILL.ko.md`.
</test_requirements>
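As a concrete example, a minimal matrix satisfying these counts for this very skill might look like the sketch below. The record shape and IDs are illustrative, not a required schema; most prompts are drawn from the trigger conditions above:

```js
// Illustrative scenario matrix: 3 positive, 2 negative, 2 boundary, 2 edge, 1 regression.
const scenarios = [
  { id: "P1", type: "positive", prompt: "Test skills/git-maker/ and tell me whether it triggers correctly.", expect: "trigger" },
  { id: "P2", type: "positive", prompt: "Create a regression test pack for this skill's trigger and workflow behavior.", expect: "trigger" },
  { id: "P3", type: "positive", prompt: "Validate the SKILL.md, rules, references, and scripts before I ship this skill.", expect: "trigger" },
  { id: "N1", type: "negative", prompt: "Create a new Codex skill for browser QA.", expect: "no-trigger" },
  { id: "N2", type: "negative", prompt: "Run QA on my web app checkout flow.", expect: "no-trigger" },
  { id: "B1", type: "boundary", prompt: "Review this skill and fix any issues you find.", expect: "trigger" },
  { id: "B2", type: "boundary", prompt: "Is this skill ready to ship?", expect: "trigger" },
  { id: "E1", type: "edge", prompt: "Test skills/path-that-does-not-exist/", expect: "graceful-report" },
  { id: "E2", type: "edge", prompt: "Test this skill (no path or content supplied)", expect: "ask-for-target" },
  { id: "R1", type: "regression", prompt: "Re-run a prompt that previously failed to trigger.", expect: "trigger" },
];
```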
<failure_taxonomy>
Classify each issue as one of:
  • `trigger-miss`: target request may not activate the skill.
  • `trigger-overreach`: unrelated request may activate the skill.
  • `scope-conflict`: neighboring skill or workflow owns the request better.
  • `workflow-gap`: instructions do not tell the agent what to do next.
  • `resource-drift`: linked files are missing, stale, duplicated, or misplaced.
  • `validation-gap`: completion can be claimed without evidence.
  • `edge-case-gap`: missing handling for realistic boundary conditions.
  • `safety-gap`: instructions allow risky or irreversible behavior without checks.
</failure_taxonomy>
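When several issues surface, a consistent ordering helps produce the ranked-defects output from the workflow table. One possible sketch, assuming each finding carries a `taxonomy` field; the severity ordering is an example choice, not something this skill prescribes:

```js
// Sketch: surface higher-impact taxonomies first in the ranked defect list.
// RANK is an example ordering; unknown taxonomies sort to the front (-1).
const RANK = [
  "safety-gap", "trigger-miss", "trigger-overreach", "scope-conflict",
  "workflow-gap", "resource-drift", "validation-gap", "edge-case-gap",
];

function rankDefects(findings) {
  return [...findings].sort(
    (a, b) => RANK.indexOf(a.taxonomy) - RANK.indexOf(b.taxonomy)
  );
}
```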
<output_contract>
Default report format:
```markdown
# Skill Test Report

Target: skills/example/
Intended behavior: ...
Verdict: pass | pass-with-risks | fail

## Scenario results

| ID | Type | Prompt / condition | Expected | Observed | Result |
| --- | --- | --- | --- | --- | --- |

## Findings

1. [severity] [taxonomy] Evidence-backed issue and affected file/section.

## Edge cases covered

- ...

## Recommended fixes

- Minimal next edit or handoff target.

## Validation evidence

- Commands run, files read, and checks completed.
```

If the user asks for reusable tests, also create a prompt pack or checklist under the target skill's `references/` or a task-specific `.hypercore/` workspace.

</output_contract>

<validation_checklist>

Before declaring a skill tested, confirm:

- [ ] Target skill and directly linked resources were inspected.
- [ ] Intended behavior was inferred or supplied.
- [ ] Positive, negative, boundary, edge, and regression scenarios were covered.
- [ ] Trigger overlap with neighboring skills was considered.
- [ ] Static resource checks were run when a folder path exists.
- [ ] Failures were classified with evidence and minimal fix guidance.
- [ ] Remaining risks or untested areas were explicitly named.

</validation_checklist>
</validation_checklist>