skill-evaluation

Skill Evaluation

Evaluate skills as operational procedures. A good research skill changes future agent behavior on realistic tasks and leaves inspectable artifacts.

Read First

  • references/skill-evaluation-policy.md
  • references/external-skill-recommendations.md

Workflow

  1. Define the failure mode the skill should prevent.
  2. Write pressure scenarios using real academic tasks: bad PDFs, missing DOIs, unsupported claims, messy repos, ambiguous SOTA, unreliable notebooks, or MCP outages.
  3. Run or reason through the baseline behavior without assuming the skill helps.
  4. Revise the skill frontmatter so it triggers on the right user intents.
  5. Keep `SKILL.md` lean; move detailed policies to `references/`.
  6. Validate local structure with `python3 scripts/validate_skills.py`.
  7. Install-list the package with `npx -y skills add <repo> --list`.
  8. Record recommended external skills separately from custom internal skills.
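
Steps 1 to 3 can be made inspectable by recording each pressure scenario together with its baseline and with-skill outcome. The following is a minimal sketch, not part of this package: the `PressureScenario` class, the `skill_effect` and `summarize` helpers, and the outcome labels are all hypothetical names chosen for illustration, and the sample scenarios are placeholders rather than measured results.

```python
from dataclasses import dataclass


@dataclass
class PressureScenario:
    """One realistic task used to pressure-test a skill (hypothetical schema)."""
    name: str
    failure_mode: str        # the failure the skill is supposed to prevent
    baseline_passed: bool    # outcome without the skill loaded
    with_skill_passed: bool  # outcome with the skill loaded


def skill_effect(scenario: PressureScenario) -> str:
    """Classify what the skill changed, without assuming it helps."""
    if scenario.baseline_passed and scenario.with_skill_passed:
        return "no measurable effect"
    if scenario.baseline_passed and not scenario.with_skill_passed:
        return "regression"
    if scenario.with_skill_passed:
        return "prevents failure"
    return "still failing"


def summarize(scenarios: list[PressureScenario]) -> dict[str, int]:
    """Count scenarios per effect label."""
    counts: dict[str, int] = {}
    for scenario in scenarios:
        label = skill_effect(scenario)
        counts[label] = counts.get(label, 0) + 1
    return counts


# Placeholder data for illustration only, not real evaluation results.
EXAMPLE_SCENARIOS = [
    PressureScenario("missing DOI", "invented citation metadata", False, True),
    PressureScenario("MCP outage", "silent fallback to guessing", False, False),
    PressureScenario("messy repo", "wrong entry point chosen", True, True),
]
```

A scenario labelled "no measurable effect" suggests the baseline already handles the task, so it says nothing about whether the skill is worth keeping; "still failing" and "regression" point back to step 4.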

Review Criteria

  • trigger description starts with `Use when`
  • no workflow summary in frontmatter
  • references exist and are loaded only when needed
  • outputs map to the research project contract
  • citations and evidence rules are explicit
  • skill does not duplicate a stronger external skill without a reason
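
Two of these criteria, the `Use when` prefix and the existence of referenced files, are mechanical enough to check in code. The sketch below is an illustration only and is not the contents of `scripts/validate_skills.py`. It assumes the skill file opens with a `---`-delimited frontmatter block containing a single-line `description:` field; `parse_frontmatter` and `review_skill` are hypothetical helpers, and multi-line YAML values are not handled.

```python
import re

# Matches relative paths such as references/skill-evaluation-policy.md
REFERENCE_PATTERN = re.compile(r"references/[\w./-]+\.md")


def parse_frontmatter(text: str) -> dict[str, str]:
    """Return single-line `key: value` pairs from a leading '---' block."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        key, sep, value = line.partition(":")
        if sep and key.strip():
            fields[key.strip()] = value.strip().strip("\"'")
    return {}  # no closing delimiter, so treat the frontmatter as absent


def review_skill(text: str, existing_paths: set[str]) -> list[str]:
    """List review problems found in one skill file's text."""
    problems = []
    description = parse_frontmatter(text).get("description", "")
    if not description.startswith("Use when"):
        problems.append("description does not start with 'Use when'")
    for path in sorted(set(REFERENCE_PATTERN.findall(text))):
        if path not in existing_paths:
            problems.append(f"missing reference: {path}")
    return problems
```

Passing the set of existing paths as an argument keeps the check independent of the file system, so the same function can run in a unit test or against a real checkout. The remaining criteria, such as whether outputs map to the research project contract, still need a human reviewer.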

Do Not

  • Create one giant "research agent" skill.
  • Hide fragile procedures in prose when a script or test can check them.
  • Copy external skills blindly without adapting output paths and repository contracts.