skill-evaluator

Skill Evaluator (WIP)

Evaluates skills against Anthropic's official best practices for agent skill authoring. Produces structured evaluation reports with scores and actionable recommendations.

Quick Start

  1. Read the skill's SKILL.md and understand its purpose
  2. Run automated validation:
    scripts/validate_skill.py <skill-path>
  3. Perform manual evaluation against criteria below
  4. Generate evaluation report with scores and recommendations

Evaluation Workflow

Step 1: Automated Validation

Run the validation script first:
    scripts/validate_skill.py <path/to/skill>
This checks:
  • SKILL.md exists with valid YAML frontmatter
  • Name follows conventions (lowercase, hyphens, max 64 chars)
  • Description is present and under 1024 chars
  • Body is under 500 lines
  • File references are one-level deep
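The description and body checks listed above could look roughly like the sketch below. This is not the contents of `scripts/validate_skill.py`, which is not shown here; the function names are hypothetical, the frontmatter parsing handles only flat `key: value` pairs instead of full YAML, and treating the limits as inclusive is an assumption.

```python
MAX_DESCRIPTION_CHARS = 1024  # limit quoted in the checklist above
MAX_BODY_LINES = 500          # limit quoted in the checklist above


def split_frontmatter(text):
    """Split SKILL.md text into (frontmatter fields, body lines).

    Simplification: only flat `key: value` pairs are read. A real
    validator would hand the frontmatter to a YAML parser.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("missing opening '---' frontmatter delimiter")
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            break
    else:
        raise ValueError("missing closing '---' frontmatter delimiter")
    fields = {}
    for line in lines[1:index]:
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields, lines[index + 1:]


def check_skill_text(text):
    """Return a list of human-readable problems (empty when all checks pass)."""
    try:
        fields, body = split_frontmatter(text)
    except ValueError as exc:
        return [str(exc)]
    problems = []
    description = fields.get("description", "")
    if not description:
        problems.append("description is missing")
    elif len(description) > MAX_DESCRIPTION_CHARS:
        problems.append(f"description exceeds {MAX_DESCRIPTION_CHARS} characters")
    if len(body) > MAX_BODY_LINES:
        problems.append(f"body exceeds {MAX_BODY_LINES} lines")
    return problems
```

Returning a list of problems rather than raising on the first failure lets the report show every issue in one pass.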

Step 2: Manual Evaluation

Evaluate each dimension and assign a score (1-5):

A. Naming (Weight: 10%)

| Score | Criteria |
|-------|----------|
| 5 | Gerund form (-ing), clear purpose, memorable |
| 4 | Descriptive, follows conventions |
| 3 | Acceptable but could be clearer |
| 2 | Vague or misleading |
| 1 | Violates naming rules |
Rules: Max 64 chars, lowercase + numbers + hyphens only, no reserved words (anthropic, claude), no XML tags.
Good: processing-pdfs, analyzing-spreadsheets, building-dashboards
Bad: pdf, my-skill, ClaudeHelper, anthropic-tools
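As an illustration, the hard naming rules can be checked mechanically, while the gerund preference can only be approximated. The sketch below is an assumption about how such a check might look; `name_violations` and `starts_with_gerund` are hypothetical helpers, and the "-ing" test is a crude heuristic rather than real part-of-speech detection.

```python
import re

MAX_NAME_CHARS = 64
RESERVED_WORDS = ("anthropic", "claude")
# Lowercase letters, digits, and hyphens only. This also rejects XML tags,
# because '<' and '>' are outside the allowed set.
ALLOWED_NAME = re.compile(r"[a-z0-9-]+")


def name_violations(name):
    """Return the hard naming rules that `name` breaks (empty list if none)."""
    violations = []
    if len(name) > MAX_NAME_CHARS:
        violations.append(f"longer than {MAX_NAME_CHARS} characters")
    if not ALLOWED_NAME.fullmatch(name):
        violations.append("only lowercase letters, numbers, and hyphens are allowed")
    lowered = name.lower()
    for word in RESERVED_WORDS:
        if word in lowered:
            violations.append(f"contains reserved word '{word}'")
    return violations


def starts_with_gerund(name):
    """Heuristic only: does the first hyphen-separated word end in 'ing'?"""
    return name.split("-")[0].endswith("ing")
```

A name such as `pdf` passes `name_violations` yet still scores low, because vagueness is a judgment the evaluator makes and not a rule violation.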

B. Description (Weight: 20%)

| Score | Criteria |
|-------|----------|
| 5 | Clear functionality + specific activation triggers + third person |
| 4 | Good description with some triggers |
| 3 | Adequate but missing triggers or vague |
| 2 | Too brief or unclear purpose |
| 1 | Missing or unhelpful |

Must include: What the skill does AND when to use it.
Good: "Extracts text from PDFs. Use when working with PDF documents for text extraction, form parsing, or content analysis."
Bad: "A skill for PDFs." or "Helps with documents."

C. Content Quality (Weight: 30%)

| Score | Criteria |
|-------|----------|
| 5 | Concise, assumes Claude intelligence, actionable instructions |
| 4 | Generally good, minor verbosity |
| 3 | Some unnecessary explanations or redundancy |
| 2 | Overly verbose or confusing |
| 1 | Bloated, explains obvious concepts |
Ask: "Does Claude really need this explanation?" Remove anything Claude already knows.

D. Structure & Organization (Weight: 25%)

| Score | Criteria |
|-------|----------|
| 5 | Excellent progressive disclosure, clear navigation, optimal length |
| 4 | Good organization, appropriate file splits |
| 3 | Acceptable but could be better organized |
| 2 | Poor organization, missing references, or bloated SKILL.md |
| 1 | No structure, everything dumped in SKILL.md |
Check:
  • SKILL.md under 500 lines
  • References are one-level deep (no nested chains)
  • Long reference files (>100 lines) have table of contents
  • Uses forward slashes in all paths
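Two of these checks lend themselves to a quick script. The sketch below assumes the evaluator has already collected, for each file, the list of files it links to (how those links are extracted is left out); both function names are hypothetical, and the backslash pattern is a heuristic that may also flag escape sequences in prose.

```python
import re

# Heuristic: two or more path-like segments joined by backslashes.
BACKSLASH_PATH = re.compile(r"[\w.-]+(?:\\[\w.-]+)+")


def nested_reference_chains(references, root="SKILL.md"):
    """Find references that go more than one level deep.

    `references` maps a file name to the files it links to. Returns
    (file, nested_target) pairs where a file linked from `root` links
    onward to yet another file.
    """
    chains = []
    for ref in references.get(root, []):
        for nested in references.get(ref, []):
            if nested != root:  # a link back to SKILL.md is not a chain
                chains.append((ref, nested))
    return chains


def backslash_paths(text):
    """Return substrings of `text` that look like Windows-style paths."""
    return BACKSLASH_PATH.findall(text)
```

For example, if `SKILL.md` links to `references/a.md` and that file links to `references/c.md`, the first function reports the pair `("references/a.md", "references/c.md")`.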

E. Degrees of Freedom (Weight: 10%)

| Score | Criteria |
|-------|----------|
| 5 | Perfect match: high freedom for flexible tasks, low for fragile operations |
| 4 | Generally appropriate freedom levels |
| 3 | Acceptable but could be better calibrated |
| 2 | Mismatched: too rigid or too loose |
| 1 | Completely wrong freedom level for the task type |
Guideline:
  • High freedom (text): Multiple valid approaches, context-dependent
  • Medium freedom (parameterized): Preferred pattern exists, some variation OK
  • Low freedom (specific scripts): Fragile operations, exact sequence required

F. Anti-Pattern Check (Weight: 5%)

Deduct points for each anti-pattern found:
  • Too many options without clear recommendation (-1)
  • Time-sensitive information with date conditionals (-1)
  • Inconsistent terminology (-1)
  • Windows-style paths (backslashes) (-1)
  • Deeply nested references (more than one level) (-2)
  • Scripts that punt error handling to Claude (-1)
  • Magic numbers without justification (-1)
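The deductions above can be turned into a dimension score along the lines of the sketch below. Two points are assumptions, since the text does not state them: the score starts at 5 before deductions, and it never drops below 1 (the bottom of the 1-5 scale). The dictionary keys are hypothetical labels for the listed anti-patterns.

```python
# Points deducted per anti-pattern, as listed above.
# The key names are illustrative labels, not identifiers from the skill.
DEDUCTIONS = {
    "too_many_options": 1,
    "time_sensitive_info": 1,
    "inconsistent_terminology": 1,
    "windows_paths": 1,
    "nested_references": 2,
    "scripts_punt_errors": 1,
    "magic_numbers": 1,
}

BASELINE_SCORE = 5  # assumption: a skill with no anti-patterns scores 5
MINIMUM_SCORE = 1   # assumption: clamp to the bottom of the 1-5 scale


def anti_pattern_score(found):
    """Return the anti-pattern dimension score for a set of detected labels."""
    found = set(found)
    unknown = found - set(DEDUCTIONS)
    if unknown:
        raise ValueError(f"unknown anti-patterns: {sorted(unknown)}")
    total_deduction = sum(DEDUCTIONS[label] for label in found)
    return max(MINIMUM_SCORE, BASELINE_SCORE - total_deduction)
```

Under these assumptions, a skill with Windows-style paths and deeply nested references loses 3 points and scores 2.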

Step 3: Generate Report

Use this template:

Skill Evaluation Report: [skill-name]

Summary

  • Overall Score: X.X/5.0
  • Recommendation: [Ready for publication / Needs minor improvements / Needs major revision]

Dimension Scores

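| Dimension | Score | Weight | Weighted |
|-----------|-------|--------|----------|
| Naming | X/5 | 10% | X.XX |
| Description | X/5 | 20% | X.XX |
| Content Quality | X/5 | 30% | X.XX |
| Structure | X/5 | 25% | X.XX |
| Degrees of Freedom | X/5 | 10% | X.XX |
| Anti-Patterns | X/5 | 5% | X.XX |
| Total | | 100% | X.XX |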

Strengths

  • [List 2-3 things done well]

Areas for Improvement

  • [List specific issues with actionable fixes]

Anti-Patterns Found

  • [List any anti-patterns detected]

Recommendations

  1. [Priority 1 fix]
  2. [Priority 2 fix]
  3. [Priority 3 fix]

Pre-Publication Checklist

  • Description is specific with activation triggers
  • SKILL.md under 500 lines
  • One-level-deep file references
  • Forward slashes in all paths
  • No time-sensitive information
  • Consistent terminology
  • Concrete examples provided
  • Scripts handle errors explicitly
  • All configuration values justified
  • Required packages listed
  • Tested with Haiku, Sonnet, Opus

Score Interpretation

| Score Range | Rating | Action |
|-------------|--------|--------|
| 4.5 - 5.0 | Excellent | Ready for publication |
| 4.0 - 4.4 | Good | Minor improvements recommended |
| 3.0 - 3.9 | Acceptable | Several improvements needed |
| 2.0 - 2.9 | Needs Work | Major revision required |
| 1.0 - 1.9 | Poor | Fundamental redesign needed |
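To make the arithmetic concrete, the sketch below combines the six dimension scores using the weights from Step 2 and maps the result onto the bands above. Rounding the total to one decimal place is an assumption made to match the X.X format in the report template; it also sidesteps the gaps between bands (for example between 4.4 and 4.5). The function and key names are hypothetical.

```python
# Weights in percent, taken from the dimension headings in Step 2.
WEIGHTS = {
    "naming": 10,
    "description": 20,
    "content_quality": 30,
    "structure": 25,
    "degrees_of_freedom": 10,
    "anti_patterns": 5,
}

# (lower bound, rating, action), checked from highest to lowest.
BANDS = [
    (4.5, "Excellent", "Ready for publication"),
    (4.0, "Good", "Minor improvements recommended"),
    (3.0, "Acceptable", "Several improvements needed"),
    (2.0, "Needs Work", "Major revision required"),
    (1.0, "Poor", "Fundamental redesign needed"),
]


def overall_score(scores):
    """Weighted average of the 1-5 dimension scores, rounded to one decimal."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the six dimensions")
    for dimension, score in scores.items():
        if not 1 <= score <= 5:
            raise ValueError(f"{dimension} score must be between 1 and 5")
    weighted_sum = sum(scores[d] * WEIGHTS[d] for d in WEIGHTS)
    return round(weighted_sum / 100, 1)


def interpret(score):
    """Map an overall score to its (rating, action) pair."""
    for lower_bound, rating, action in BANDS:
        if score >= lower_bound:
            return rating, action
    raise ValueError("score is below the 1.0 minimum")
```

For instance, scores of 4, 5, 4, 3, 4, and 5 (in the order listed in `WEIGHTS`) give 0.40 + 1.00 + 1.20 + 0.75 + 0.40 + 0.25 = 4.0, which falls in the "Good" band.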

References

  • references/evaluation-criteria.md - Detailed evaluation criteria with examples
  • references/scoring-rubric.md - Complete scoring rubric and edge cases

Examples

See evaluations/ for example evaluation scenarios.