skill-evaluator

Skill Evaluator (WIP)

Evaluates skills against Anthropic's official best practices for agent skill authoring. Produces structured evaluation reports with scores and actionable recommendations.

Quick Start

Read the skill's SKILL.md and understand its purpose
Run automated validation:
```
scripts/validate_skill.py <skill-path>
```
Perform manual evaluation against criteria below
Generate evaluation report with scores and recommendations

Evaluation Workflow

Step 1: Automated Validation

Run the validation script first:

```bash
scripts/validate_skill.py <path/to/skill>
```

This checks:

SKILL.md exists with valid YAML frontmatter
Name follows conventions (lowercase, hyphens, max 64 chars)
Description is present and under 1024 chars
Body is under 500 lines
File references are one-level deep
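
The contents of `scripts/validate_skill.py` are not shown here, so the following is only a minimal sketch of how the first four checks could be implemented. The function names `split_frontmatter` and `basic_checks` are hypothetical. The frontmatter parser handles only flat `key: value` pairs, and the limits are treated as inclusive (1024 characters, 500 lines), which is an assumption.

```python
import re

# Lowercase letters, digits, and hyphens, 1 to 64 characters.
NAME_PATTERN = re.compile(r"[a-z0-9-]{1,64}")


def split_frontmatter(text):
    """Split SKILL.md text into (fields dict, body string).

    Simplification: only flat `key: value` pairs are parsed; a real
    validator would more likely use a YAML library.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None, text
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            fields = {}
            for line in lines[1:i]:
                key, sep, value = line.partition(":")
                if sep:
                    fields[key.strip()] = value.strip()
            return fields, "\n".join(lines[i + 1:])
    return None, text  # opening delimiter was never closed


def basic_checks(text):
    """Return a list of human-readable problems (empty if none found)."""
    fields, body = split_frontmatter(text)
    if fields is None:
        return ["missing or unterminated YAML frontmatter"]
    problems = []
    if not NAME_PATTERN.fullmatch(fields.get("name", "")):
        problems.append("name must be lowercase letters, digits, hyphens, max 64 chars")
    description = fields.get("description", "")
    if not description:
        problems.append("description is missing")
    elif len(description) > 1024:  # assumption: 1024 is an inclusive limit
        problems.append("description exceeds 1024 chars")
    if len(body.splitlines()) > 500:  # assumption: 500 is an inclusive limit
        problems.append("body exceeds 500 lines")
    return problems
```

Returning a list of problems rather than raising on the first failure lets a report show every issue in one pass.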

Step 2: Manual Evaluation

Evaluate each dimension and assign a score (1-5):

A. Naming (Weight: 10%)

Score	Criteria
5	Gerund form (-ing), clear purpose, memorable
4	Descriptive, follows conventions
3	Acceptable but could be clearer
2	Vague or misleading
1	Violates naming rules

Rules: Max 64 chars, lowercase + numbers + hyphens only, no reserved words (anthropic, claude), no XML tags.

Good: processing-pdfs, analyzing-spreadsheets, building-dashboards

Bad: pdf, my-skill, ClaudeHelper, anthropic-tools
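
The hard naming rules can be checked mechanically, while the higher scores still need human judgement. The sketch below is illustrative only: `naming_violations` and `starts_with_gerund` are hypothetical helpers, and the gerund test is a rough heuristic on the first hyphen-separated word.

```python
import re

RESERVED_WORDS = ("anthropic", "claude")


def naming_violations(name):
    """Return the hard naming rules that `name` breaks (empty list if none)."""
    violations = []
    if len(name) > 64:
        violations.append("longer than 64 characters")
    if re.search(r"<[^>]*>", name):
        violations.append("contains an XML tag")
    if not re.fullmatch(r"[a-z0-9-]+", name):
        violations.append("uses characters other than lowercase letters, numbers, hyphens")
    lowered = name.lower()
    for word in RESERVED_WORDS:
        if word in lowered:
            violations.append(f"contains reserved word '{word}'")
    return violations


def starts_with_gerund(name):
    """Heuristic only: does the first hyphen-separated word end in -ing?"""
    first = name.split("-")[0]
    return len(first) > 4 and first.endswith("ing")
```

Note that `pdf` and `my-skill` break no hard rule under this sketch. They are poor names because they are vague, which is why they land at score 2 rather than 1 and why the rubric cannot be fully automated.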

B. Description (Weight: 20%)

Score	Criteria
5	Clear functionality + specific activation triggers + third person
4	Good description with some triggers
3	Adequate but missing triggers or vague
2	Too brief or unclear purpose
1	Missing or unhelpful

Must include: What the skill does AND when to use it.

Good: "Extracts text from PDFs. Use when working with PDF documents for text extraction, form parsing, or content analysis."

Bad: "A skill for PDFs." or "Helps with documents."

C. Content Quality (Weight: 30%)

Score	Criteria
5	Concise, assumes Claude intelligence, actionable instructions
4	Generally good, minor verbosity
3	Some unnecessary explanations or redundancy
2	Overly verbose or confusing
1	Bloated, explains obvious concepts

Ask: "Does Claude really need this explanation?" Remove anything Claude already knows.

D. Structure & Organization (Weight: 25%)

Score	Criteria
5	Excellent progressive disclosure, clear navigation, optimal length
4	Good organization, appropriate file splits
3	Acceptable but could be better organized
2	Poor organization, missing references, or bloated SKILL.md
1	No structure, everything dumped in SKILL.md

Check:

SKILL.md under 500 lines
References are one-level deep (no nested chains)
Long reference files (>100 lines) have table of contents
Uses forward slashes in all paths
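
Most of the checks in this list are mechanical, so a rough sketch may help when evaluating by hand. Everything here is an assumption for illustration: `structure_issues` is a hypothetical helper, file references are assumed to be markdown links, a "table of contents" is detected by the word "contents" in the first 20 lines, and 500 and 100 are treated as inclusive limits.

```python
import re

# Captures the target of a markdown link such as [guide](references/guide.md).
LINK_TARGET = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def structure_issues(skill_md, references):
    """skill_md: text of SKILL.md; references: {relative path: file text}."""
    issues = []
    if len(skill_md.splitlines()) > 500:
        issues.append("SKILL.md exceeds 500 lines")
    for target in LINK_TARGET.findall(skill_md):
        if "\\" in target:
            issues.append(f"backslash in path: {target}")
    for path, text in references.items():
        # A reference file that links onward to another markdown file
        # forms a nested chain (more than one level deep).
        nested = [t for t in LINK_TARGET.findall(text) if t.endswith(".md")]
        if nested:
            issues.append(f"{path} links to further files: {', '.join(nested)}")
        lines = text.splitlines()
        has_toc = any("contents" in line.lower() for line in lines[:20])
        if len(lines) > 100 and not has_toc:
            issues.append(f"{path} is over 100 lines without a table of contents")
    return issues
```

The heuristics will produce false positives (for example, an external URL ending in `.md`), so the output is best treated as a prompt for manual review rather than a verdict.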

E. Degrees of Freedom (Weight: 10%)

Score	Criteria
5	Perfect match: high freedom for flexible tasks, low for fragile operations
4	Generally appropriate freedom levels
3	Acceptable but could be better calibrated
2	Mismatched: too rigid or too loose
1	Completely wrong freedom level for the task type

Guideline:

High freedom (text): Multiple valid approaches, context-dependent
Medium freedom (parameterized): Preferred pattern exists, some variation OK
Low freedom (specific scripts): Fragile operations, exact sequence required

F. Anti-Pattern Check (Weight: 5%)

Step 3: Generate Report

Use this template:

Skill Evaluation Report: [skill-name]

Summary

Overall Score: X.X/5.0
Recommendation: [Ready for publication / Needs minor improvements / Needs major revision]

Dimension Scores

Dimension	Score	Weight	Weighted
Naming	X/5	10%	X.XX
Description	X/5	20%	X.XX
Content Quality	X/5	30%	X.XX
Structure	X/5	25%	X.XX
Degrees of Freedom	X/5	10%	X.XX
Anti-Patterns	X/5	5%	X.XX
Total		100%	X.XX

Strengths

[List 2-3 things done well]

Areas for Improvement

[List specific issues with actionable fixes]

Anti-Patterns Found

[List any anti-patterns detected]

Recommendations

[Priority 1 fix]
[Priority 2 fix]
[Priority 3 fix]

Pre-Publication Checklist

Description is specific with activation triggers
SKILL.md under 500 lines
One-level-deep file references
Forward slashes in all paths
No time-sensitive information
Consistent terminology
Concrete examples provided
Scripts handle errors explicitly
All configuration values justified
Required packages listed
Tested with Haiku, Sonnet, Opus

Score Interpretation

Score Range	Rating	Action
4.5 - 5.0	Excellent	Ready for publication
4.0 - 4.4	Good	Minor improvements recommended
3.0 - 3.9	Acceptable	Several improvements needed
2.0 - 2.9	Needs Work	Major revision required
1.0 - 1.9	Poor	Fundamental redesign needed
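
To make the arithmetic concrete, here is a small sketch that combines the six dimension weights into the overall score and maps it onto the ranges above. The names `overall_score` and `interpret` are hypothetical. Rounding to one decimal before the lookup is an assumption, made so that a value such as 4.46 does not fall into the gap between the 4.0 - 4.4 and 4.5 - 5.0 rows.

```python
# Weights from the dimension headings (they sum to 100%).
WEIGHTS = {
    "Naming": 0.10,
    "Description": 0.20,
    "Content Quality": 0.30,
    "Structure": 0.25,
    "Degrees of Freedom": 0.10,
    "Anti-Patterns": 0.05,
}

# (lower bound, rating, action), highest band first.
RATINGS = [
    (4.5, "Excellent", "Ready for publication"),
    (4.0, "Good", "Minor improvements recommended"),
    (3.0, "Acceptable", "Several improvements needed"),
    (2.0, "Needs Work", "Major revision required"),
    (1.0, "Poor", "Fundamental redesign needed"),
]


def overall_score(scores):
    """Weighted sum of 1-5 dimension scores, rounded to one decimal."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    for dimension in WEIGHTS:
        if not 1 <= scores[dimension] <= 5:
            raise ValueError(f"{dimension} score must be between 1 and 5")
    total = sum(scores[d] * weight for d, weight in WEIGHTS.items())
    return round(total, 1)


def interpret(score):
    """Map an overall score to its (rating, action) pair."""
    for lower_bound, rating, action in RATINGS:
        if score >= lower_bound:
            return rating, action
    raise ValueError("score is below 1.0")
```

For example, scores of 4, 5, 4, 3, 4, 5 (in the order of the dimensions above) give 0.40 + 1.00 + 1.20 + 0.75 + 0.40 + 0.25 = 4.0, which falls in the "Good" band.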

References

references/evaluation-criteria.md - Detailed evaluation criteria with examples
references/scoring-rubric.md - Complete scoring rubric and edge cases

Examples

See evaluations/ for example evaluation scenarios.
