skill_evaluator
Skill Evaluator (WIP)
Evaluates skills against Anthropic's official best practices for agent skill authoring. Produces structured evaluation reports with scores and actionable recommendations.
Quick Start
- Read the skill's SKILL.md and understand its purpose
- Run automated validation: `scripts/validate_skill.py <skill-path>`
- Perform manual evaluation against the criteria below
- Generate evaluation report with scores and recommendations
Evaluation Workflow
Step 1: Automated Validation
Run the validation script first:
```bash
scripts/validate_skill.py <path/to/skill>
```
This checks:
- SKILL.md exists with valid YAML frontmatter
- Name follows conventions (lowercase, hyphens, max 64 chars)
- Description is present and under 1024 chars
- Body is under 500 lines
- File references are one-level deep
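The checks listed above could be approximated as follows. This is a minimal sketch, not the actual `scripts/validate_skill.py`, whose contents are not shown here. The function name `validate_skill_text`, the regular expressions, and the naive frontmatter parsing (single-line `key: value` pairs, no YAML library) are assumptions made for illustration, and the file-reference depth check is left out.

```python
import re

NAME_RE = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")  # lowercase, digits, hyphens
MAX_NAME_CHARS = 64
MAX_DESCRIPTION_CHARS = 1024
MAX_BODY_LINES = 500


def validate_skill_text(text: str) -> list[str]:
    """Return a list of problems found in the contents of a SKILL.md file."""
    problems = []
    # Frontmatter is assumed to sit between two '---' lines at the top.
    match = re.match(r"^---\n(.*?)\n---\n?(.*)$", text, re.DOTALL)
    if not match:
        return ["missing YAML frontmatter"]
    frontmatter, body = match.groups()

    # Naive parsing: single-line 'key: value' pairs only (an assumption).
    fields = {}
    for line in frontmatter.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()

    name = fields.get("name", "")
    if not NAME_RE.match(name) or len(name) > MAX_NAME_CHARS:
        problems.append("name violates conventions")

    description = fields.get("description", "")
    if not description:
        problems.append("description is missing")
    elif len(description) > MAX_DESCRIPTION_CHARS:
        problems.append("description exceeds 1024 chars")

    if len(body.splitlines()) >= MAX_BODY_LINES:
        problems.append("body is not under 500 lines")
    return problems
```

An empty list means the file passed these particular checks; it says nothing about the manual criteria in Step 2.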
Step 2: Manual Evaluation
Evaluate each dimension and assign a score (1-5):
A. Naming (Weight: 10%)
| Score | Criteria |
|---|---|
| 5 | Gerund form (-ing), clear purpose, memorable |
| 4 | Descriptive, follows conventions |
| 3 | Acceptable but could be clearer |
| 2 | Vague or misleading |
| 1 | Violates naming rules |
Rules: Max 64 chars, lowercase + numbers + hyphens only, no reserved words (anthropic, claude), no XML tags.
Good: `processing-pdfs`, `analyzing-spreadsheets`, `building-dashboards`
Bad: `pdf`, `my-skill`, `ClaudeHelper`, `anthropic-tools`
B. Description (Weight: 20%)
| Score | Criteria |
|---|---|
| 5 | Clear functionality + specific activation triggers + third person |
| 4 | Good description with some triggers |
| 3 | Adequate but missing triggers or vague |
| 2 | Too brief or unclear purpose |
| 1 | Missing or unhelpful |
Must include: What the skill does AND when to use it.
Good: "Extracts text from PDFs. Use when working with PDF documents for text extraction, form parsing, or content analysis."
Bad: "A skill for PDFs." or "Helps with documents."
C. Content Quality (Weight: 30%)
| Score | Criteria |
|---|---|
| 5 | Concise, assumes Claude intelligence, actionable instructions |
| 4 | Generally good, minor verbosity |
| 3 | Some unnecessary explanations or redundancy |
| 2 | Overly verbose or confusing |
| 1 | Bloated, explains obvious concepts |
Ask: "Does Claude really need this explanation?" Remove anything Claude already knows.
D. Structure & Organization (Weight: 25%)
| Score | Criteria |
|---|---|
| 5 | Excellent progressive disclosure, clear navigation, optimal length |
| 4 | Good organization, appropriate file splits |
| 3 | Acceptable but could be better organized |
| 2 | Poor organization, missing references, or bloated SKILL.md |
| 1 | No structure, everything dumped in SKILL.md |
Check:
- SKILL.md under 500 lines
- References are one-level deep (no nested chains)
- Long reference files (>100 lines) have a table of contents
- Uses forward slashes in all paths
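Two of the checks above lend themselves to simple heuristics. The sketch below is illustrative only: the function names, the backslash pattern, and the assumption that a table of contents is introduced by a line containing the word "contents" near the top of the file are all hypothetical and would need tuning against real skills.

```python
import re

# Matches path-like tokens joined by a backslash, e.g. scripts\run.py.
# This is a heuristic and can also match escape sequences in prose.
WINDOWS_PATH_RE = re.compile(r"[\w.-]+\\[\w.-]+")
LONG_FILE_LINES = 100


def find_windows_paths(text: str) -> list[str]:
    """Return path-like tokens that use backslashes instead of forward slashes."""
    return WINDOWS_PATH_RE.findall(text)


def missing_table_of_contents(reference_text: str) -> bool:
    """True if a reference file is long (>100 lines) but shows no contents section."""
    lines = reference_text.splitlines()
    if len(lines) <= LONG_FILE_LINES:
        return False
    # Assumption: a table of contents appears within the first 20 lines
    # and is introduced by a line mentioning "contents".
    return not any("contents" in line.lower() for line in lines[:20])
```

Both functions only flag candidates for review; the final judgment on structure still belongs to the manual evaluation.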
E. Degrees of Freedom (Weight: 10%)
| Score | Criteria |
|---|---|
| 5 | Perfect match: high freedom for flexible tasks, low for fragile operations |
| 4 | Generally appropriate freedom levels |
| 3 | Acceptable but could be better calibrated |
| 2 | Mismatched: too rigid or too loose |
| 1 | Completely wrong freedom level for the task type |
Guideline:
- High freedom (text): Multiple valid approaches, context-dependent
- Medium freedom (parameterized): Preferred pattern exists, some variation OK
- Low freedom (specific scripts): Fragile operations, exact sequence required
F. Anti-Pattern Check (Weight: 5%)
Deduct points for each anti-pattern found:
- Too many options without clear recommendation (-1)
- Time-sensitive information with date conditionals (-1)
- Inconsistent terminology (-1)
- Windows-style paths (backslashes) (-1)
- Deeply nested references (more than one level) (-2)
- Scripts that punt error handling to Claude (-1)
- Magic numbers without justification (-1)
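The list above gives the size of each deduction but not the starting score or the lowest possible result. The sketch below assumes deductions start from 5 and bottom out at 1, so the result fits the same 1-5 scale as the other dimensions; both values, and the dictionary keys, are assumptions rather than part of the rubric.

```python
# Deduction sizes are taken from the list above; the keys are hypothetical labels.
DEDUCTIONS = {
    "too_many_options": 1,
    "time_sensitive_info": 1,
    "inconsistent_terminology": 1,
    "windows_paths": 1,
    "deeply_nested_references": 2,
    "scripts_punt_errors": 1,
    "magic_numbers": 1,
}


def anti_pattern_score(found: list[str], start: int = 5, floor: int = 1) -> int:
    """Subtract the deduction for each anti-pattern found.

    The default start and floor values are assumptions, not stated in the rubric.
    """
    total = sum(DEDUCTIONS[name] for name in found)
    return max(floor, start - total)
```

For example, a skill with Windows-style paths and deeply nested references would lose 3 points under these assumptions and score 2.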
Step 3: Generate Report
Use this template:
```markdown
Skill Evaluation Report: [skill-name]
Summary
- Overall Score: X.X/5.0
- Recommendation: [Ready for publication / Needs minor improvements / Needs major revision]
Dimension Scores
| Dimension | Score | Weight | Weighted |
|---|---|---|---|
| Naming | X/5 | 10% | X.XX |
| Description | X/5 | 20% | X.XX |
| Content Quality | X/5 | 30% | X.XX |
| Structure | X/5 | 25% | X.XX |
| Degrees of Freedom | X/5 | 10% | X.XX |
| Anti-Patterns | X/5 | 5% | X.XX |
| Total | | 100% | X.XX |
Strengths
- [List 2-3 things done well]
Areas for Improvement
- [List specific issues with actionable fixes]
Anti-Patterns Found
- [List any anti-patterns detected]
Recommendations
- [Priority 1 fix]
- [Priority 2 fix]
- [Priority 3 fix]
Pre-Publication Checklist
- Description is specific with activation triggers
- SKILL.md under 500 lines
- One-level-deep file references
- Forward slashes in all paths
- No time-sensitive information
- Consistent terminology
- Concrete examples provided
- Scripts handle errors explicitly
- All configuration values justified
- Required packages listed
- Tested with Haiku, Sonnet, Opus
```
Score Interpretation
| Score Range | Rating | Action |
|---|---|---|
| 4.5 - 5.0 | Excellent | Ready for publication |
| 4.0 - 4.4 | Good | Minor improvements recommended |
| 3.0 - 3.9 | Acceptable | Several improvements needed |
| 2.0 - 2.9 | Needs Work | Major revision required |
| 1.0 - 1.9 | Poor | Fundamental redesign needed |
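The overall score is the weighted sum of the six dimension scores, which is then mapped to a rating using the table above. The sketch below shows that arithmetic. The dictionary keys and function names are hypothetical, and rounding to one decimal place is an assumption made so that every result falls into exactly one band (the table leaves no band for a value such as 4.45).

```python
# Weights from the dimension headings in Step 2; the keys are hypothetical labels.
WEIGHTS = {
    "naming": 0.10,
    "description": 0.20,
    "content_quality": 0.30,
    "structure": 0.25,
    "degrees_of_freedom": 0.10,
    "anti_patterns": 0.05,
}

# Lower bound of each band in the table above, highest first.
RATINGS = [
    (4.5, "Excellent"),
    (4.0, "Good"),
    (3.0, "Acceptable"),
    (2.0, "Needs Work"),
    (1.0, "Poor"),
]


def overall_score(scores: dict[str, int]) -> float:
    """Weighted sum of the six dimension scores, rounded to one decimal."""
    return round(sum(scores[name] * weight for name, weight in WEIGHTS.items()), 1)


def rating(score: float) -> str:
    """Map an overall score to the rating label of the band it falls in."""
    for lower_bound, label in RATINGS:
        if score >= lower_bound:
            return label
    raise ValueError("score below 1.0")


example = {
    "naming": 4,
    "description": 5,
    "content_quality": 4,
    "structure": 3,
    "degrees_of_freedom": 4,
    "anti_patterns": 5,
}
# 0.40 + 1.00 + 1.20 + 0.75 + 0.40 + 0.25 = 4.00
print(overall_score(example), rating(overall_score(example)))  # 4.0 Good
```

Because Content Quality and Structure together carry 55% of the weight, a weak score in either one pulls the total down more than a weak Naming or Anti-Patterns score does.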
References
- references/evaluation-criteria.md - Detailed evaluation criteria with examples
- references/scoring-rubric.md - Complete scoring rubric and edge cases
Skill Evaluator v1.1 - Enhanced
🔄 Workflow
Phase 1: Structural Analysis
- Compliance: Does the file structure (`scripts/`, `references/`) conform to the standard?
- Metadata: Is the YAML frontmatter (`name`, `description`) complete and valid?
- Modularity: Is the skill too large? Should it be split? (Single Responsibility Principle)
Phase 2: Content & Semantic Review
- Clarity: Are the instructions written clearly and in the imperative mood? Is there any ambiguity?
- Context Efficiency: Is there "unnecessary politeness" or "over-explanation"? Avoid wasting tokens.
- Safety: Does the skill suggest a dangerous operation (file deletion, unauthorized access)?
Phase 3: Functionality Verification
- Script Audit: Is the Python/Bash code in `scripts/` safe and working?
- Reference Check: Are the files in `references/` really necessary, or should they be embedded in `SKILL.md`?
- Usability: Can a user (or agent) read this skill and use it right away?
Checkpoints
| Phase | Verification |
|---|---|
| 1 | Are the skill name and description consistent with each other? |
| 2 | Was an anti-pattern (e.g., a hardcoded path) detected? |
| 3 | Was an objective score (1-5) assigned according to the scoring rubric? |