skill-judge

Skill Judge


Evaluate Agent Skills against official specifications and patterns derived from 17+ official examples.


Core Philosophy


What is a Skill?


A Skill is NOT a tutorial. A Skill is a knowledge externalization mechanism.
Traditional AI knowledge is locked in model parameters. To teach new capabilities:
Traditional: Collect data → GPU cluster → Train → Deploy new version
Cost: $10,000 - $1,000,000+
Timeline: Weeks to months
Skills change this:
Skill: Edit SKILL.md → Save → Takes effect on next invocation
Cost: $0
Timeline: Instant
This is the paradigm shift from "training AI" to "educating AI" — like a hot-swappable LoRA adapter that requires no training. You edit a Markdown file in natural language, and the model's behavior changes.

The Core Formula


Good Skill = Expert-only Knowledge − What Claude Already Knows
A Skill's value is measured by its knowledge delta — the gap between what it provides and what the model already knows.
  • Expert-only knowledge: Decision trees, trade-offs, edge cases, anti-patterns, domain-specific thinking frameworks — things that take years of experience to accumulate
  • What Claude already knows: Basic concepts, standard library usage, common programming patterns, general best practices
When a Skill explains "what is PDF" or "how to write a for-loop", it is restating knowledge Claude already has. This wastes tokens: the context window is a shared resource, divided among system prompts, conversation history, other Skills, and user requests.

Tool vs Skill


| Concept | Essence | Function | Example |
|---------|---------|----------|---------|
| Tool | What model CAN do | Execute actions | bash, read_file, write_file, WebSearch |
| Skill | What model KNOWS how to do | Guide decisions | PDF processing, MCP building, frontend design |
Tools define capability boundaries: without the bash tool, the model can't execute commands. Skills inject knowledge: without the frontend-design Skill, the model produces generic UI.
The equation:
General Agent + Excellent Skill = Domain Expert Agent
The same Claude model, loaded with different Skills, becomes a different expert.

Three Types of Knowledge in Skills


When evaluating, categorize each section:
| Type | Definition | Treatment |
|------|------------|-----------|
| Expert | Claude genuinely doesn't know this | Must keep: this is the Skill's value |
| Activation | Claude knows but may not think of | Keep if brief: serves as reminder |
| Redundant | Claude definitely knows this | Should delete: wastes tokens |
The art of Skill design is maximizing Expert content, using Activation sparingly, and eliminating Redundant ruthlessly.


Evaluation Dimensions (120 points total)


D1: Knowledge Delta (20 points) — THE CORE DIMENSION


The most important dimension. Does the Skill add genuine expert knowledge?
| Score | Criteria |
|-------|----------|
| 0-5 | Explains basics Claude knows (what is X, how to write code, standard library tutorials) |
| 6-10 | Mixed: some expert knowledge diluted by obvious content |
| 11-15 | Mostly expert knowledge with minimal redundancy |
| 16-20 | Pure knowledge delta: every paragraph earns its tokens |
Red flags (instant score ≤5):
  • "What is [basic concept]" sections
  • Step-by-step tutorials for standard operations
  • Explaining how to use common libraries
  • Generic best practices ("write clean code", "handle errors")
  • Definitions of industry-standard terms
Green flags (indicators of high knowledge delta):
  • Decision trees for non-obvious choices ("when X fails, try Y because Z")
  • Trade-offs only an expert would know ("A is faster but B handles edge case C")
  • Edge cases from real-world experience
  • "NEVER do X because [non-obvious reason]"
  • Domain-specific thinking frameworks
Evaluation questions:
  1. For each section, ask: "Does Claude already know this?"
  2. If explaining something, ask: "Is this explaining TO Claude or FOR Claude?"
  3. Count paragraphs that are Expert vs Activation vs Redundant


D2: Mindset + Appropriate Procedures (15 points)


Does the Skill transfer expert thinking patterns along with necessary domain-specific procedures?
The difference between experts and novices isn't "knowing how to operate" — it's "how to think about the problem." But thinking patterns alone aren't enough when Claude lacks domain-specific procedural knowledge.
Key distinction:
| Type | Example | Value |
|------|---------|-------|
| Thinking patterns | "Before designing, ask: What makes this memorable?" | High: shapes decision-making |
| Domain-specific procedures | "OOXML workflow: unpack → edit XML → validate → pack" | High: Claude may not know this |
| Generic procedures | "Step 1: Open file, Step 2: Edit, Step 3: Save" | Low: Claude already knows |

| Score | Criteria |
|-------|----------|
| 0-3 | Only generic procedures Claude already knows |
| 4-7 | Has domain procedures but lacks thinking frameworks |
| 8-11 | Good balance: thinking patterns + domain-specific workflows |
| 12-15 | Expert-level: shapes thinking AND provides procedures Claude wouldn't know |
What counts as valuable procedures:
  • Workflows Claude hasn't been trained on (new tools, proprietary systems)
  • Correct ordering that's non-obvious (e.g., "validate BEFORE packing, not after")
  • Critical steps that are easy to miss (e.g., "MUST recalculate formulas after editing")
  • Domain-specific sequences (e.g., MCP server's 4-phase development process)
What counts as redundant procedures:
  • Generic file operations (open, read, write, save)
  • Standard programming patterns (loops, conditionals, error handling)
  • Common library usage that's well-documented
Expert thinking patterns look like:
markdown
Before [action], ask yourself:
- **Purpose**: What problem does this solve? Who uses it?
- **Constraints**: What are the hidden requirements?
- **Differentiation**: What makes this solution memorable?
Valuable domain procedures look like:
markdown
### Redlining Workflow (Claude wouldn't know this sequence)
1. Convert to markdown: `pandoc --track-changes=all`
2. Map text to XML: grep for text in document.xml
3. Implement changes in batches of 3-10
4. Pack and verify: check ALL changes were applied
Redundant generic procedures look like:
markdown
Step 1: Open the file
Step 2: Find the section
Step 3: Make the change
Step 4: Save and test
The test:
  1. Does it tell Claude WHAT to think about? (thinking patterns)
  2. Does it tell Claude HOW to do things it wouldn't know? (domain procedures)
A good Skill provides both when needed.

D3: Anti-Pattern Quality (15 points)


Does the Skill have effective NEVER lists?
Why this matters: Half of expert knowledge is knowing what NOT to do. A senior designer sees a purple gradient on a white background and instinctively cringes: "too AI-generated." This intuition for "what absolutely not to do" comes from stepping on countless landmines.
Claude hasn't stepped on these landmines. It doesn't know the Inter font is overused, or that purple gradients are a signature of AI-generated content. Good Skills must explicitly state these "absolute don'ts."
| Score | Criteria |
|-------|----------|
| 0-3 | No anti-patterns mentioned |
| 4-7 | Generic warnings ("avoid errors", "be careful", "consider edge cases") |
| 8-11 | Specific NEVER list with some reasoning |
| 12-15 | Expert-grade anti-patterns with WHY: things only experience teaches |
Expert anti-patterns (specific + reason):
markdown
NEVER use generic AI-generated aesthetics like:
- Overused font families (Inter, Roboto, Arial)
- Cliched color schemes (particularly purple gradients on white backgrounds)
- Predictable layouts and component patterns
- Default border-radius on everything
Weak anti-patterns (vague, no reasoning):
markdown
Avoid making mistakes.
Be careful with edge cases.
Don't write bad code.
The test: Would an expert read the anti-pattern list and say "yes, I learned this the hard way"? Or would they say "this is obvious to everyone"?


D4: Specification Compliance — Especially Description (15 points)


Does the Skill follow official format requirements? Special focus on description quality.
| Score | Criteria |
|-------|----------|
| 0-5 | Missing frontmatter or invalid format |
| 6-10 | Has frontmatter but description is vague or incomplete |
| 11-13 | Valid frontmatter, description has WHAT but weak on WHEN |
| 14-15 | Perfect: comprehensive description with WHAT, WHEN, and trigger keywords |
Frontmatter requirements:
  • name: lowercase, alphanumeric + hyphens only, ≤64 characters
  • description: THE MOST CRITICAL FIELD; determines if skill gets used at all
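The `name` rule above is mechanical enough to check in code. The following is a minimal sketch, assuming the rule means "lowercase ASCII letters, digits, and hyphens, 1 to 64 characters"; the function name `is_valid_skill_name` is hypothetical and not part of any official tooling.

```python
import re

# Assumed reading of the rule: lowercase letters, digits, hyphens; 1 to 64 chars.
NAME_PATTERN = re.compile(r"[a-z0-9-]{1,64}")


def is_valid_skill_name(name: str) -> bool:
    """Return True if `name` satisfies the frontmatter naming rule as read above."""
    return NAME_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("skill-judge", "Skill Judge", "pdf_tools"):
        print(candidate, is_valid_skill_name(candidate))
```

A check like this only covers the `name` field; description quality still needs human or model judgment.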

Why description is THE MOST IMPORTANT field:
┌─────────────────────────────────────────────────────────────────────┐
│  SKILL ACTIVATION FLOW                                              │
│                                                                     │
│  User Request → Agent sees ALL skill descriptions → Decides which  │
│                 (only descriptions, not bodies!)     to activate    │
│                                                                     │
│  If description doesn't match → Skill NEVER gets loaded            │
│  If description is vague → Skill might not trigger when it should  │
│  If description lacks keywords → Skill is invisible to the Agent   │
└─────────────────────────────────────────────────────────────────────┘
The brutal truth: A Skill with perfect content but poor description is useless — it will never be activated. The description is the only chance to tell the Agent "use me in these situations."

Description must answer THREE questions:
  1. WHAT: What does this Skill do? (functionality)
  2. WHEN: In what situations should it be used? (trigger scenarios)
  3. KEYWORDS: What terms should trigger this Skill? (searchable terms)
Excellent description (all three elements):
yaml
description: "Comprehensive document creation, editing, and analysis with support
for tracked changes, comments, formatting preservation, and text extraction.
When Claude needs to work with professional documents (.docx files) for:
(1) Creating new documents, (2) Modifying or editing content,
(3) Working with tracked changes, (4) Adding comments, or any other document tasks"
Analysis:
  • WHAT: creation, editing, analysis, tracked changes, comments
  • WHEN: "When Claude needs to work with... for: (1)... (2)... (3)..."
  • KEYWORDS: .docx files, tracked changes, professional documents
Poor description (missing elements):
yaml
description: "Handles document-related functions"
Problems:
  • WHAT: vague ("document-related functions": what specifically?)
  • WHEN: missing (when should Agent use this?)
  • KEYWORDS: missing (no ".docx", no specific scenarios)
Another poor example:
yaml
description: "A helpful skill for various tasks"
This is useless — Agent has no idea when to activate it.

Description quality checklist:
  • Lists specific capabilities (not just "helps with X")
  • Includes explicit trigger scenarios ("Use when...", "When user asks for...")
  • Contains searchable keywords (file extensions, domain terms, action verbs)
  • Specific enough that Agent knows EXACTLY when to use it
  • Includes scenarios where this skill MUST be used (not just "can be used")
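Parts of the checklist above can be approximated with crude heuristics before a full review. The sketch below is illustrative only: the trigger phrases, the word-count threshold, and the file-extension regex are all assumptions chosen for this example, and a passing result does not mean the description is good.

```python
import re

# Hypothetical trigger phrases; extend these for your own Skills.
TRIGGER_PHRASES = ("use when", "when claude needs", "when user asks", "when the user")

# Matches something that looks like a file extension, e.g. ".docx" (assumed proxy for keywords).
EXTENSION_PATTERN = re.compile(r"(?<!\w)\.[a-z][a-z0-9]{1,4}\b")


def check_description(description: str) -> dict[str, bool]:
    """Rough WHAT / WHEN / KEYWORDS signals for a Skill description."""
    text = description.lower()
    return {
        # WHAT: very crude proxy, the description is long enough to name capabilities.
        "what": len(text.split()) >= 8,
        # WHEN: an explicit trigger phrase is present.
        "when": any(phrase in text for phrase in TRIGGER_PHRASES),
        # KEYWORDS: at least one file-extension-like term is present.
        "keywords": EXTENSION_PATTERN.search(text) is not None,
    }


if __name__ == "__main__":
    print(check_description("A helpful skill for various tasks"))
```

Treat any `False` as a prompt to look closer, not as a score.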


D5: Progressive Disclosure (15 points)


Does the Skill implement proper content layering?
Skill loading has three layers:
Layer 1: Metadata (always in memory)
         Only name + description
         ~100 tokens per skill

Layer 2: SKILL.md Body (loaded after triggering)
         Detailed guidelines, code examples, decision trees
         Ideal: < 500 lines

Layer 3: Resources (loaded on demand)
         scripts/, references/, assets/
         No limit
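To make the three layers concrete, here is a small Python model of what ends up in context at each stage. It is a sketch under stated assumptions: the `Skill` dataclass and `context_for` function are hypothetical names invented for illustration and do not describe how any real agent runtime is implemented.

```python
from dataclasses import dataclass, field


@dataclass
class Skill:
    name: str                      # Layer 1: always in context
    description: str               # Layer 1: always in context
    body: str = ""                 # Layer 2: SKILL.md body, loaded after triggering
    resources: dict[str, str] = field(default_factory=dict)  # Layer 3: loaded on demand


def context_for(skill: Skill, triggered: bool, requested: tuple[str, ...] = ()) -> list[str]:
    """Return the pieces of a Skill that would occupy context (hypothetical model)."""
    parts = [f"{skill.name}: {skill.description}"]  # metadata is always present
    if triggered:
        parts.append(skill.body)                    # body only after the Skill triggers
        parts.extend(skill.resources[name] for name in requested)  # explicit loads only
    return parts


if __name__ == "__main__":
    docx = Skill("docx", "Create and edit .docx files", "BODY", {"ooxml.md": "OOXML notes"})
    print(context_for(docx, triggered=False))
    print(context_for(docx, triggered=True, requested=("ooxml.md",)))
```

The point of the model is that an untriggered Skill costs only its metadata, which is why trigger information has to live in the description.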
| Score | Criteria |
|-------|----------|
| 0-5 | Everything dumped in SKILL.md (>500 lines, no structure) |
| 6-10 | Has references but unclear when to load them |
| 11-13 | Good layering with MANDATORY triggers present |
| 14-15 | Perfect: decision trees + explicit triggers + "Do NOT Load" guidance |

For Skills WITH references directory, check Loading Trigger Quality:

| Trigger Quality | Characteristics |
|-----------------|-----------------|
| Poor | References listed at end, no loading guidance |
| Mediocre | Some triggers but not embedded in workflow |
| Good | MANDATORY triggers in workflow steps |
| Excellent | Scenario detection + conditional triggers + "Do NOT Load" |
The loading problem:
Loading too little ◄─────────────────────────────────► Loading too much
- References sit unused                    - Wastes context space
- Agent doesn't know when to load          - Irrelevant info dilutes key content
- Knowledge is there but never accessed    - Unnecessary token overhead
Good loading trigger (embedded in workflow):
markdown
### Creating New Document
MANDATORY - READ ENTIRE FILE: Before proceeding, you MUST read `docx-js.md` (~500 lines) completely from start to finish. NEVER set any range limits when reading this file.
Do NOT load `ooxml.md` or `redlining.md` for this task.
Bad loading trigger (just listed):
markdown
## References
- docx-js.md - for creating documents
- ooxml.md - for editing
- redlining.md - for tracking changes
For simple Skills (no references, <100 lines): Score based on conciseness and self-containment.

D6: Freedom Calibration (15 points)


Is the level of specificity appropriate for the task's fragility?
Different tasks need different levels of constraint. This is about matching freedom to fragility.
| Score | Criteria |
|-------|----------|
| 0-5 | Severely mismatched (rigid scripts for creative tasks, vague for fragile ops) |
| 6-10 | Partially appropriate, some mismatches |
| 11-13 | Good calibration for most scenarios |
| 14-15 | Perfect freedom calibration throughout |

The freedom spectrum:

| Task Type | Should Have | Why | Example Skill |
|-----------|-------------|-----|---------------|
| Creative/Design | High freedom | Multiple valid approaches, differentiation is value | frontend-design |
| Code review | Medium freedom | Principles exist but judgment required | code-review |
| File format operations | Low freedom | One wrong byte corrupts file, consistency critical | docx, xlsx, pdf |
High freedom (text-based instructions):
markdown
Commit to a BOLD aesthetic direction. Pick an extreme: brutally minimal,
maximalist chaos, retro-futuristic, organic natural...
Medium freedom (pseudocode or parameterized):
markdown
Review priority:
1. Security vulnerabilities (must fix)
2. Logic errors (must fix)
3. Performance issues (should fix)
4. Maintainability (optional)
Low freedom (specific scripts, exact steps):
markdown
**MANDATORY**: Use exact script in `scripts/create-doc.py`
Parameters: --title "X" --author "Y"
Do NOT modify the script.
The test: Ask "if Agent makes a mistake, what's the consequence?"
  • High consequence → Low freedom
  • Low consequence → High freedom


D7: Pattern Recognition (10 points)


Does the Skill follow an established official pattern?
Through analyzing 17 official Skills, we identified 5 main design patterns:
| Pattern | ~Lines | Key Characteristics | Example | When to Use |
|---------|--------|---------------------|---------|-------------|
| Mindset | ~50 | Thinking > technique, strong NEVER list, high freedom | frontend-design | Creative tasks requiring taste |
| Navigation | ~30 | Minimal SKILL.md, routes to sub-files | internal-comms | Multiple distinct scenarios |
| Philosophy | ~150 | Two-step: Philosophy → Express, emphasizes craft | canvas-design | Art/creation requiring originality |
| Process | ~200 | Phased workflow, checkpoints, medium freedom | mcp-builder | Complex multi-step projects |
| Tool | ~300 | Decision trees, code examples, low freedom | docx, pdf, xlsx | Precise operations on specific formats |

| Score | Criteria |
|-------|----------|
| 0-3 | No recognizable pattern, chaotic structure |
| 4-6 | Partially follows a pattern with significant deviations |
| 7-8 | Clear pattern with minor deviations |
| 9-10 | Masterful application of appropriate pattern |

Pattern selection guide:

| Your Task Characteristics | Recommended Pattern |
|---------------------------|---------------------|
| Needs taste and creativity | Mindset (~50 lines) |
| Needs originality and craft quality | Philosophy (~150 lines) |
| Has multiple distinct sub-scenarios | Navigation (~30 lines) |
| Complex multi-step project | Process (~200 lines) |
| Precise operations on specific format | Tool (~300 lines) |


D8: Practical Usability (15 points)


Can an Agent actually use this Skill effectively?
| Score | Criteria |
|-------|----------|
| 0-5 | Confusing, incomplete, contradictory, or untested guidance |
| 6-10 | Usable but with noticeable gaps |
| 11-13 | Clear guidance for common cases |
| 14-15 | Comprehensive coverage including edge cases and error handling |
Check for:
  • Decision trees: For multi-path scenarios, is there clear guidance on which path to take?
  • Code examples: Do they actually work? Or are they pseudocode that breaks?
  • Error handling: What if the main approach fails? Are fallbacks provided?
  • Edge cases: Are unusual but realistic scenarios covered?
  • Actionability: Can the Agent act immediately, or does it need to figure things out?
Good usability (decision tree + fallback):
markdown
| Task | Primary Tool | Fallback | When to Use Fallback |
|------|-------------|----------|----------------------|
| Read text | pdftotext | PyMuPDF | Need layout info |
| Extract tables | camelot-py | tabula-py | camelot fails |

**Common issues**:
- Scanned PDF: pdftotext returns blank → Use OCR first
- Encrypted PDF: Permission error → Use PyMuPDF with password
Poor usability (vague):
markdown
Use appropriate tools for PDF processing.
Handle errors properly.
Consider edge cases.


NEVER Do When Evaluating


  • NEVER give high scores just because it "looks professional" or is well-formatted
  • NEVER ignore token waste — every redundant paragraph should result in deduction
  • NEVER let length impress you — a 43-line Skill can outperform a 500-line Skill
  • NEVER skip mentally testing the decision trees — do they actually lead to correct choices?
  • NEVER forgive explaining basics with "but it provides helpful context"
  • NEVER overlook missing anti-patterns — if there's no NEVER list, that's a significant gap
  • NEVER assume all procedures are valuable — distinguish domain-specific from generic
  • NEVER undervalue the description field — poor description = skill never gets used
  • NEVER put "when to use" info only in the body — Agent only sees description before loading


Evaluation Protocol


Step 1: First Pass — Knowledge Delta Scan


Read SKILL.md completely and for each section ask:
"Does Claude already know this?"
Mark each section as:
  • [E] Expert: Claude genuinely doesn't know this — value-add
  • [A] Activation: Claude knows but brief reminder is useful — acceptable
  • [R] Redundant: Claude definitely knows this — should be deleted
Calculate rough ratio: E:A:R
  • Good Skill: >70% Expert, <20% Activation, <10% Redundant
  • Mediocre Skill: 40-70% Expert, high Activation
  • Bad Skill: <40% Expert, high Redundant
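The ratio check above can be sketched in a few lines of Python. This is a minimal illustration, assuming each section has already been labeled by hand as "E", "A", or "R"; the function names are hypothetical. The source thresholds leave some combinations unspecified (for example, more than 70% Expert but 10% or more Redundant), and this sketch treats those as "mediocre", which is an assumption.

```python
from collections import Counter


def knowledge_ratio(labels: list[str]) -> dict[str, float]:
    """Return the share of Expert, Activation, and Redundant sections."""
    if not labels:
        raise ValueError("at least one labeled section is required")
    counts = Counter(labels)
    return {kind: counts[kind] / len(labels) for kind in ("E", "A", "R")}


def classify(ratio: dict[str, float]) -> str:
    """Map an E:A:R ratio onto the good / mediocre / bad bands listed above."""
    if ratio["E"] > 0.70 and ratio["A"] < 0.20 and ratio["R"] < 0.10:
        return "good"
    if ratio["E"] >= 0.40:
        return "mediocre"  # assumption: also covers combinations the bands leave open
    return "bad"


if __name__ == "__main__":
    sections = ["E"] * 16 + ["A"] * 3 + ["R"]
    print(classify(knowledge_ratio(sections)))
```

The labeling itself remains the hard part; the arithmetic only keeps the verdict consistent across evaluations.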

Step 2: Structure Analysis


[ ] Check frontmatter validity
[ ] Count total lines in SKILL.md
[ ] List all reference files and their sizes
[ ] Identify which pattern the Skill follows
[ ] Check for loading triggers (if references exist)
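The mechanical parts of this checklist (line count, frontmatter presence, reference file sizes) can be gathered with a short script. The sketch below assumes a Skill directory that contains a `SKILL.md` file; the function name `structure_summary` is hypothetical, and the frontmatter check only detects an opening and closing `---` line rather than validating the YAML.

```python
from pathlib import Path


def structure_summary(skill_dir: str) -> dict:
    """Collect basic structural facts about a Skill directory."""
    root = Path(skill_dir)
    skill_md = root / "SKILL.md"
    lines = skill_md.read_text(encoding="utf-8").splitlines()

    # Presence check only: first line is '---' and a later line closes the block.
    has_frontmatter = (
        bool(lines)
        and lines[0].strip() == "---"
        and "---" in (line.strip() for line in lines[1:])
    )

    # Every other file under the Skill directory, with its size in bytes.
    references = {
        str(path.relative_to(root)): path.stat().st_size
        for path in sorted(root.rglob("*"))
        if path.is_file() and path != skill_md
    }
    return {
        "line_count": len(lines),
        "has_frontmatter": has_frontmatter,
        "references": references,
    }
```

Identifying the pattern and judging loading triggers still require reading the content; this only supplies the numbers.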

Step 3: Score Each Dimension


For each of the 8 dimensions:
  1. Find specific evidence (quote relevant lines)
  2. Assign score with one-line justification
  3. Note specific improvements if score < max

Step 4: Calculate Total & Grade


Total = D1 + D2 + D3 + D4 + D5 + D6 + D7 + D8
Max = 120 points
Grade Scale (percentage-based):
| Grade | Percentage | Meaning |
|-------|------------|---------|
| A | 90%+ (108+) | Excellent: production-ready expert Skill |
| B | 80-89% (96-107) | Good: minor improvements needed |
| C | 70-79% (84-95) | Adequate: clear improvement path |
| D | 60-69% (72-83) | Below Average: significant issues |
| F | <60% (<72) | Poor: needs fundamental redesign |
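The total and grade calculation can be written down directly from the dimension maxima and the point thresholds in the grade scale. This is a minimal sketch; the names `MAX_SCORES` and `grade` are hypothetical, and integer point thresholds are used instead of percentages to avoid rounding surprises.

```python
# Maximum points per dimension, as listed in the evaluation dimensions.
MAX_SCORES = {"D1": 20, "D2": 15, "D3": 15, "D4": 15, "D5": 15, "D6": 15, "D7": 10, "D8": 15}


def grade(scores: dict[str, int]) -> tuple[int, float, str]:
    """Return (total, percentage, letter grade) for a set of dimension scores."""
    for dim, maximum in MAX_SCORES.items():
        if not 0 <= scores[dim] <= maximum:
            raise ValueError(f"{dim} must be between 0 and {maximum}")
    total = sum(scores[dim] for dim in MAX_SCORES)
    percentage = 100 * total / sum(MAX_SCORES.values())  # denominator is 120
    if total >= 108:
        letter = "A"
    elif total >= 96:
        letter = "B"
    elif total >= 84:
        letter = "C"
    elif total >= 72:
        letter = "D"
    else:
        letter = "F"
    return total, percentage, letter


if __name__ == "__main__":
    print(grade({"D1": 16, "D2": 12, "D3": 12, "D4": 12, "D5": 12, "D6": 12, "D7": 8, "D8": 12}))
```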

Step 5: Generate Report


markdown
# Skill Evaluation Report: [Skill Name]
## Summary
- Total Score: X/120 (X%)
- Grade: [A/B/C/D/F]
- Pattern: [Mindset/Navigation/Philosophy/Process/Tool]
- Knowledge Ratio: E:A:R = X:Y:Z
- Verdict: [One sentence assessment]
## Dimension Scores
| Dimension | Score | Max | Notes |
|-----------|-------|-----|-------|
| D1: Knowledge Delta | X | 20 | |
| D2: Mindset + Appropriate Procedures | X | 15 | |
| D3: Anti-Pattern Quality | X | 15 | |
| D4: Specification Compliance | X | 15 | |
| D5: Progressive Disclosure | X | 15 | |
| D6: Freedom Calibration | X | 15 | |
| D7: Pattern Recognition | X | 10 | |
| D8: Practical Usability | X | 15 | |
## Critical Issues
[List must-fix problems that significantly impact the Skill's effectiveness]
## Top 3 Improvements
1. [Highest impact improvement with specific guidance]
2. [Second priority improvement]
3. [Third priority improvement]
## Detailed Analysis
[For each dimension scoring below 80%, provide:
- What's missing or problematic
- Specific examples from the Skill
- Concrete suggestions for improvement]

Common Failure Patterns


Pattern 1: The Tutorial


Symptom: Explains what PDF is, how Python works, basic library usage
Root cause: Author assumes Skill should "teach" the model
Fix: Claude already knows this. Delete all basic explanations.
     Focus on expert decisions, trade-offs, and anti-patterns.

Pattern 2: The Dump


Symptom: SKILL.md is 800+ lines with everything included
Root cause: No progressive disclosure design
Fix: Core routing and decision trees in SKILL.md (<300 lines ideal)
     Detailed content in references/, loaded on-demand

Pattern 3: The Orphan References


Symptom: References directory exists but files are never loaded
Root cause: No explicit loading triggers
Fix: Add "MANDATORY - READ ENTIRE FILE" at workflow decision points
     Add "Do NOT Load" to prevent over-loading

Pattern 4: The Checkbox Procedure


Symptom: Step 1, Step 2, Step 3... mechanical procedures
Root cause: Author thinks in procedures, not thinking frameworks
Fix: Transform into "Before doing X, ask yourself..."
     Focus on decision principles, not operation sequences

Pattern 5: The Vague Warning


Symptom: "Be careful", "avoid errors", "consider edge cases"
Root cause: Author knows things can go wrong but hasn't articulated specifics
Fix: Specific NEVER list with concrete examples and non-obvious reasons
     "NEVER use X because [specific problem that takes experience to learn]"

Pattern 6: The Invisible Skill


Symptom: Great content but skill rarely gets activated
Root cause: Description is vague, missing keywords, or lacks trigger scenarios
Fix: Description must answer WHAT, WHEN, and include KEYWORDS
     "Use when..." + specific scenarios + searchable terms

Example fix:
BAD:  "Helps with document tasks"
GOOD: "Create, edit, and analyze .docx files. Use when working with
       Word documents, tracked changes, or professional document formatting."

Pattern 7: The Wrong Location


Symptom: "When to use this Skill" section in body, not in description
Root cause: Misunderstanding of three-layer loading
Fix: Move all triggering information to description field
     Body is only loaded AFTER triggering decision is made

Pattern 8: The Over-Engineered


Symptom: README.md, CHANGELOG.md, INSTALLATION_GUIDE.md, CONTRIBUTING.md
Root cause: Treating Skill like a software project
Fix: Delete all auxiliary files. Only include what Agent needs for the task.
     No documentation about the Skill itself.

Pattern 9: The Freedom Mismatch

Symptom: Rigid scripts for creative tasks, vague guidance for fragile operations
Root cause: Not considering task fragility
Fix: High freedom for creative tasks (principles, not steps)
     Low freedom for fragile operations (exact scripts, no parameters)

Quick Reference Checklist

┌─────────────────────────────────────────────────────────────────────────┐
│  SKILL EVALUATION QUICK CHECK                                           │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  KNOWLEDGE DELTA (most important):                                      │
│    [ ] No "What is X" explanations for basic concepts                   │
│    [ ] No step-by-step tutorials for standard operations                │
│    [ ] Has decision trees for non-obvious choices                       │
│    [ ] Has trade-offs only experts would know                           │
│    [ ] Has edge cases from real-world experience                        │
│                                                                         │
│  MINDSET + PROCEDURES:                                                  │
│    [ ] Transfers thinking patterns (how to think about problems)        │
│    [ ] Has "Before doing X, ask yourself..." frameworks                 │
│    [ ] Includes domain-specific procedures Claude wouldn't know         │
│    [ ] Distinguishes valuable procedures from generic ones              │
│                                                                         │
│  ANTI-PATTERNS:                                                         │
│    [ ] Has explicit NEVER list                                          │
│    [ ] Anti-patterns are specific, not vague                            │
│    [ ] Includes WHY (non-obvious reasons)                               │
│                                                                         │
│  SPECIFICATION (description is critical!):                              │
│    [ ] Valid YAML frontmatter                                           │
│    [ ] name: lowercase, ≤64 chars                                       │
│    [ ] description answers: WHAT does it do?                            │
│    [ ] description answers: WHEN should it be used?                     │
│    [ ] description contains trigger KEYWORDS                            │
│    [ ] description is specific enough for Agent to know when to use     │
│                                                                         │
│  STRUCTURE:                                                             │
│    [ ] SKILL.md < 500 lines (ideal < 300)                               │
│    [ ] Heavy content in references/                                     │
│    [ ] Loading triggers embedded in workflow                            │
│    [ ] Has "Do NOT Load" for preventing over-loading                    │
│                                                                         │
│  FREEDOM:                                                               │
│    [ ] Creative tasks → High freedom (principles)                       │
│    [ ] Fragile operations → Low freedom (exact scripts)                 │
│                                                                         │
│  USABILITY:                                                             │
│    [ ] Decision trees for multi-path scenarios                          │
│    [ ] Working code examples                                            │
│    [ ] Error handling and fallbacks                                     │
│    [ ] Edge cases covered                                               │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
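
The mechanical items in the SPECIFICATION and STRUCTURE groups can be checked by a script, leaving the judgment-based items (knowledge delta, mindset, anti-patterns) to a human or Agent reviewer. The sketch below is an illustration under stated assumptions: `parse_frontmatter` is a deliberately simplified stand-in for a real YAML parser and handles only flat `key: value` lines, and `quick_check` is a hypothetical name.

```python
def parse_frontmatter(skill_md: str) -> dict[str, str]:
    """Parse flat 'key: value' pairs between the leading '---' delimiters.

    Simplified on purpose: no nesting, no multi-line values, no quoting.
    """
    lines = skill_md.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return {}  # closing delimiter never found


def quick_check(skill_md: str) -> list[str]:
    """Return a list of failed mechanical checks; empty means all passed."""
    problems = []
    fields = parse_frontmatter(skill_md)
    if not fields:
        problems.append("frontmatter missing, empty, or unterminated")
    name = fields.get("name", "")
    if not name or name != name.lower() or len(name) > 64:
        problems.append("name must be lowercase and at most 64 characters")
    if not fields.get("description"):
        problems.append("description is empty")
    line_count = len(skill_md.splitlines())
    if line_count >= 500:
        problems.append(f"SKILL.md has {line_count} lines (limit: fewer than 500)")
    return problems
```

An empty result means only that the file clears the mechanical bar; it says nothing about whether the content captures expert-only knowledge.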

The Meta-Question

When evaluating any Skill, always return to this fundamental question:
"Would an expert in this domain, looking at this Skill, say: 'Yes, this captures knowledge that took me years to learn'?"
If the answer is yes → the Skill has genuine value. If the answer is no → it's compressing what Claude already knows.
The best Skills are compressed expert brains — they take a designer's 10 years of aesthetic accumulation and compress it into 43 lines, or a document expert's operational experience into a 200-line decision tree.
What gets compressed must be things Claude doesn't have. Otherwise, it's garbage compression.

Self-Evaluation Note

This Skill (skill-judge) should itself pass evaluation:
  • Knowledge Delta: Provides specific evaluation criteria Claude wouldn't generate on its own
  • Mindset: Shapes how to think about Skill quality, not just checklist items
  • Anti-Patterns: "NEVER Do When Evaluating" section with specific don'ts
  • Specification: Valid frontmatter with comprehensive description
  • Progressive Disclosure: Self-contained, no external references needed
  • Freedom: Medium freedom appropriate for evaluation task
  • Pattern: Follows Tool pattern with decision frameworks
  • Usability: Clear protocol, report template, quick reference
Evaluate this Skill against itself as a calibration exercise.