# Evaluation Methodology
This document is the authoritative reference for how PluginEval measures plugin and skill quality.
It covers the three evaluation layers, all ten scoring dimensions, the composite formula, badge
thresholds, anti-pattern flags, Elo ranking, and actionable improvement tips.
Related: Full rubric anchors
## The Three Evaluation Layers
PluginEval stacks three complementary layers. Each layer produces a score between 0.0 and 1.0 for
each applicable dimension, and later layers override or blend with earlier ones according to
per-dimension blend weights.
### Layer 1 — Static Analysis
Speed: < 2 seconds. No LLM calls. Deterministic.
The static analyzer (`layers/static.py`) runs six sub-checks directly against the parsed SKILL.md:

- Name presence, description length, trigger-phrase quality
- Output/input documentation, code block count, orchestrator anti-pattern
- Line count vs. sweet-spot (200–600 lines), `references/` and `assets/` bonuses
- Heading density, code blocks, examples section, troubleshooting section
- MUST/NEVER/ALWAYS density, duplicate-line repetition ratio
- Cross-references to other skills/agents, "related"/"see also" mentions
These six sub-checks feed directly into six of the ten final dimensions (via the `STATIC_TO_DIMENSION` mapping). The remaining four dimensions — `output_quality`, `scope_calibration`, `robustness`, and part of `triggering_accuracy` — receive no static contribution and rely entirely on Layer 2 and/or Layer 3.

The anti-pattern penalty is applied multiplicatively to the Layer 1 score:

```
penalty = max(0.5, 1.0 − 0.05 × anti_pattern_count)
```

Each additional detected anti-pattern reduces the score by 5%, flooring at 50%.
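For intuition, the penalty curve can be tabulated with a few lines of Python (a sketch mirroring the formula above, not the analyzer's actual code):

```python
def anti_pattern_penalty(count: int) -> float:
    """Multiplicative Layer 1 penalty: 5% per detected flag, floored at 50%."""
    return max(0.5, 1.0 - 0.05 * count)

# 0 flags keeps the full score; 10 or more flags hit the 0.5 floor
for n in (0, 1, 3, 10, 20):
    print(n, anti_pattern_penalty(n))
```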
### Layer 2 — LLM Judge
Speed: 30–90 seconds. One or more LLM calls (Sonnet by default). Non-deterministic.
The `eval-judge` agent reads the SKILL.md and any `references/` files, then scores four dimensions using anchored rubrics (see references/rubrics.md):

- Triggering accuracy — F1 score derived from 10 mental test prompts
- Orchestration fitness — Worker purity assessment (0–1 rubric)
- Output quality — Simulates 3 realistic tasks; assesses instruction quality
- Scope calibration — Judges depth and breadth relative to the skill's category

The judge returns a structured JSON object (no markdown fences) that the eval engine merges into the composite. When `judges > 1`, scores are averaged and Cohen's kappa is reported as an inter-judge agreement metric.
### Layer 3 — Monte Carlo Simulation
Speed: 5–20 minutes. N=50 simulated Agent SDK invocations (default). Statistical.
Monte Carlo runs `N` real prompts through the skill and records:

- Activation rate — Fraction of prompts that triggered the skill
- Output consistency — Coefficient of variation (CV) across quality scores
- Failure rate — Error/crash fraction with Clopper-Pearson exact CIs
- Token efficiency — Median token count, IQR, outlier count

The Layer 3 composite formula:

```
mc_score = 0.40 × activation_rate
         + 0.30 × (1 − min(1.0, CV))
         + 0.20 × (1 − failure_rate)
         + 0.10 × efficiency_norm
```

where `efficiency_norm = max(0, 1 − median_tokens / 8000)`.
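The formula translates line-for-line into Python (a sketch; the inputs are assumed to be the already-collected simulation statistics):

```python
def mc_score(activation_rate: float, cv: float,
             failure_rate: float, median_tokens: int) -> float:
    """Layer 3 composite, per the formula above."""
    efficiency_norm = max(0.0, 1.0 - median_tokens / 8000)
    return (0.40 * activation_rate
            + 0.30 * (1.0 - min(1.0, cv))
            + 0.20 * (1.0 - failure_rate)
            + 0.10 * efficiency_norm)

# e.g. 90% activation, CV 0.2, 4% failures, 3200 median tokens
print(round(mc_score(0.90, 0.2, 0.04, 3200), 3))
```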
## Composite Scoring Formula
The final score is a weighted blend across all three layers for each dimension, then summed:

```
composite = Σ(dimension_weight × blended_dimension_score) × 100 × anti_pattern_penalty
```

### Dimension Weights
| Dimension | Weight | Why it matters |
|---|---|---|
| `triggering_accuracy` | 0.25 | A skill that never fires — or fires incorrectly — has no value |
| `orchestration_fitness` | 0.20 | Skills must be pure workers; supervisor logic belongs in agents |
| `output_quality` | 0.15 | Correct, complete output is the primary deliverable |
| `scope_calibration` | 0.12 | Neither a stub nor a bloated monster |
| `progressive_disclosure` | 0.10 | SKILL.md is lean; detail lives in references/ |
| `token_efficiency` | 0.06 | Minimal context waste per invocation |
| `robustness` | 0.05 | Handles edge cases without crashing |
| `structural_completeness` | 0.03 | Correct sections in the right order |
| `code_template_quality` | 0.02 | Working, copy-paste-ready examples |
| `ecosystem_coherence` | 0.02 | Cross-references; no duplication with siblings |
### Layer Blend Weights
Each dimension draws from different layers at different ratios. With all three layers active (`--depth deep` or `certify`):

| Dimension | Static | Judge | Monte Carlo |
|---|---|---|---|
| `triggering_accuracy` | 0.15 | 0.25 | 0.60 |
| `orchestration_fitness` | 0.10 | 0.70 | 0.20 |
| `output_quality` | 0.00 | 0.40 | 0.60 |
| `scope_calibration` | 0.30 | 0.55 | 0.15 |
| `progressive_disclosure` | 0.80 | 0.20 | 0.00 |
| `token_efficiency` | 0.40 | 0.10 | 0.50 |
| `robustness` | 0.00 | 0.20 | 0.80 |
| `structural_completeness` | 0.90 | 0.10 | 0.00 |
| `code_template_quality` | 0.30 | 0.70 | 0.00 |
| `ecosystem_coherence` | 0.85 | 0.15 | 0.00 |

At `--depth standard` (static + judge only), blends are renormalized to drop the Monte Carlo column. At `--depth quick` (static only), all weight falls on Layer 1.
### Blended Score Calculation
For a given depth, the blended score for dimension `d` is:

```
blended[d] = Σ( layer_weight[d][layer] × layer_score[d][layer] )
             ─────────────────────────────────────────────────────
             Σ( layer_weight[d][layer] for available layers )
```

This normalization ensures that skipping Monte Carlo at standard depth doesn't artificially deflate scores.
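The renormalization is easy to replicate (a sketch; the layer names and example weights below are illustrative):

```python
def blended(layer_weights: dict, layer_scores: dict) -> float:
    """Blend the available layer scores for one dimension,
    renormalizing weights over the layers that actually ran."""
    ran = [layer for layer in layer_weights if layer in layer_scores]
    total = sum(layer_weights[layer] for layer in ran)
    return sum(layer_weights[layer] * layer_scores[layer] for layer in ran) / total

weights = {"static": 0.15, "judge": 0.25, "monte_carlo": 0.60}
print(blended(weights, {"static": 0.8, "judge": 0.7, "monte_carlo": 0.9}))  # deep: all layers
print(blended(weights, {"static": 0.8, "judge": 0.7}))  # standard: renormalized over 0.40
```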
## Interpreting Dimension Scores
Each dimension score is a float in `[0.0, 1.0]`. The CLI converts it to a letter grade:

| Grade | Score range | Meaning |
|---|---|---|
| A | 0.90 – 1.00 | Excellent — no meaningful improvement needed |
| B | 0.80 – 0.89 | Good — minor gaps only |
| C | 0.70 – 0.79 | Adequate — one or two clear improvement areas |
| D | 0.60 – 0.69 | Marginal — needs targeted work |
| F | < 0.60 | Failing — significant remediation required |
When reading a report, focus first on the lowest-graded dimension that has the highest weight. A D in `triggering_accuracy` (weight 0.25) costs far more than a D in `ecosystem_coherence` (weight 0.02).

Confidence intervals appear in the report when Layer 2 or Layer 3 ran. Narrow CIs (± < 5 points) indicate stable scores. Wide CIs suggest inconsistency — often caused by an ambiguous description or instructions that work for some prompt styles but not others.
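The banding in the table can be expressed as a small helper (a sketch, not the CLI's actual implementation):

```python
def letter_grade(score: float) -> str:
    """Map a dimension score in [0.0, 1.0] to the letter grades above."""
    for grade, floor in (("A", 0.90), ("B", 0.80), ("C", 0.70), ("D", 0.60)):
        if score >= floor:
            return grade
    return "F"

print(letter_grade(0.65))  # D
print(letter_grade(0.92))  # A
```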
## Quality Badges
Badges require both a composite score threshold AND an Elo threshold (when Elo is available). The `Badge.from_scores()` logic checks composite first, then Elo if provided:

| Badge | Composite | Elo | Meaning |
|---|---|---|---|
| Platinum ★★★★★ | ≥ 90 | ≥ 1600 | Reference quality — suitable for gold corpus |
| Gold ★★★★ | ≥ 80 | ≥ 1500 | Production ready |
| Silver ★★★ | ≥ 70 | ≥ 1400 | Functional, has improvement opportunities |
| Bronze ★★ | ≥ 60 | ≥ 1300 | Minimum viable — not yet recommended for users |
| — | < 60 | any | Does not meet minimum bar |
The Elo threshold is skipped when Elo has not been computed (i.e., at quick or standard depth without `certify`). A skill can earn a badge on composite score alone in those cases.
## Anti-Pattern Flags
The static analyzer detects five anti-patterns. Each carries a severity multiplier that feeds
into the penalty formula.
### OVER_CONSTRAINED
Trigger: More than 15 occurrences of MUST, ALWAYS, or NEVER in the SKILL.md.
Problem: Overly prescriptive instructions reduce model flexibility, increase token overhead,
and signal that the author is trying to micromanage every output rather than providing
principled guidance.
Fix: Audit every MUST/ALWAYS/NEVER. Replace directive language with explanatory framing
where possible. Reserve hard constraints for genuine safety or correctness requirements. Target
fewer than 10 such directives per 100 lines.
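A quick audit of directive density can be scripted (a sketch; the real analyzer's counting rules may differ):

```python
import re

def directive_density(skill_md: str) -> tuple[int, float]:
    """Count MUST/ALWAYS/NEVER occurrences and report the rate per 100 lines."""
    count = len(re.findall(r"\b(?:MUST|ALWAYS|NEVER)\b", skill_md))
    lines = max(1, skill_md.count("\n") + 1)
    return count, 100.0 * count / lines

# the flag fires above 15 total; the fix targets < 10 per 100 lines
```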
### EMPTY_DESCRIPTION
Trigger: The frontmatter `description` field is fewer than 20 characters after stripping.

Problem: Without a meaningful description, the Claude Code plugin system cannot determine when to invoke the skill. The skill becomes invisible to autonomous invocation.
Fix: Write a description of at least 60–120 characters that includes:
- A "Use this skill when..." or "Use when..." trigger clause
- Two or more concrete contexts separated by commas or "or"
### MISSING_TRIGGER
Trigger: The description does not contain "use when", "use this skill when",
"use proactively", or "trigger when" (case-insensitive).
Problem: Even a long description is useless for autonomous invocation if it doesn't
include a clear trigger signal. The system's routing model needs an explicit cue.
Fix: Prepend "Use this skill when..." to the description, followed by specific scenarios.
Example: "Use this skill when measuring plugin quality, interpreting score reports, or
explaining badge thresholds to a team."
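You can self-check a candidate description against the same phrases the flag looks for (a sketch):

```python
import re

TRIGGER_RE = re.compile(
    r"use when|use this skill when|use proactively|trigger when",
    re.IGNORECASE,
)

def has_trigger_clause(description: str) -> bool:
    """True if the description contains a recognized trigger phrase."""
    return bool(TRIGGER_RE.search(description))

print(has_trigger_clause("Use this skill when measuring plugin quality"))  # True
print(has_trigger_clause("A detailed helper for plugin quality"))          # False
```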
### BLOATED_SKILL
Trigger: SKILL.md exceeds 800 lines AND the skill has no `references/` directory.

Problem: A monolithic SKILL.md forces the entire document into context on every invocation, wasting tokens on content only needed in edge cases.

Fix: Create a `references/` directory and move supporting material there:

- Detailed rubrics → references/rubrics.md
- Extended examples → references/examples.md
- Configuration reference → references/config.md

The SKILL.md should link to these files with `[text](references/filename.md)` so the model can fetch them on demand.
### ORPHAN_REFERENCE
Trigger: SKILL.md contains a markdown link `[text](references/filename)` where `filename` does not exist in the `references/` directory.

Problem: Dead links waste tokens on context that will never resolve and confuse the model.

Fix: Either create the missing reference file or remove the dead link.
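Orphans are easy to detect before the analyzer does (a sketch, assuming the standard skill layout of a SKILL.md beside a references/ directory):

```python
import re
from pathlib import Path

LINK_RE = re.compile(r"\[[^\]]*\]\((references/[^)]+)\)")

def orphan_references(skill_dir: str) -> list[str]:
    """List references/ links in SKILL.md whose target file does not exist."""
    root = Path(skill_dir)
    text = (root / "SKILL.md").read_text(encoding="utf-8")
    return [link for link in LINK_RE.findall(text)
            if not (root / link).is_file()]
```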
### DEAD_CROSS_REF
Trigger: SKILL.md references another skill or agent by relative path and that path
cannot be resolved from the skills/ directory.
Problem: Broken ecosystem links undermine the plugin's coherence score and may cause
the model to attempt navigation to non-existent files.
Fix: Verify the referenced skill exists. Update the path or remove the reference.
## Elo Ranking
PluginEval uses an Elo/Bradley-Terry rating system to rank a skill against the gold corpus.

Starting rating: 1500 (the corpus median by convention). K-factor: 32 (standard for moderate-stakes ratings).

Expected score formula (standard Elo):

```
E(A vs B) = 1 / (1 + 10^((B_rating − A_rating) / 400))
```

Rating update after each matchup:

```
new_rating = old_rating + 32 × (actual_score − expected_score)
```

where `actual_score` is 1.0 for a win, 0.5 for a draw, 0.0 for a loss.

Confidence intervals are computed via 500-sample bootstrap, reported as 95% CI. Corpus percentile reflects pairwise win rate against the gold corpus. Position bias check: pairs are evaluated in both orders; disagreements are flagged.

The `plugin-eval init` command builds the corpus index from a plugins directory:

```bash
plugin-eval init ./plugins --corpus-dir ~/.plugineval/corpus
```
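Both formulas translate directly into Python (a sketch):

```python
def expected_score(a_rating: float, b_rating: float) -> float:
    """P(A beats B) under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((b_rating - a_rating) / 400))

def update_rating(rating: float, actual: float, expected: float, k: int = 32) -> float:
    """Post-matchup rating; actual is 1.0 win / 0.5 draw / 0.0 loss."""
    return rating + k * (actual - expected)

e = expected_score(1500, 1500)      # evenly matched skills: 0.5
print(update_rating(1500, 1.0, e))  # the winner moves to 1516.0
```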
## CLI Reference
### Score a skill (quick static analysis only)
```bash
plugin-eval score ./path/to/skill --depth quick
```

Returns Layer 1 results in < 2 seconds. Useful for fast feedback during authoring.
### Score with LLM judge (default)
```bash
plugin-eval score ./path/to/skill
```

Runs static + LLM judge (standard depth). Takes 30–90 seconds.
### Score with full output as JSON
```bash
plugin-eval score ./path/to/skill --output json
```

Emits structured JSON including `composite.score`, `composite.dimensions`, and `layers[0].anti_patterns`. Suitable for CI integration:

```bash
plugin-eval score ./path/to/skill --depth quick --output json --threshold 70
# exits with code 1 if score < 70
```
### Full certification (all three layers + Elo)
```bash
plugin-eval certify ./path/to/skill
```

Runs static + LLM judge + Monte Carlo (50 simulations) + Elo ranking. Takes 15–20 minutes. Assigns a quality badge. Use before publishing a skill to the marketplace.
### Head-to-head comparison
```bash
plugin-eval compare ./skill-a ./skill-b
```

Evaluates both skills at quick depth and prints a dimension-by-dimension comparison table. Useful for deciding between two implementations or measuring improvement before/after a rewrite.
### Initialize corpus for Elo
```bash
plugin-eval init ./plugins
```

Builds the local corpus index at `~/.plugineval/corpus`. Required before Elo ranking works.
## Scripting the Composite Formula
Reproduce the composite score offline (pre-commit hook, CI gate):

```python
def composite_score(dimension_scores: dict, anti_pattern_count: int = 0) -> float:
    """Replicate the PluginEval composite formula."""
    WEIGHTS = {
        "triggering_accuracy":     0.25,
        "orchestration_fitness":   0.20,
        "output_quality":          0.15,
        "scope_calibration":       0.12,
        "progressive_disclosure":  0.10,
        "token_efficiency":        0.06,
        "robustness":              0.05,
        "structural_completeness": 0.03,
        "code_template_quality":   0.02,
        "ecosystem_coherence":     0.02,
    }
    raw = sum(WEIGHTS[d] * s for d, s in dimension_scores.items())
    penalty = max(0.5, 1.0 - 0.05 * anti_pattern_count)
    return round(raw * 100 * penalty, 2)
```
### Example: a skill with a weak triggering score

```python
scores = {
    "triggering_accuracy": 0.65,   # D — needs description work
    "orchestration_fitness": 0.85,
    "output_quality": 0.80,
    # … fill in remaining 7 dimensions …
}
```
```python
composite_score(scores, anti_pattern_count=1)   # → ~76.5
```
## JSON Output Format
Top-level shape of `--output json`:

```json
{
  "composite": { "score": 76.5, "badge": "Silver", "elo": null },
  "dimensions": {
    "triggering_accuracy": { "score": 0.65, "grade": "D", "ci_low": 0.60, "ci_high": 0.70 },
    "orchestration_fitness": { "score": 0.85, "grade": "B", "ci_low": 0.80, "ci_high": 0.90 }
  },
  "layers": [
    { "name": "static", "duration_ms": 1243, "anti_patterns": ["OVER_CONSTRAINED"] },
    { "name": "judge", "duration_ms": 48200, "judges": 1, "kappa": null }
  ]
}
```

Parse `composite.score` in CI to gate deployments:

```bash
score=$(plugin-eval score ./my-skill --output json | python3 -c "import sys,json; print(json.load(sys.stdin)['composite']['score'])")
if (( $(echo "$score < 70" | bc -l) )); then
  echo "Quality gate failed: score $score < 70"
  exit 1
fi
```
{
"composite": { "score": 76.5, "badge": "Silver", "elo": null },
"dimensions": {
"triggering_accuracy": { "score": 0.65, "grade": "D", "ci_low": 0.60, "ci_high": 0.70 },
"orchestration_fitness": { "score": 0.85, "grade": "B", "ci_low": 0.80, "ci_high": 0.90 }
},
"layers": [
{ "name": "static", "duration_ms": 1243, "anti_patterns": ["OVER_CONSTRAINED"] },
{ "name": "judge", "duration_ms": 48200, "judges": 1, "kappa": null }
]
}在CI中解析以实现部署门禁:
composite.scorebash
score=$(plugin-eval score ./my-skill --output json | python3 -c "import sys,json; print(json.load(sys.stdin)['composite']['score'])")
if (( $(echo "$score < 70" | bc -l) )); then
echo "Quality gate failed: score $score < 70"
exit 1
## Tips for Improving a Skill's Score
Work through dimensions in weight order. The largest gains come from fixing the top-weighted
dimensions first.
### Which Dimension to Improve First
Use this table when a score report shows multiple D/F grades and you need to prioritize effort.
| Dimension | Weight | Typical fix effort | Score impact / hour | Fix first if… |
|---|---|---|---|---|
| `triggering_accuracy` | 0.25 | Low — description rewrite | High | Score < 70 overall |
| `orchestration_fitness` | 0.20 | Medium — restructure sections | High | Skill mixes worker + supervisor logic |
| `output_quality` | 0.15 | Medium — add examples | Medium | Judge score < 0.70 |
| `scope_calibration` | 0.12 | Low — move content to references/ | Medium | File is < 100 or > 800 lines |
| `progressive_disclosure` | 0.10 | Low — create references/ dir | Medium | No references/ directory exists |
| `token_efficiency` | 0.06 | Low — reduce MUST/ALWAYS/NEVER | Low | Anti-pattern count ≥ 3 |
| `robustness` | 0.05 | Low — add Troubleshooting section | Low | No edge-case handling documented |
| `structural_completeness` | 0.03 | Very low — add headings/code blocks | Low | Fewer than 4 H2 headings |
| `code_template_quality` | 0.02 | Very low — add language tags | Very low | Code blocks missing language tags |
| `ecosystem_coherence` | 0.02 | Very low — add Related section | Very low | No cross-references at all |
Rule of thumb: Fix `triggering_accuracy` before anything else — at weight 0.25 it delivers more composite-score gain per hour than all low-weight dimensions combined.
### Triggering Accuracy (weight 0.25)
- Include "Use this skill when..." followed by 3–4 comma-separated specific contexts.
- Add "proactively" if the skill should auto-activate without an explicit user request.
- Mental test: write 5 prompts that should trigger it and 5 that should not — does your description discriminate? If not, add or tighten the context phrases.
### Orchestration Fitness (weight 0.20)
- Document what the skill receives and what it returns — not what it orchestrates.
- Avoid "orchestrate", "coordinate", "dispatch", "manage workflow" in SKILL.md.
- Include an "Output format" section and 2+ code blocks showing concrete worker behavior.
### Output Quality (weight 0.15)
- Give specific, actionable instructions — not just goals.
- Cover at least one edge case explicitly (empty input, malformed data, etc.).
- Include an examples section showing representative inputs and expected outputs.
- The more concrete the instructions, the higher the judge will score this dimension.
### Scope Calibration (weight 0.12)
- Target 200–600 lines. Below 100 is a stub; above 800 without `references/` is bloat.
- Move background reading, extended examples, and reference tables to `references/`.
- Very narrow skills should be merged with a sibling; very broad ones should be split.
### Progressive Disclosure (weight 0.10)
- Add a `references/` directory (earns a 0.15–0.25 bonus) and keep SKILL.md focused on the execution path. An `assets/` directory adds a further bonus.
### Token Efficiency (weight 0.06)
- Audit MUST/ALWAYS/NEVER count. Target < 1 per 10 lines.
- Consolidate near-duplicate bullet points and repeated-structure tables.
### Robustness (weight 0.05)
- Add a "Troubleshooting" or "Edge Cases" section covering at least 3 failure modes.
- State what the skill returns when it cannot complete its task.
### Structural Completeness (weight 0.03)
- Ensure at least 4 H2/H3 headings, 3 code blocks, an Examples section, and a Troubleshooting section.
### Code Template Quality (weight 0.02)
- All code blocks must be syntactically valid and copy-paste ready with language tags.
### Ecosystem Coherence (weight 0.02)
- Add a "## Related" section listing sibling skills or agents with relative paths.
- Avoid duplicating content that already exists in another skill — link to it instead.
## Troubleshooting
"Score is much lower than expected after adding content"
The anti-pattern penalty compounds. Run with `--output json` and inspect `layers[0].anti_patterns`. If you have 5+ anti-patterns, the multiplier can reduce your score to 75% of its raw value regardless of how good the content is. Fix the flags first.
--output jsonlayers[0].anti_patterns"triggering_accuracy is low despite a detailed description"
The `_description_pushiness` scorer looks for specific syntactic patterns, not just length. Verify your description contains the phrase "Use this skill when" or "Use when" (exact phrasing matters — it's a regex match). Also check that you have multiple use cases separated by commas or "or" to earn the specificity bonus.
_description_pushiness_description_pushiness"LLM judge scores vary significantly between runs"
This is expected for ambiguous skills. The judge generates 10 mental test prompts non-deterministically. Improve score stability by tightening the description and adding concrete examples. When `judges > 1`, averaged scores will be more stable. Use `certify` with `--depth deep`, which runs Monte Carlo to get statistically-bounded scores.
judges > 1--depth deepcertify对于模糊的Skill,这种情况是正常的。评估器会非确定性地生成10个测试提示。通过收紧描述和添加具体示例可提升分数稳定性。当时,平均分数会更稳定。使用模式的命令,会运行蒙特卡洛模拟以获得有统计边界的分数。
judges > 1--depth deepcertify"progressive_disclosure score is low even though the file is the right length"
Check whether the file is in the 200–600 line sweet spot. Files shorter than 100 lines score only 0.20 on this sub-check. Also confirm that `references/` files are not empty — the scorer checks for non-empty reference files, not just the directory.
references/检查文件是否处于200-600行的最优区间。少于100行的文件在此子检查中仅得0.20分。同时确认目录下的文件非空——评分器会检查非空的参考文件,而非仅目录存在。
references/"compare shows my rewrite scores lower than the original"
Quick depth (`--depth quick`) only runs static analysis. If the rewrite moved content to `references/` and shortened SKILL.md significantly, static scores for structural completeness may drop even though overall quality improved. Run `--depth standard` for a fairer comparison that includes the LLM judge's assessment of content quality.
--depth quickreferences/--depth standardQuick深度()仅运行静态分析。若重写将内容移至并大幅缩短SKILL.md,结构完整性的静态分数可能会下降,尽管整体质量有所提升。使用模式进行更公平的比较,该模式会包含LLM评估器对内容质量的评估。
## References
- Full Rubric Anchors — all 4 judge dimensions
## Related Agents
- eval-judge (`../../agents/eval-judge.md`) — the LLM judge that scores the Layer 2 dimensions (`triggering_accuracy`, `orchestration_fitness`, `output_quality`, `scope_calibration`). Invoke directly when you need to re-run only the judge layer or inspect its reasoning.
- eval-orchestrator (`../../agents/eval-orchestrator.md`) — the top-level orchestrator that sequences all three layers, merges results, assigns badges, and writes the final report. Invoke when running a full certification pass or comparing two skills head-to-head.