
# Evaluation Methodology

This document is the authoritative reference for how PluginEval measures plugin and skill quality. It covers the three evaluation layers, all ten scoring dimensions, the composite formula, badge thresholds, anti-pattern flags, Elo ranking, and actionable improvement tips.
Related: Full rubric anchors


## The Three Evaluation Layers

PluginEval stacks three complementary layers. Each layer produces a score between 0.0 and 1.0 for each applicable dimension, and later layers override or blend with earlier ones according to per-dimension blend weights.

### Layer 1 — Static Analysis

**Speed:** < 2 seconds. No LLM calls. Deterministic.

The static analyzer (`layers/static.py`) runs six sub-checks directly against the parsed SKILL.md:

| Sub-check | What it measures |
|---|---|
| `frontmatter_quality` | Name presence, description length, trigger-phrase quality |
| `orchestration_wiring` | Output/input documentation, code block count, orchestrator anti-pattern |
| `progressive_disclosure` | Line count vs. sweet spot (200–600 lines), references/ and assets/ bonuses |
| `structural_completeness` | Heading density, code blocks, examples section, troubleshooting section |
| `token_efficiency` | MUST/NEVER/ALWAYS density, duplicate-line repetition ratio |
| `ecosystem_coherence` | Cross-references to other skills/agents, "related"/"see also" mentions |

These six sub-checks feed directly into six of the ten final dimensions (via the `STATIC_TO_DIMENSION` mapping). The remaining four dimensions — `output_quality`, `scope_calibration`, `robustness`, and part of `triggering_accuracy` — receive no static contribution and rely entirely on Layer 2 and/or Layer 3.

The anti-pattern penalty is applied multiplicatively to the Layer 1 score:

```
penalty = max(0.5, 1.0 − 0.05 × anti_pattern_count)
```

Each additional detected anti-pattern reduces the score by 5%, flooring at 50%.
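As a quick sanity check, the penalty formula can be reproduced in a couple of lines (a minimal sketch of the formula above, not the analyzer's actual code):

```python
def anti_pattern_penalty(count: int) -> float:
    """Multiplicative Layer 1 penalty: each flag costs 5%, floored at 50%."""
    return max(0.5, 1.0 - 0.05 * count)
```

Three flags yield a 0.85 multiplier; ten or more flags hit the 0.5 floor.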

### Layer 2 — LLM Judge

**Speed:** 30–90 seconds. One or more LLM calls (Sonnet by default). Non-deterministic.

The `eval-judge` agent reads the SKILL.md and any `references/` files, then scores four dimensions using anchored rubrics (see references/rubrics.md):

1. **Triggering accuracy** — F1 score derived from 10 mental test prompts
2. **Orchestration fitness** — Worker purity assessment (0–1 rubric)
3. **Output quality** — Simulates 3 realistic tasks; assesses instruction quality
4. **Scope calibration** — Judges depth and breadth relative to the skill's category

The judge returns a structured JSON object (no markdown fences) that the eval engine merges into the composite. When `judges > 1`, scores are averaged and Cohen's kappa is reported as an inter-judge agreement metric.

### Layer 3 — Monte Carlo Simulation

**Speed:** 5–20 minutes. N=50 simulated Agent SDK invocations (default). Statistical.

Monte Carlo runs `N` real prompts through the skill and records:

- **Activation rate** — Fraction of prompts that triggered the skill
- **Output consistency** — Coefficient of variation (CV) across quality scores
- **Failure rate** — Error/crash fraction with Clopper-Pearson exact CIs
- **Token efficiency** — Median token count, IQR, outlier count

The Layer 3 composite formula:

```
mc_score = 0.40 × activation_rate
         + 0.30 × (1 − min(1.0, CV))
         + 0.20 × (1 − failure_rate)
         + 0.10 × efficiency_norm
```

where `efficiency_norm = max(0, 1 − median_tokens / 8000)`.
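The formula translates directly into code; this sketch mirrors the weights above and is illustrative, not the engine's implementation:

```python
def mc_score(activation_rate: float, cv: float,
             failure_rate: float, median_tokens: float) -> float:
    """Layer 3 composite: activation, consistency, reliability, efficiency."""
    efficiency_norm = max(0.0, 1.0 - median_tokens / 8000)
    return (0.40 * activation_rate
            + 0.30 * (1.0 - min(1.0, cv))
            + 0.20 * (1.0 - failure_rate)
            + 0.10 * efficiency_norm)
```

A skill that activates on half the prompts with CV 0.2, a 10% failure rate, and a 4000-token median lands at 0.67.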


## Composite Scoring Formula

The final score is a weighted blend across all three layers for each dimension, then summed:

```
composite = Σ(dimension_weight × blended_dimension_score) × 100 × anti_pattern_penalty
```

### Dimension Weights

| Dimension | Weight | Why it matters |
|---|---|---|
| `triggering_accuracy` | 0.25 | A skill that never fires — or fires incorrectly — has no value |
| `orchestration_fitness` | 0.20 | Skills must be pure workers; supervisor logic belongs in agents |
| `output_quality` | 0.15 | Correct, complete output is the primary deliverable |
| `scope_calibration` | 0.12 | Neither a stub nor a bloated monster |
| `progressive_disclosure` | 0.10 | SKILL.md is lean; detail lives in references/ |
| `token_efficiency` | 0.06 | Minimal context waste per invocation |
| `robustness` | 0.05 | Handles edge cases without crashing |
| `structural_completeness` | 0.03 | Correct sections in the right order |
| `code_template_quality` | 0.02 | Working, copy-paste-ready examples |
| `ecosystem_coherence` | 0.02 | Cross-references; no duplication with siblings |

### Layer Blend Weights

Each dimension draws from different layers at different ratios. With all three layers active (`--depth deep` or `certify`):

| Dimension | Static | Judge | Monte Carlo |
|---|---|---|---|
| `triggering_accuracy` | 0.15 | 0.25 | 0.60 |
| `orchestration_fitness` | 0.10 | 0.70 | 0.20 |
| `output_quality` | 0.00 | 0.40 | 0.60 |
| `scope_calibration` | 0.30 | 0.55 | 0.15 |
| `progressive_disclosure` | 0.80 | 0.20 | 0.00 |
| `token_efficiency` | 0.40 | 0.10 | 0.50 |
| `robustness` | 0.00 | 0.20 | 0.80 |
| `structural_completeness` | 0.90 | 0.10 | 0.00 |
| `code_template_quality` | 0.30 | 0.70 | 0.00 |
| `ecosystem_coherence` | 0.85 | 0.15 | 0.00 |

At `--depth standard` (static + judge only), blends are renormalized to drop the Monte Carlo column. At `--depth quick` (static only), all weight falls on Layer 1.

### Blended Score Calculation

For a given depth, the blended score for dimension `d` is:

```
blended[d] = Σ( layer_weight[d][layer] × layer_score[d][layer] )
             ─────────────────────────────────────────────────────
             Σ( layer_weight[d][layer] for available layers )
```

This normalization ensures that skipping Monte Carlo at standard depth doesn't artificially deflate scores.
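The renormalization can be sketched as follows; the layer names here are illustrative, as the real engine keys off its own layer objects:

```python
def blended(layer_weights: dict, layer_scores: dict) -> float:
    """Blend one dimension's layer scores, renormalizing over available layers."""
    available = [layer for layer in layer_weights if layer in layer_scores]
    total_weight = sum(layer_weights[layer] for layer in available)
    return sum(layer_weights[layer] * layer_scores[layer]
               for layer in available) / total_weight
```

For `triggering_accuracy` at standard depth (no Monte Carlo), the 0.15/0.25 static/judge weights renormalize to 0.375/0.625.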


## Interpreting Dimension Scores

Each dimension score is a float in `[0.0, 1.0]`. The CLI converts it to a letter grade:

| Grade | Score range | Meaning |
|---|---|---|
| A | 0.90 – 1.00 | Excellent — no meaningful improvement needed |
| B | 0.80 – 0.89 | Good — minor gaps only |
| C | 0.70 – 0.79 | Adequate — one or two clear improvement areas |
| D | 0.60 – 0.69 | Marginal — needs targeted work |
| F | < 0.60 | Failing — significant remediation required |
When reading a report, focus first on the lowest-graded dimension that has the highest weight. A D in `triggering_accuracy` (weight 0.25) costs far more than a D in `ecosystem_coherence` (weight 0.02).
Confidence intervals appear in the report when Layer 2 or Layer 3 ran. Narrow CIs (± < 5 points) indicate stable scores. Wide CIs suggest inconsistency — often caused by an ambiguous description or instructions that work for some prompt styles but not others.
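The grade boundaries are easy to mirror in a report-processing script. This is a sketch of the table above, not the CLI's code:

```python
def letter_grade(score: float) -> str:
    """Map a [0.0, 1.0] dimension score to the CLI's letter grade."""
    if score >= 0.90:
        return "A"
    if score >= 0.80:
        return "B"
    if score >= 0.70:
        return "C"
    if score >= 0.60:
        return "D"
    return "F"
```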


## Quality Badges

Badges require both a composite score threshold AND an Elo threshold (when Elo is available). The `Badge.from_scores()` logic checks composite first, then Elo if provided:

| Badge | Composite | Elo | Meaning |
|---|---|---|---|
| Platinum ★★★★★ | ≥ 90 | ≥ 1600 | Reference quality — suitable for gold corpus |
| Gold ★★★★ | ≥ 80 | ≥ 1500 | Production ready |
| Silver ★★★ | ≥ 70 | ≥ 1400 | Functional, has improvement opportunities |
| Bronze ★★ | ≥ 60 | ≥ 1300 | Minimum viable — not yet recommended for users |
| — | < 60 | any | Does not meet minimum bar |

The Elo threshold is skipped when Elo has not been computed (i.e., at quick or standard depth without `certify`). A skill can earn a badge on composite score alone in those cases.


## Anti-Pattern Flags

The static analyzer detects six anti-patterns. Each carries a severity multiplier that feeds into the penalty formula.

### OVER_CONSTRAINED

**Trigger:** More than 15 occurrences of MUST, ALWAYS, or NEVER in the SKILL.md.

**Problem:** Overly prescriptive instructions reduce model flexibility, increase token overhead, and signal that the author is trying to micromanage every output rather than providing principled guidance.

**Fix:** Audit every MUST/ALWAYS/NEVER. Replace directive language with explanatory framing where possible. Reserve hard constraints for genuine safety or correctness requirements. Target fewer than 10 such directives per 100 lines.

### EMPTY_DESCRIPTION

**Trigger:** The frontmatter `description` field is fewer than 20 characters after stripping.

**Problem:** Without a meaningful description, the Claude Code plugin system cannot determine when to invoke the skill. The skill becomes invisible to autonomous invocation.

**Fix:** Write a description of at least 60–120 characters that includes:

- A "Use this skill when..." or "Use when..." trigger clause
- Two or more concrete contexts separated by commas or "or"

### MISSING_TRIGGER

**Trigger:** The description does not contain "use when", "use this skill when", "use proactively", or "trigger when" (case-insensitive).

**Problem:** Even a long description is useless for autonomous invocation if it doesn't include a clear trigger signal. The system's routing model needs an explicit cue.

**Fix:** Prepend "Use this skill when..." to the description, followed by specific scenarios. Example: "Use this skill when measuring plugin quality, interpreting score reports, or explaining badge thresholds to a team."
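A quick pre-commit check for the trigger clause might look like this; the phrase list comes from the trigger condition above, while the function name is my own:

```python
import re

# The four phrases the analyzer looks for, matched case-insensitively.
TRIGGER_RE = re.compile(
    r"use when|use this skill when|use proactively|trigger when",
    re.IGNORECASE,
)

def has_trigger_clause(description: str) -> bool:
    """True if the description contains one of the recognized trigger phrases."""
    return bool(TRIGGER_RE.search(description))
```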

### BLOATED_SKILL

**Trigger:** SKILL.md exceeds 800 lines AND the skill has no `references/` directory.

**Problem:** A monolithic SKILL.md forces the entire document into context on every invocation, wasting tokens on content only needed in edge cases.

**Fix:** Create a `references/` directory and move supporting material there:

- Detailed rubrics → `references/rubrics.md`
- Extended examples → `references/examples.md`
- Configuration reference → `references/config.md`

The SKILL.md should link to these files with `[text](references/filename.md)` so the model can fetch them on demand.

### ORPHAN_REFERENCE

**Trigger:** SKILL.md contains a markdown link `[text](references/filename)` where `filename` does not exist in the `references/` directory.

**Problem:** Dead links waste tokens on context that will never resolve and confuse the model.

**Fix:** Either create the missing reference file or remove the dead link.
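You can catch orphan references before the analyzer does with a short script. This is a sketch: the link regex is simplified and assumes standard markdown links under `references/`:

```python
import re
from pathlib import Path

# Captures the filename in [text](references/filename) links.
LINK_RE = re.compile(r"\[[^\]]*\]\(references/([^)]+)\)")

def orphan_references(skill_md: str, skill_dir: Path) -> list:
    """Return references/ filenames that are linked but missing on disk."""
    return [name for name in LINK_RE.findall(skill_md)
            if not (skill_dir / "references" / name).exists()]
```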

### DEAD_CROSS_REF

**Trigger:** SKILL.md references another skill or agent by relative path and that path cannot be resolved from the skills/ directory.

**Problem:** Broken ecosystem links undermine the plugin's coherence score and may cause the model to attempt navigation to non-existent files.

**Fix:** Verify the referenced skill exists. Update the path or remove the reference.


## Elo Ranking

PluginEval uses an Elo/Bradley-Terry rating system to rank a skill against the gold corpus.

**Starting rating:** 1500 (the corpus median by convention).

**K-factor:** 32 (standard for moderate-stakes ratings).

Expected score formula (standard Elo):

```
E(A vs B) = 1 / (1 + 10^((B_rating − A_rating) / 400))
```

Rating update after each matchup:

```
new_rating = old_rating + 32 × (actual_score − expected_score)
```

where `actual_score` is 1.0 for a win, 0.5 for a draw, 0.0 for a loss.
Confidence intervals are computed via 500-sample bootstrap, reported as 95% CI. Corpus percentile reflects pairwise win rate against the gold corpus. **Position bias check:** Pairs are evaluated in both orders; disagreements are flagged.
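The two Elo formulas above can be sketched together (this is standard Elo, not PluginEval's own module):

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability-like expectation that A beats B under standard Elo."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_rating(old: float, actual: float, expected: float, k: int = 32) -> float:
    """Post-matchup update; actual is 1.0 win, 0.5 draw, 0.0 loss."""
    return old + k * (actual - expected)
```

A fresh skill at 1500 that beats a 1500-rated corpus entry moves to 1516.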
The `plugin-eval init` command builds the corpus index from a plugins directory:

```bash
plugin-eval init ./plugins --corpus-dir ~/.plugineval/corpus
```


## CLI Reference

### Score a skill (quick static analysis only)

```bash
plugin-eval score ./path/to/skill --depth quick
```

Returns Layer 1 results in < 2 seconds. Useful for fast feedback during authoring.

### Score with LLM judge (default)

```bash
plugin-eval score ./path/to/skill
```

Runs static + LLM judge (standard depth). Takes 30–90 seconds.

### Score with full output as JSON

```bash
plugin-eval score ./path/to/skill --output json
```

Emits structured JSON including `composite.score`, `composite.dimensions`, and `layers[0].anti_patterns`. Suitable for CI integration:

```bash
plugin-eval score ./path/to/skill --depth quick --output json --threshold 70
# exits with code 1 if score < 70
```

### Full certification (all three layers + Elo)

```bash
plugin-eval certify ./path/to/skill
```

Runs static + LLM judge + Monte Carlo (50 simulations) + Elo ranking. Takes 15–20 minutes. Assigns a quality badge. Use before publishing a skill to the marketplace.

### Head-to-head comparison

```bash
plugin-eval compare ./skill-a ./skill-b
```

Evaluates both skills at quick depth and prints a dimension-by-dimension comparison table. Useful for deciding between two implementations or measuring improvement before/after a rewrite.

### Initialize corpus for Elo

```bash
plugin-eval init ./plugins
```

Builds the local corpus index at `~/.plugineval/corpus`. Required before Elo ranking works.

## Scripting the Composite Formula

Reproduce the composite score offline (pre-commit hook, CI gate):

```python
def composite_score(dimension_scores: dict, anti_pattern_count: int = 0) -> float:
    """Replicate the PluginEval composite formula."""
    WEIGHTS = {
        "triggering_accuracy":    0.25,
        "orchestration_fitness":  0.20,
        "output_quality":         0.15,
        "scope_calibration":      0.12,
        "progressive_disclosure": 0.10,
        "token_efficiency":       0.06,
        "robustness":             0.05,
        "structural_completeness": 0.03,
        "code_template_quality":  0.02,
        "ecosystem_coherence":    0.02,
    }
    raw = sum(WEIGHTS[d] * s for d, s in dimension_scores.items())
    penalty = max(0.5, 1.0 - 0.05 * anti_pattern_count)
    return round(raw * 100 * penalty, 2)
```

### Example: a skill with a weak triggering score

```python
scores = {
    "triggering_accuracy": 0.65,   # D — needs description work
    "orchestration_fitness": 0.85,
    "output_quality": 0.80,
    # … fill in remaining 7 dimensions …
}

composite_score(scores, anti_pattern_count=1)  # → ~76.5
```

## JSON Output Format

Top-level shape of `--output json`:

```json
{
  "composite": { "score": 76.5, "badge": "Silver", "elo": null },
  "dimensions": {
    "triggering_accuracy": { "score": 0.65, "grade": "D", "ci_low": 0.60, "ci_high": 0.70 },
    "orchestration_fitness": { "score": 0.85, "grade": "B", "ci_low": 0.80, "ci_high": 0.90 }
  },
  "layers": [
    { "name": "static", "duration_ms": 1243, "anti_patterns": ["OVER_CONSTRAINED"] },
    { "name": "judge", "duration_ms": 48200, "judges": 1, "kappa": null }
  ]
}
```

Parse `composite.score` in CI to gate deployments:

```bash
score=$(plugin-eval score ./my-skill --output json | python3 -c "import sys,json; print(json.load(sys.stdin)['composite']['score'])")
if (( $(echo "$score < 70" | bc -l) )); then
  echo "Quality gate failed: score $score < 70"
  exit 1
fi
```


## Tips for Improving a Skill's Score

Work through dimensions in weight order. The largest gains come from fixing the top-weighted dimensions first.

### Which Dimension to Improve First

Use this table when a score report shows multiple D/F grades and you need to prioritize effort.

| Dimension | Weight | Typical fix effort | Score impact / hour | Fix first if… |
|---|---|---|---|---|
| `triggering_accuracy` | 0.25 | Low — description rewrite | High | Score < 70 overall |
| `orchestration_fitness` | 0.20 | Medium — restructure sections | High | Skill mixes worker + supervisor logic |
| `output_quality` | 0.15 | Medium — add examples | Medium | Judge score < 0.70 |
| `scope_calibration` | 0.12 | Low — move content to references/ | Medium | File is < 100 or > 800 lines |
| `progressive_disclosure` | 0.10 | Low — create references/ dir | Medium | No references/ directory exists |
| `token_efficiency` | 0.06 | Low — reduce MUST/ALWAYS/NEVER | Low | Anti-pattern count ≥ 3 |
| `robustness` | 0.05 | Low — add Troubleshooting section | Low | No edge-case handling documented |
| `structural_completeness` | 0.03 | Very low — add headings/code blocks | Low | Fewer than 4 H2 headings |
| `code_template_quality` | 0.02 | Very low — add language tags | Very low | Code blocks missing language tags |
| `ecosystem_coherence` | 0.02 | Very low — add Related section | Very low | No cross-references at all |

**Rule of thumb:** Fix `triggering_accuracy` before anything else — at weight 0.25 it delivers more composite-score gain per hour than all low-weight dimensions combined.

### Triggering Accuracy (weight 0.25)

- Include "Use this skill when..." followed by 3–4 comma-separated specific contexts.
- Add "proactively" if the skill should auto-activate without an explicit user request.
- Mental test: write 5 prompts that should trigger it and 5 that should not — does your description discriminate? If not, add or tighten the context phrases.

### Orchestration Fitness (weight 0.20)

- Document what the skill receives and what it returns — not what it orchestrates.
- Avoid "orchestrate", "coordinate", "dispatch", "manage workflow" in SKILL.md.
- Include an "Output format" section and 2+ code blocks showing concrete worker behavior.

### Output Quality (weight 0.15)

- Give specific, actionable instructions — not just goals.
- Cover at least one edge case explicitly (empty input, malformed data, etc.).
- Include an examples section showing representative inputs and expected outputs.
- The more concrete the instructions, the higher the judge will score this dimension.

### Scope Calibration (weight 0.12)

- Target 200–600 lines. Below 100 is a stub; above 800 without `references/` is bloat.
- Move background reading, extended examples, and reference tables to `references/`.
- Very narrow skills should be merged with a sibling; very broad ones should be split.

### Progressive Disclosure (weight 0.10)

- Add a `references/` directory (earns a 0.15–0.25 bonus) and keep SKILL.md focused on the execution path. An `assets/` directory adds a further bonus.

### Token Efficiency (weight 0.06)

- Audit MUST/ALWAYS/NEVER count. Target < 1 per 10 lines.
- Consolidate near-duplicate bullet points and repeated-structure tables.

### Robustness (weight 0.05)

- Add a "Troubleshooting" or "Edge Cases" section covering at least 3 failure modes.
- State what the skill returns when it cannot complete its task.

### Structural Completeness (weight 0.03)

- Ensure at least 4 H2/H3 headings, 3 code blocks, an Examples section, and a Troubleshooting section.

### Code Template Quality (weight 0.02)

- All code blocks must be syntactically valid and copy-paste ready with language tags.

### Ecosystem Coherence (weight 0.02)

- Add a "## Related" section listing sibling skills or agents with relative paths.
- Avoid duplicating content that already exists in another skill — link to it instead.

## Troubleshooting

### "Score is much lower than expected after adding content"

The anti-pattern penalty compounds. Run with `--output json` and inspect `layers[0].anti_patterns`. With five flags the multiplier already cuts your score to 75% of its raw value, and at ten or more it bottoms out at 50% regardless of how good the content is. Fix the flags first.

### "triggering_accuracy is low despite a detailed description"

The `_description_pushiness` scorer looks for specific syntactic patterns, not just length. Verify your description contains the phrase "Use this skill when" or "Use when" (exact phrasing matters — it's a regex match). Also check that you have multiple use cases separated by commas or "or" to earn the specificity bonus.

### "LLM judge scores vary significantly between runs"

This is expected for ambiguous skills. The judge generates 10 mental test prompts non-deterministically. Improve score stability by tightening the description and adding concrete examples. When `judges > 1`, averaged scores will be more stable. Use `--depth deep` with `certify`, which runs Monte Carlo to get statistically bounded scores.

### "progressive_disclosure score is low even though the file is the right length"

Confirm the file actually sits in the 200–600 line sweet spot; files shorter than 100 lines score only 0.20 on this sub-check. Also confirm that `references/` files are not empty — the scorer checks for non-empty reference files, not just the directory.

### "compare shows my rewrite scores lower than the original"

Quick depth (`--depth quick`) only runs static analysis. If the rewrite moved content to `references/` and shortened SKILL.md significantly, static scores for structural completeness may drop even though overall quality improved. Run `--depth standard` for a fairer comparison that includes the LLM judge's assessment of content quality.


## References

- Full Rubric Anchors — all 4 judge dimensions

## Related Agents

- **eval-judge** (`../../agents/eval-judge.md`) — the LLM judge that scores Layer 2 dimensions (`triggering_accuracy`, `orchestration_fitness`, `output_quality`, `scope_calibration`). Invoke directly when you need to re-run only the judge layer or inspect its reasoning.
- **eval-orchestrator** (`../../agents/eval-orchestrator.md`) — the top-level orchestrator that sequences all three layers, merges results, assigns badges, and writes the final report. Invoke when running a full certification pass or comparing two skills head-to-head.