# Evaluation Methodology
This document is the authoritative reference for how PluginEval measures plugin and skill quality.
It covers the three evaluation layers, all ten scoring dimensions, the composite formula, badge
thresholds, anti-pattern flags, Elo ranking, and actionable improvement tips.
Related: Full rubric anchors
## The Three Evaluation Layers
PluginEval stacks three complementary layers. Each layer produces a score between 0.0 and 1.0 for
each applicable dimension, and later layers override or blend with earlier ones according to
per-dimension blend weights.
### Layer 1 — Static Analysis
Speed: < 2 seconds. No LLM calls. Deterministic.
The static analyzer (`layers/static.py`) runs six sub-checks directly against the parsed SKILL.md:

- Name presence, description length, trigger-phrase quality
- Output/input documentation, code block count, orchestrator anti-pattern
- Line count vs. sweet-spot (200–600 lines), `references/` and `assets/` bonuses
- Heading density, code blocks, examples section, troubleshooting section
- MUST/NEVER/ALWAYS density, duplicate-line repetition ratio
- Cross-references to other skills/agents, "related"/"see also" mentions
These six sub-checks feed directly into six of the ten final dimensions (via the `STATIC_TO_DIMENSION` mapping). The remaining four dimensions — `output_quality`, `scope_calibration`, `robustness`, and part of `triggering_accuracy` — receive no static contribution and rely entirely on Layer 2 and/or Layer 3.

The anti-pattern penalty is applied multiplicatively to the Layer 1 score:

```
penalty = max(0.5, 1.0 − 0.05 × anti_pattern_count)
```

Each additional detected anti-pattern reduces the score by 5%, flooring at 50%.
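For intuition, the penalty curve can be tabulated with a few lines of Python (a sketch mirroring the formula above, not the analyzer's actual code):

```python
def anti_pattern_penalty(count: int) -> float:
    """Multiplicative Layer 1 penalty: 5% per detected flag, floored at 50%."""
    return max(0.5, 1.0 - 0.05 * count)

# 0 flags keeps the full score; 10 or more flags hit the 0.5 floor
for n in (0, 1, 3, 10, 20):
    print(n, anti_pattern_penalty(n))
```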
### Layer 2 — LLM Judge
Speed: 30–90 seconds. One or more LLM calls (Sonnet by default). Non-deterministic.
The `eval-judge` agent reads the SKILL.md and any `references/` files, then scores four dimensions using anchored rubrics (see references/rubrics.md):

- Triggering accuracy — F1 score derived from 10 mental test prompts
- Orchestration fitness — Worker purity assessment (0–1 rubric)
- Output quality — Simulates 3 realistic tasks; assesses instruction quality
- Scope calibration — Judges depth and breadth relative to the skill's category

The judge returns a structured JSON object (no markdown fences) that the eval engine merges into the composite. When `judges > 1`, scores are averaged and Cohen's kappa is reported as an inter-judge agreement metric.
### Layer 3 — Monte Carlo Simulation
Speed: 5–20 minutes. N=50 simulated Agent SDK invocations (default). Statistical.
Monte Carlo runs `N` real prompts through the skill and records:

- Activation rate — Fraction of prompts that triggered the skill
- Output consistency — Coefficient of variation (CV) across quality scores
- Failure rate — Error/crash fraction with Clopper-Pearson exact CIs
- Token efficiency — Median token count, IQR, outlier count

The Layer 3 composite formula:

```
mc_score = 0.40 × activation_rate
         + 0.30 × (1 − min(1.0, CV))
         + 0.20 × (1 − failure_rate)
         + 0.10 × efficiency_norm
```

where `efficiency_norm = max(0, 1 − median_tokens / 8000)`.
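The formula translates line-for-line into Python (a sketch; the inputs are assumed to be the already-collected simulation statistics):

```python
def mc_score(activation_rate: float, cv: float,
             failure_rate: float, median_tokens: int) -> float:
    """Layer 3 composite, per the formula above."""
    efficiency_norm = max(0.0, 1.0 - median_tokens / 8000)
    return (0.40 * activation_rate
            + 0.30 * (1.0 - min(1.0, cv))
            + 0.20 * (1.0 - failure_rate)
            + 0.10 * efficiency_norm)

# e.g. 90% activation, CV 0.2, 4% failures, 3200 median tokens
print(round(mc_score(0.90, 0.2, 0.04, 3200), 3))
```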
## Composite Scoring Formula
The final score is a weighted blend across all three layers for each dimension, then summed:

```
composite = Σ(dimension_weight × blended_dimension_score) × 100 × anti_pattern_penalty
```

### Dimension Weights
| Dimension | Weight | Why it matters |
|---|---|---|
| `triggering_accuracy` | 0.25 | A skill that never fires — or fires incorrectly — has no value |
| `orchestration_fitness` | 0.20 | Skills must be pure workers; supervisor logic belongs in agents |
| `output_quality` | 0.15 | Correct, complete output is the primary deliverable |
| `scope_calibration` | 0.12 | Neither a stub nor a bloated monster |
| `progressive_disclosure` | 0.10 | SKILL.md is lean; detail lives in references/ |
| `token_efficiency` | 0.06 | Minimal context waste per invocation |
| `robustness` | 0.05 | Handles edge cases without crashing |
| `structural_completeness` | 0.03 | Correct sections in the right order |
| `code_template_quality` | 0.02 | Working, copy-paste-ready examples |
| `ecosystem_coherence` | 0.02 | Cross-references; no duplication with siblings |
### Layer Blend Weights
Each dimension draws from different layers at different ratios. With all three layers active (`--depth deep` or `certify`):

| Dimension | Static | Judge | Monte Carlo |
|---|---|---|---|
| `triggering_accuracy` | 0.15 | 0.25 | 0.60 |
| `orchestration_fitness` | 0.10 | 0.70 | 0.20 |
| `output_quality` | 0.00 | 0.40 | 0.60 |
| `scope_calibration` | 0.30 | 0.55 | 0.15 |
| `progressive_disclosure` | 0.80 | 0.20 | 0.00 |
| `token_efficiency` | 0.40 | 0.10 | 0.50 |
| `robustness` | 0.00 | 0.20 | 0.80 |
| `structural_completeness` | 0.90 | 0.10 | 0.00 |
| `code_template_quality` | 0.30 | 0.70 | 0.00 |
| `ecosystem_coherence` | 0.85 | 0.15 | 0.00 |

At `--depth standard` (static + judge only), blends are renormalized to drop the Monte Carlo column. At `--depth quick` (static only), all weight falls on Layer 1.
### Blended Score Calculation
For a given depth, the blended score for dimension `d` is:

```
blended[d] = Σ( layer_weight[d][layer] × layer_score[d][layer] )
             ─────────────────────────────────────────────────────
             Σ( layer_weight[d][layer] for available layers )
```

This normalization ensures that skipping Monte Carlo at standard depth doesn't artificially deflate scores.
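The renormalization is easy to replicate (a sketch; the layer names and example weights below are illustrative):

```python
def blended(layer_weights: dict, layer_scores: dict) -> float:
    """Blend the available layer scores for one dimension,
    renormalizing weights over the layers that actually ran."""
    ran = [layer for layer in layer_weights if layer in layer_scores]
    total = sum(layer_weights[layer] for layer in ran)
    return sum(layer_weights[layer] * layer_scores[layer] for layer in ran) / total

weights = {"static": 0.15, "judge": 0.25, "monte_carlo": 0.60}
print(blended(weights, {"static": 0.8, "judge": 0.7, "monte_carlo": 0.9}))  # deep: all layers
print(blended(weights, {"static": 0.8, "judge": 0.7}))  # standard: renormalized over 0.40
```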
## Interpreting Dimension Scores
Each dimension score is a float in `[0.0, 1.0]`. The CLI converts it to a letter grade:

| Grade | Score range | Meaning |
|---|---|---|
| A | 0.90 – 1.00 | Excellent — no meaningful improvement needed |
| B | 0.80 – 0.89 | Good — minor gaps only |
| C | 0.70 – 0.79 | Adequate — one or two clear improvement areas |
| D | 0.60 – 0.69 | Marginal — needs targeted work |
| F | < 0.60 | Failing — significant remediation required |
When reading a report, focus first on the lowest-graded dimension that has the highest weight. A D in `triggering_accuracy` (weight 0.25) costs far more than a D in `ecosystem_coherence` (weight 0.02).

Confidence intervals appear in the report when Layer 2 or Layer 3 ran. Narrow CIs (± < 5 points) indicate stable scores. Wide CIs suggest inconsistency — often caused by an ambiguous description or instructions that work for some prompt styles but not others.
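The banding in the table can be expressed as a small helper (a sketch, not the CLI's actual implementation):

```python
def letter_grade(score: float) -> str:
    """Map a dimension score in [0.0, 1.0] to the letter grades above."""
    for grade, floor in (("A", 0.90), ("B", 0.80), ("C", 0.70), ("D", 0.60)):
        if score >= floor:
            return grade
    return "F"

print(letter_grade(0.65))  # D
print(letter_grade(0.92))  # A
```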
## Quality Badges
Badges require both a composite score threshold AND an Elo threshold (when Elo is available). The `Badge.from_scores()` logic checks composite first, then Elo if provided:

| Badge | Composite | Elo | Meaning |
|---|---|---|---|
| Platinum ★★★★★ | ≥ 90 | ≥ 1600 | Reference quality — suitable for gold corpus |
| Gold ★★★★ | ≥ 80 | ≥ 1500 | Production ready |
| Silver ★★★ | ≥ 70 | ≥ 1400 | Functional, has improvement opportunities |
| Bronze ★★ | ≥ 60 | ≥ 1300 | Minimum viable — not yet recommended for users |
| — | < 60 | any | Does not meet minimum bar |
The Elo threshold is skipped when Elo has not been computed (i.e., at quick or standard depth without `certify`). A skill can earn a badge on composite score alone in those cases.
## Anti-Pattern Flags
The static analyzer detects five anti-patterns. Each carries a severity multiplier that feeds
into the penalty formula.
### OVER_CONSTRAINED
Trigger: More than 15 occurrences of MUST, ALWAYS, or NEVER in the SKILL.md.
Problem: Overly prescriptive instructions reduce model flexibility, increase token overhead,
and signal that the author is trying to micromanage every output rather than providing
principled guidance.
Fix: Audit every MUST/ALWAYS/NEVER. Replace directive language with explanatory framing
where possible. Reserve hard constraints for genuine safety or correctness requirements. Target
fewer than 10 such directives per 100 lines.
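A quick audit of directive density can be scripted (a sketch; the real analyzer's counting rules may differ):

```python
import re

def directive_density(skill_md: str) -> tuple[int, float]:
    """Count MUST/ALWAYS/NEVER occurrences and report the rate per 100 lines."""
    count = len(re.findall(r"\b(?:MUST|ALWAYS|NEVER)\b", skill_md))
    lines = max(1, skill_md.count("\n") + 1)
    return count, 100.0 * count / lines

# the flag fires above 15 total; the fix targets < 10 per 100 lines
```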
### EMPTY_DESCRIPTION
Trigger: The frontmatter `description` field is fewer than 20 characters after stripping.

Problem: Without a meaningful description, the Claude Code plugin system cannot determine when to invoke the skill. The skill becomes invisible to autonomous invocation.
Fix: Write a description of at least 60–120 characters that includes:
- A "Use this skill when..." or "Use when..." trigger clause
- Two or more concrete contexts separated by commas or "or"
### MISSING_TRIGGER
Trigger: The description does not contain "use when", "use this skill when",
"use proactively", or "trigger when" (case-insensitive).
Problem: Even a long description is useless for autonomous invocation if it doesn't
include a clear trigger signal. The system's routing model needs an explicit cue.
Fix: Prepend "Use this skill when..." to the description, followed by specific scenarios.
Example: "Use this skill when measuring plugin quality, interpreting score reports, or
explaining badge thresholds to a team."
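You can self-check a candidate description against the same phrases the flag looks for (a sketch):

```python
import re

TRIGGER_RE = re.compile(
    r"use when|use this skill when|use proactively|trigger when",
    re.IGNORECASE,
)

def has_trigger_clause(description: str) -> bool:
    """True if the description contains a recognized trigger phrase."""
    return bool(TRIGGER_RE.search(description))

print(has_trigger_clause("Use this skill when measuring plugin quality"))  # True
print(has_trigger_clause("A detailed helper for plugin quality"))          # False
```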
### BLOATED_SKILL
Trigger: SKILL.md exceeds 800 lines AND the skill has no `references/` directory.

Problem: A monolithic SKILL.md forces the entire document into context on every invocation, wasting tokens on content only needed in edge cases.

Fix: Create a `references/` directory and move supporting material there:

- Detailed rubrics → references/rubrics.md
- Extended examples → references/examples.md
- Configuration reference → references/config.md

The SKILL.md should link to these files with `[text](references/filename.md)` so the model can fetch them on demand.
### ORPHAN_REFERENCE
Trigger: SKILL.md contains a markdown link `[text](references/filename)` where `filename` does not exist in the `references/` directory.

Problem: Dead links waste tokens on context that will never resolve and confuse the model.

Fix: Either create the missing reference file or remove the dead link.
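Orphans are easy to detect before the analyzer does (a sketch, assuming the standard skill layout of a SKILL.md beside a references/ directory):

```python
import re
from pathlib import Path

LINK_RE = re.compile(r"\[[^\]]*\]\((references/[^)]+)\)")

def orphan_references(skill_dir: str) -> list[str]:
    """List references/ links in SKILL.md whose target file does not exist."""
    root = Path(skill_dir)
    text = (root / "SKILL.md").read_text(encoding="utf-8")
    return [link for link in LINK_RE.findall(text)
            if not (root / link).is_file()]
```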
### DEAD_CROSS_REF
Trigger: SKILL.md references another skill or agent by relative path and that path
cannot be resolved from the skills/ directory.
Problem: Broken ecosystem links undermine the plugin's coherence score and may cause
the model to attempt navigation to non-existent files.
Fix: Verify the referenced skill exists. Update the path or remove the reference.
## Elo Ranking
PluginEval uses an Elo/Bradley-Terry rating system to rank a skill against the gold corpus.

Starting rating: 1500 (the corpus median by convention). K-factor: 32 (standard for moderate-stakes ratings).

Expected score formula (standard Elo):

```
E(A vs B) = 1 / (1 + 10^((B_rating − A_rating) / 400))
```

Rating update after each matchup:

```
new_rating = old_rating + 32 × (actual_score − expected_score)
```

where `actual_score` is 1.0 for a win, 0.5 for a draw, 0.0 for a loss.

Confidence intervals are computed via 500-sample bootstrap, reported as 95% CI. Corpus percentile reflects pairwise win rate against the gold corpus. Position bias check: pairs are evaluated in both orders; disagreements are flagged.

The `plugin-eval init` command builds the corpus index from a plugins directory:

```bash
plugin-eval init ./plugins --corpus-dir ~/.plugineval/corpus
```
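Both formulas translate directly into Python (a sketch):

```python
def expected_score(a_rating: float, b_rating: float) -> float:
    """P(A beats B) under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((b_rating - a_rating) / 400))

def update_rating(rating: float, actual: float, expected: float, k: int = 32) -> float:
    """Post-matchup rating; actual is 1.0 win / 0.5 draw / 0.0 loss."""
    return rating + k * (actual - expected)

e = expected_score(1500, 1500)      # evenly matched skills: 0.5
print(update_rating(1500, 1.0, e))  # the winner moves to 1516.0
```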
## CLI Reference
### Score a skill (quick static analysis only)
```bash
plugin-eval score ./path/to/skill --depth quick
```

Returns Layer 1 results in < 2 seconds. Useful for fast feedback during authoring.
### Score with LLM judge (default)
```bash
plugin-eval score ./path/to/skill
```

Runs static + LLM judge (standard depth). Takes 30–90 seconds.
### Score with full output as JSON
```bash
plugin-eval score ./path/to/skill --output json
```

Emits structured JSON including `composite.score`, `composite.dimensions`, and `layers[0].anti_patterns`. Suitable for CI integration:

```bash
plugin-eval score ./path/to/skill --depth quick --output json --threshold 70
# exits with code 1 if score < 70
```
### Full certification (all three layers + Elo)
```bash
plugin-eval certify ./path/to/skill
```

Runs static + LLM judge + Monte Carlo (50 simulations) + Elo ranking. Takes 15–20 minutes. Assigns a quality badge. Use before publishing a skill to the marketplace.
### Head-to-head comparison
```bash
plugin-eval compare ./skill-a ./skill-b
```

Evaluates both skills at quick depth and prints a dimension-by-dimension comparison table. Useful for deciding between two implementations or measuring improvement before/after a rewrite.
### Initialize corpus for Elo
```bash
plugin-eval init ./plugins
```

Builds the local corpus index at `~/.plugineval/corpus`. Required before Elo ranking works.
## Scripting the Composite Formula
Reproduce the composite score offline (pre-commit hook, CI gate):

```python
def composite_score(dimension_scores: dict, anti_pattern_count: int = 0) -> float:
    """Replicate the PluginEval composite formula."""
    WEIGHTS = {
        "triggering_accuracy":     0.25,
        "orchestration_fitness":   0.20,
        "output_quality":          0.15,
        "scope_calibration":       0.12,
        "progressive_disclosure":  0.10,
        "token_efficiency":        0.06,
        "robustness":              0.05,
        "structural_completeness": 0.03,
        "code_template_quality":   0.02,
        "ecosystem_coherence":     0.02,
    }
    raw = sum(WEIGHTS[d] * s for d, s in dimension_scores.items())
    penalty = max(0.5, 1.0 - 0.05 * anti_pattern_count)
    return round(raw * 100 * penalty, 2)
```
### Example: a skill with a weak triggering score

```python
scores = {
    "triggering_accuracy": 0.65,   # D — needs description work
    "orchestration_fitness": 0.85,
    "output_quality": 0.80,
    # … fill in remaining 7 dimensions …
}
```
```python
composite_score(scores, anti_pattern_count=1)   # → ~76.5
```
## JSON Output Format
Top-level shape of `--output json`:

```json
{
  "composite": { "score": 76.5, "badge": "Silver", "elo": null },
  "dimensions": {
    "triggering_accuracy": { "score": 0.65, "grade": "D", "ci_low": 0.60, "ci_high": 0.70 },
    "orchestration_fitness": { "score": 0.85, "grade": "B", "ci_low": 0.80, "ci_high": 0.90 }
  },
  "layers": [
    { "name": "static", "duration_ms": 1243, "anti_patterns": ["OVER_CONSTRAINED"] },
    { "name": "judge", "duration_ms": 48200, "judges": 1, "kappa": null }
  ]
}
```

Parse `composite.score` in CI to gate deployments:

```bash
score=$(plugin-eval score ./my-skill --output json | python3 -c "import sys,json; print(json.load(sys.stdin)['composite']['score'])")
if (( $(echo "$score < 70" | bc -l) )); then
  echo "Quality gate failed: score $score < 70"
  exit 1
fi
```
{
"composite": { "score": 76.5, "badge": "Silver", "elo": null },
"dimensions": {
"triggering_accuracy": { "score": 0.65, "grade": "D", "ci_low": 0.60, "ci_high": 0.70 },
"orchestration_fitness": { "score": 0.85, "grade": "B", "ci_low": 0.80, "ci_high": 0.90 }
},
"layers": [
{ "name": "static", "duration_ms": 1243, "anti_patterns": ["OVER_CONSTRAINED"] },
{ "name": "judge", "duration_ms": 48200, "judges": 1, "kappa": null }
]
}在CI中解析以实现部署门禁:
composite.scorebash
score=$(plugin-eval score ./my-skill --output json | python3 -c "import sys,json; print(json.load(sys.stdin)['composite']['score'])")
if (( $(echo "$score < 70" | bc -l) )); then
echo "Quality gate failed: score $score < 70"
exit 1
## Tips for Improving a Skill's Score
Work through dimensions in weight order. The largest gains come from fixing the top-weighted
dimensions first.
### Which Dimension to Improve First
Use this table when a score report shows multiple D/F grades and you need to prioritize effort.
| Dimension | Weight | Typical fix effort | Score impact / hour | Fix first if… |
|---|---|---|---|---|
| `triggering_accuracy` | 0.25 | Low — description rewrite | High | Score < 70 overall |
| `orchestration_fitness` | 0.20 | Medium — restructure sections | High | Skill mixes worker + supervisor logic |
| `output_quality` | 0.15 | Medium — add examples | Medium | Judge score < 0.70 |
| `scope_calibration` | 0.12 | Low — move content to references/ | Medium | File is < 100 or > 800 lines |
| `progressive_disclosure` | 0.10 | Low — create references/ dir | Medium | No references/ directory exists |
| `token_efficiency` | 0.06 | Low — reduce MUST/ALWAYS/NEVER | Low | Anti-pattern count ≥ 3 |
| `robustness` | 0.05 | Low — add Troubleshooting section | Low | No edge-case handling documented |
| `structural_completeness` | 0.03 | Very low — add headings/code blocks | Low | Fewer than 4 H2 headings |
| `code_template_quality` | 0.02 | Very low — add language tags | Very low | Code blocks missing language tags |
| `ecosystem_coherence` | 0.02 | Very low — add Related section | Very low | No cross-references at all |
Rule of thumb: Fix `triggering_accuracy` before anything else — at weight 0.25 it delivers more composite-score gain per hour than all low-weight dimensions combined.
### Triggering Accuracy (weight 0.25)
- Include "Use this skill when..." followed by 3–4 comma-separated specific contexts.
- Add "proactively" if the skill should auto-activate without an explicit user request.
- Mental test: write 5 prompts that should trigger it and 5 that should not — does your description discriminate? If not, add or tighten the context phrases.
### Orchestration Fitness (weight 0.20)
- Document what the skill receives and what it returns — not what it orchestrates.
- Avoid "orchestrate", "coordinate", "dispatch", "manage workflow" in SKILL.md.
- Include an "Output format" section and 2+ code blocks showing concrete worker behavior.
### Output Quality (weight 0.15)
- Give specific, actionable instructions — not just goals.
- Cover at least one edge case explicitly (empty input, malformed data, etc.).
- Include an examples section showing representative inputs and expected outputs.
- The more concrete the instructions, the higher the judge will score this dimension.
### Scope Calibration (weight 0.12)
- Target 200–600 lines. Below 100 is a stub; above 800 without `references/` is bloat.
- Move background reading, extended examples, and reference tables to `references/`.
- Very narrow skills should be merged with a sibling; very broad ones should be split.
### Progressive Disclosure (weight 0.10)
- Add a `references/` directory (earns a 0.15–0.25 bonus) and keep SKILL.md focused on the execution path. An `assets/` directory adds a further bonus.
### Token Efficiency (weight 0.06)
- Audit MUST/ALWAYS/NEVER count. Target < 1 per 10 lines.
- Consolidate near-duplicate bullet points and repeated-structure tables.
### Robustness (weight 0.05)
- Add a "Troubleshooting" or "Edge Cases" section covering at least 3 failure modes.
- State what the skill returns when it cannot complete its task.
### Structural Completeness (weight 0.03)
- Ensure at least 4 H2/H3 headings, 3 code blocks, an Examples section, and a Troubleshooting section.
### Code Template Quality (weight 0.02)
- All code blocks must be syntactically valid and copy-paste ready with language tags.
### Ecosystem Coherence (weight 0.02)
- Add a "## Related" section listing sibling skills or agents with relative paths.
- Avoid duplicating content that already exists in another skill — link to it instead.
## Troubleshooting
"Score is much lower than expected after adding content"
The anti-pattern penalty compounds. Run with `--output json` and inspect `layers[0].anti_patterns`. If you have 5+ anti-patterns, the multiplier can reduce your score to 75% of its raw value regardless of how good the content is. Fix the flags first.
--output jsonlayers[0].anti_patterns"triggering_accuracy is low despite a detailed description"
The `_description_pushiness` scorer looks for specific syntactic patterns, not just length. Verify your description contains the phrase "Use this skill when" or "Use when" (exact phrasing matters — it's a regex match). Also check that you have multiple use cases separated by commas or "or" to earn the specificity bonus.
_description_pushiness_description_pushiness"LLM judge scores vary significantly between runs"
This is expected for ambiguous skills. The judge generates 10 mental test prompts non-deterministically. Improve score stability by tightening the description and adding concrete examples. When `judges > 1`, averaged scores will be more stable. Use `certify` with `--depth deep`, which runs Monte Carlo to get statistically-bounded scores.
judges > 1--depth deepcertify对于模糊的Skill,这种情况是正常的。评估器会非确定性地生成10个测试提示。通过收紧描述和添加具体示例可提升分数稳定性。当时,平均分数会更稳定。使用模式的命令,会运行蒙特卡洛模拟以获得有统计边界的分数。
judges > 1--depth deepcertify"progressive_disclosure score is low even though the file is the right length"
Check whether the file is in the 200–600 line sweet spot. Files shorter than 100 lines score only 0.20 on this sub-check. Also confirm that `references/` files are not empty — the scorer checks for non-empty reference files, not just the directory.
references/检查文件是否处于200-600行的最优区间。少于100行的文件在此子检查中仅得0.20分。同时确认目录下的文件非空——评分器会检查非空的参考文件,而非仅目录存在。
references/"compare shows my rewrite scores lower than the original"
Quick depth (`--depth quick`) only runs static analysis. If the rewrite moved content to `references/` and shortened SKILL.md significantly, static scores for structural completeness may drop even though overall quality improved. Run `--depth standard` for a fairer comparison that includes the LLM judge's assessment of content quality.
--depth quickreferences/--depth standardQuick深度()仅运行静态分析。若重写将内容移至并大幅缩短SKILL.md,结构完整性的静态分数可能会下降,尽管整体质量有所提升。使用模式进行更公平的比较,该模式会包含LLM评估器对内容质量的评估。
## References
- Full Rubric Anchors — all 4 judge dimensions
## Related Agents
- eval-judge (`../../agents/eval-judge.md`) — the LLM judge that scores the Layer 2 dimensions (`triggering_accuracy`, `orchestration_fitness`, `output_quality`, `scope_calibration`). Invoke directly when you need to re-run only the judge layer or inspect its reasoning.
- eval-orchestrator (`../../agents/eval-orchestrator.md`) — the top-level orchestrator that sequences all three layers, merges results, assigns badges, and writes the final report. Invoke when running a full certification pass or comparing two skills head-to-head.