skill-distiller

Skill Distiller

Transform skills authored for high-capability models (Opus) into deterministic workflows that execute reliably on lower-cost models (Sonnet, Haiku). The core insight from EvoSkills is that skills encode reusable task structure, not model-specific artifacts. A skill evolved on Opus transfers to other models with gains of 35 to 45 percentage points, but only when the instructions are deterministic enough that lower-capability models can follow them without improvising.

Reference Files

| File | Contents | Load When |
| --- | --- | --- |
| references/distillation-patterns.md | Pattern catalog for converting reasoning to rules | Always |

Prerequisites

  • The source skill must exist and pass package-evaluator at >= 70%
  • Access to both the source model (Opus) and target model (Haiku/Sonnet) for validation
  • The surrogate-verifier skill for cross-model assertion checking

Workflow

Phase 1: Complexity Analysis

Score each section of the source SKILL.md for reasoning difficulty:

| Complexity Signal | Score | Distillation Action |
| --- | --- | --- |
| Decision tree with 3+ branches | HIGH | Convert to explicit if/then lookup table |
| "Use judgment" or "consider context" | HIGH | Replace with concrete heuristic rules |
| Multi-step inference chain | HIGH | Break into numbered atomic steps |
| Reference to domain expertise | MED | Add explicit reference file with knowledge |
| Clear enumerated steps | LOW | Keep as-is |
| Concrete examples with expected output | LOW | Keep as-is |

Produce a complexity map: section name -> complexity score -> planned action.
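
The complexity map can be held as a plain list of records. The sketch below is illustrative only: the signal keys, `SIGNAL_TABLE`, `MapEntry`, and `build_complexity_map` are hypothetical names, and it assumes each section has already been labeled by hand with one signal from the scoring table.

```python
from typing import NamedTuple

# Hypothetical encoding of the scoring table: signal key -> (score, action).
SIGNAL_TABLE = {
    "decision_tree_3plus": ("HIGH", "Convert to explicit if/then lookup table"),
    "use_judgment": ("HIGH", "Replace with concrete heuristic rules"),
    "multi_step_inference": ("HIGH", "Break into numbered atomic steps"),
    "domain_expertise": ("MED", "Add explicit reference file with knowledge"),
    "enumerated_steps": ("LOW", "Keep as-is"),
    "concrete_examples": ("LOW", "Keep as-is"),
}


class MapEntry(NamedTuple):
    section: str
    score: str
    action: str


def build_complexity_map(labeled_sections: dict) -> list:
    """Turn {section name: signal key} into section -> score -> action records."""
    return [
        MapEntry(name, *SIGNAL_TABLE[signal])
        for name, signal in labeled_sections.items()
    ]


# Example section names are made up for illustration.
complexity_map = build_complexity_map(
    {"Workflow": "use_judgment", "Output Format": "enumerated_steps"}
)
for entry in complexity_map:
    print(f"{entry.section} -> {entry.score} -> {entry.action}")
```

Keeping the table as data rather than branching code means a new signal can be added without touching the mapping logic.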

Phase 2: Trace Collection

Execute the source skill with Opus on 5 representative tasks:
  1. Select tasks from evals/cases.yaml (positive cases) or generate new ones
  2. For each task, capture the full execution trace:
    • Tool calls made (which tools, in what order)
    • Intermediate reasoning visible in output
    • Final output structure and content
    • Time taken and token usage
  3. Store traces as structured data for pattern extraction
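
One possible shape for a stored trace is sketched below. The `ToolCall` and `ExecutionTrace` classes and all field names are assumptions made for illustration, not a format defined by this skill, and the values are placeholders rather than real measurements.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ToolCall:
    tool: str   # which tool was invoked, e.g. "Read"
    order: int  # position in the call sequence, starting at 1


@dataclass
class ExecutionTrace:
    task_id: str
    tool_calls: list = field(default_factory=list)
    reasoning: str = ""       # intermediate reasoning visible in output
    final_output: str = ""    # final output structure and content
    seconds: float = 0.0      # time taken
    tokens: int = 0           # token usage


# Placeholder trace for a single task.
trace = ExecutionTrace(
    task_id="case-01",
    tool_calls=[ToolCall("Read", 1), ToolCall("Grep", 2)],
    reasoning="Checked imports before scanning function bodies.",
    final_output="| File | Issue |",
    seconds=12.5,
    tokens=1800,
)

# asdict() recurses into nested dataclasses, so the result is JSON-serializable.
serialized = json.dumps(asdict(trace), indent=2)
print(serialized)
```

Storing the tool calls with an explicit order makes the later tool-sequence comparison a simple list comparison across tasks.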

Phase 3: Pattern Extraction

From the collected traces, extract deterministic patterns:
  1. Decision paths — For each HIGH-complexity section, find the actual decisions Opus made across the 5 tasks. If Opus chose the same path in 4/5 cases, that path becomes the default rule
  2. Lookup tables — Where Opus applied domain knowledge, build explicit lookup tables (e.g., "if input contains SQL, use these patterns; if input contains Python, use those")
  3. Concrete examples — Extract representative input/output pairs from traces to serve as few-shot examples in the distilled skill
  4. Tool sequences — Identify the common tool invocation pattern and make it explicit ("Step 1: Read the file. Step 2: Grep for pattern X. Step 3: Write output.")
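
The decision-path rule above can be sketched as a majority check. This is a minimal illustration, assuming each trace has been reduced to a single decision label per section; the function name `default_rule` and the choice to return `None` when there is no consensus are assumptions, since the text does not say what happens below the 4/5 mark.

```python
from collections import Counter
from typing import Optional


def default_rule(decisions: list, min_count: int = 4) -> Optional[str]:
    """Return the most common decision if at least min_count traces agree.

    Returns None when no decision reaches the bar, which this sketch treats
    as a signal that the section needs manual handling (an assumption).
    """
    if not decisions:
        return None
    choice, count = Counter(decisions).most_common(1)[0]
    return choice if count >= min_count else None


# Hypothetical decision labels taken from 5 traces of one HIGH-complexity section.
agreed = default_rule(["table", "table", "table", "list", "table"])
split = default_rule(["table", "list", "table", "list", "prose"])
print(agreed)  # table
print(split)   # None
```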

Phase 4: Distilled Rewrite

Rewrite the SKILL.md applying all distillation actions from Phase 1:

| Source Pattern | Distilled Replacement |
| --- | --- |
| "Analyze the code and determine..." | "Check for these 5 specific patterns: [list]" |
| "Use appropriate formatting" | "Output as a markdown table with columns: [A, B, C]" |
| "Consider the context to decide..." | "If [condition A]: do X. If [condition B]: do Y. Default: Z" |
| "Apply best practices for..." | Reference file with explicit best practices enumerated |
| Multi-paragraph reasoning instruction | Numbered step list with single-sentence steps |

Rules for the rewrite:
  • Every instruction must be actionable by a model with no domain expertise
  • No step should require inference — each step's input and output must be explicit
  • Replace all "consider", "analyze", "determine" verbs with "check", "count", "list", "output"
  • Add concrete examples for any step that could be ambiguous
  • Keep the SKILL.md under 500 lines (distillation should reduce, not expand)
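
Two of these rules, the verb replacement and the 500-line limit, are mechanical enough to check automatically. The following is a rough sketch of such a check; `lint_distilled_skill` is a hypothetical helper, and the regular expression covers only the three verbs named above with their common inflections, so it will miss synonyms.

```python
import re

MAX_LINES = 500  # the skill must stay under this many lines

# Matches consider/analyze/determine and simple inflections, but not
# unrelated words such as "deterministic".
VAGUE_VERBS = re.compile(
    r"\b(consider(s|ed|ing)?|analy[sz](e|es|ed|ing)|determin(e|es|ed|ing))\b",
    re.IGNORECASE,
)


def lint_distilled_skill(text: str) -> list:
    """Return a list of human-readable rule violations (empty if none found)."""
    problems = []
    lines = text.splitlines()
    if len(lines) >= MAX_LINES:
        problems.append(f"{len(lines)} lines: must be under {MAX_LINES}")
    for number, line in enumerate(lines, start=1):
        for match in VAGUE_VERBS.finditer(line):
            problems.append(
                f"line {number}: replace '{match.group(0)}' "
                "with check/count/list/output"
            )
    return problems


sample = (
    "1. Check the file.\n"
    "2. Analyze the code and determine the issue.\n"
    "3. The output is deterministic."
)
problems = lint_distilled_skill(sample)
for problem in problems:
    print(problem)
```

The remaining rules (no inference required, examples for ambiguous steps) still need a human or model review; this check only catches the surface-level cases.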

Phase 5: Target Model Validation

Run the distilled skill on the target model (Haiku or Sonnet):
  1. Execute the same 5 tasks from Phase 2 with the distilled skill loaded
  2. Use the surrogate-verifier to generate assertions for each task output
  3. Compare pass rates:

| Metric | Source (Opus + original) | Target (Haiku + distilled) | Delta |
| --- | --- | --- | --- |
| Assertions passed | N/M | N/M | ± |
| Weighted score | X.XX | X.XX | ± |
| Output completeness | % | % | ± |
| Format compliance | % | % | ± |

  4. If target model score < 80% of source model score, iterate:
    • Identify which assertions the target model fails
    • Add more explicit instructions for those specific failure points
    • Re-run validation (max 3 iterations)
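
The iterate-until-passing step can be sketched as a bounded loop. Everything here is a hypothetical illustration: `run_target` and `tighten` stand in for re-running validation and adding more explicit instructions, the scores are invented, and "max 3 iterations" is read as up to three re-runs after the initial validation, which is an assumption.

```python
from typing import Callable, Tuple


def distill_until_passing(
    source_score: float,
    run_target: Callable[[], float],
    tighten: Callable[[], None],
    ratio: float = 0.8,
    max_iterations: int = 3,
) -> Tuple[bool, float]:
    """Re-run validation until the target reaches ratio * source_score.

    Stops after max_iterations re-runs even if the bar is not met.
    Returns (passed, last_target_score).
    """
    score = run_target()
    iterations = 0
    while score < ratio * source_score and iterations < max_iterations:
        tighten()             # add explicit instructions for failing assertions
        score = run_target()  # re-run validation on the target model
        iterations += 1
    return score >= ratio * source_score, score


# Simulated target-model scores for successive runs (invented values).
simulated_scores = iter([0.60, 0.75, 0.85])
tighten_calls = []

passed, final_score = distill_until_passing(
    source_score=1.00,
    run_target=lambda: next(simulated_scores),
    tighten=lambda: tighten_calls.append("tightened"),
)
print(passed, final_score, len(tighten_calls))  # True 0.85 2
```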

Phase 6: Cross-Model Report

Produce the final comparison report in markdown, following this layout:

Skill Distillation Report: <skill-name>

Complexity Reduction

  • Sections distilled: N/M (HIGH → LOW)
  • Instruction word count: original X → distilled Y (Z% reduction)
  • Decision points replaced with lookup tables: N

Cross-Model Performance

| Model | Assertions Passed | Weighted Score | Format Compliance |
| --- | --- | --- | --- |
| Opus | 7/7 | 1.00 | 100% |
| Sonnet | 6/7 | 0.92 | 100% |
| Haiku | 5/7 | 0.85 | 85% |

Changes Made

  1. [Section] "Analyze complexity" → explicit 5-item checklist
  2. [Section] "Apply formatting" → fixed markdown table template ...

Recommendation

[SHIP | ITERATE | MANUAL_REVIEW_NEEDED]

Error Handling

| Error | Resolution |
| --- | --- |
| Source skill scores below 70% | Refuse distillation; recommend evolution via test-engineer |
| No execution traces available | Generate synthetic tasks and collect traces before proceeding |
| Target model fails all assertions | Skill may be too complex for target model; report with detail |
| Distilled skill longer than source | Review distillation; patterns may need consolidation |

Limitations

  • Cannot distill skills that rely on open-ended adaptive reasoning at many decision points or multi-turn reasoning
  • Visual/interactive skills (HTML generation, browser automation) may not distill well
  • Distillation optimizes for determinism, not creativity — skills requiring open-ended generation (writing, brainstorming) are poor candidates
  • Trace collection requires actual model execution, incurring API costs