skill-distiller
Skill Distiller
Transform skills authored for high-capability models (Opus) into deterministic workflows
that execute reliably on lower-cost models (Sonnet, Haiku). The core insight from
EvoSkills: skills encode reusable task structure, not model-specific artifacts. A skill
evolved on Opus transfers to other models with gains of 35-45 percentage points, but only
when the instructions are deterministic enough that lower-capability models can follow
them without improvising.
Reference Files
| File | Contents | Load When |
|---|---|---|
| | Pattern catalog for converting reasoning to rules | Always |
Prerequisites
- The source skill must exist and pass `package-evaluator` at >= 70%
- Access to both the source model (Opus) and target model (Haiku/Sonnet) for validation
- The `surrogate-verifier` skill for cross-model assertion checking
Workflow
Phase 1: Complexity Analysis
Score each section of the source SKILL.md for reasoning difficulty:
| Complexity Signal | Score | Distillation Action |
|---|---|---|
| Decision tree with 3+ branches | HIGH | Convert to explicit if/then lookup table |
| "Use judgment" or "consider context" | HIGH | Replace with concrete heuristic rules |
| Multi-step inference chain | HIGH | Break into numbered atomic steps |
| Reference to domain expertise | MED | Add explicit reference file with knowledge |
| Clear enumerated steps | LOW | Keep as-is |
| Concrete examples with expected output | LOW | Keep as-is |
Produce a complexity map: section name -> complexity score -> planned action.
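A minimal sketch of how the complexity map could be assembled, assuming the source SKILL.md has already been split into named sections. The keyword detectors in `SIGNALS` are hypothetical stand-ins covering only two of the signals in the table; real scoring is a judgment made while reading each section.

```python
import re

# Hypothetical detectors: (pattern, score, distillation action).
# Checked in order, so a HIGH signal wins over a MED one.
SIGNALS = [
    (re.compile(r"use judgment|consider (the )?context", re.I), "HIGH",
     "Replace with concrete heuristic rules"),
    (re.compile(r"best practices|domain expertise", re.I), "MED",
     "Add explicit reference file with knowledge"),
]


def score_section(text: str) -> tuple[str, str]:
    """Return (score, action) for one section; LOW when no signal matches."""
    for pattern, score, action in SIGNALS:
        if pattern.search(text):
            return score, action
    return "LOW", "Keep as-is"


def complexity_map(sections: dict[str, str]) -> dict[str, tuple[str, str]]:
    """Map section name -> (complexity score, planned action)."""
    return {name: score_section(text) for name, text in sections.items()}


example = complexity_map({
    "Formatting": "Use judgment to pick a layout.",
    "Steps": "1. Read the file. 2. Write the output.",
})
```

Here `example["Formatting"]` is scored HIGH and `example["Steps"]` is scored LOW.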
Phase 2: Trace Collection
Execute the source skill with Opus on 5 representative tasks:
- Select tasks from `evals/cases.yaml` (positive cases) or generate new ones
- For each task, capture the full execution trace:
  - Tool calls made (which tools, in what order)
  - Intermediate reasoning visible in output
  - Final output structure and content
  - Time taken and token usage
- Store traces as structured data for pattern extraction
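One possible shape for a stored trace is sketched below. The `Trace` and `ToolCall` classes and their field names are assumptions made for illustration; the skill does not prescribe a schema, only that the four kinds of information above are captured.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ToolCall:
    tool: str        # which tool was invoked, e.g. "Read"
    arguments: dict  # arguments passed to the tool


@dataclass
class Trace:
    task_id: str
    tool_calls: list[ToolCall] = field(default_factory=list)  # kept in call order
    reasoning: str = ""      # intermediate reasoning visible in output
    final_output: str = ""   # final output structure and content
    seconds: float = 0.0     # time taken
    tokens: int = 0          # token usage


trace = Trace(
    task_id="case-01",
    tool_calls=[ToolCall("Read", {"path": "input.txt"})],
    final_output="| A | B |",
    seconds=12.5,
    tokens=3400,
)

# asdict() recurses into nested dataclasses, so the result is JSON-serializable.
serialized = json.dumps(asdict(trace), indent=2)
```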
Phase 3: Pattern Extraction
From the collected traces, extract deterministic patterns:
- Decision paths — For each HIGH-complexity section, find the actual decisions Opus made across the 5 tasks. If Opus chose the same path in 4/5 cases, that path becomes the default rule
- Lookup tables — Where Opus applied domain knowledge, build explicit lookup tables (e.g., "if input contains SQL, use these patterns; if input contains Python, use those")
- Concrete examples — Extract representative input/output pairs from traces to serve as few-shot examples in the distilled skill
- Tool sequences — Identify the common tool invocation pattern and make it explicit ("Step 1: Read the file. Step 2: Grep for pattern X. Step 3: Write output.")
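The 4-of-5 rule for decision paths and the search for a common tool sequence can be sketched as simple frequency counts. The function names and the string labels for paths are hypothetical; the sketch assumes each trace has already been reduced to one decision label per section and one tuple of tool names.

```python
from collections import Counter
from typing import Optional


def default_rule(decisions: list[str], threshold: int = 4) -> Optional[str]:
    """Return the path chosen in at least `threshold` traces, else None."""
    if not decisions:
        return None
    path, count = Counter(decisions).most_common(1)[0]
    return path if count >= threshold else None


def common_tool_sequence(sequences: list[tuple[str, ...]]) -> tuple[str, ...]:
    """Return the most frequent tool-call sequence across traces."""
    return Counter(sequences).most_common(1)[0][0]


# Four of five traces chose "table", so it becomes the default rule.
agreed = default_rule(["table", "table", "table", "table", "list"])
# A 3/2 split does not reach the threshold and needs manual handling.
split = default_rule(["table", "table", "table", "list", "list"])

sequence = common_tool_sequence(
    [("Read", "Grep", "Write")] * 4 + [("Read", "Write")]
)
```

A `None` result flags a section where Opus did not behave consistently, which is a hint that a single default rule would misrepresent the traces.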
Phase 4: Distilled Rewrite
Rewrite the SKILL.md applying all distillation actions from Phase 1:
| Source Pattern | Distilled Replacement |
|---|---|
| "Analyze the code and determine..." | "Check for these 5 specific patterns: [list]" |
| "Use appropriate formatting" | "Output as a markdown table with columns: [A, B, C]" |
| "Consider the context to decide..." | "If [condition A]: do X. If [condition B]: do Y. Default: Z" |
| "Apply best practices for..." | Reference file with explicit best practices enumerated |
| Multi-paragraph reasoning instruction | Numbered step list with single-sentence steps |
Rules for the rewrite:
- Every instruction must be actionable by a model with no domain expertise
- No step should require inference — each step's input and output must be explicit
- Replace all "consider", "analyze", "determine" verbs with "check", "count", "list", "output"
- Add concrete examples for any step that could be ambiguous
- Keep the SKILL.md under 500 lines (distillation should reduce, not expand)
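Two of these rules, the banned verbs and the 500-line limit, are mechanical enough to check automatically. The following is a rough sketch with a hypothetical `lint_distilled_skill` helper. It matches only the base and "-s" forms of each verb, so that words like "deterministic" are not flagged, and it cannot judge the remaining rules.

```python
import re

MAX_LINES = 500
# Base and third-person forms only; word boundaries avoid matching "deterministic".
VAGUE_VERB = re.compile(r"\b(consider|analyze|determine)s?\b", re.I)


def lint_distilled_skill(text: str) -> list[str]:
    """Return a list of rule violations found in a distilled SKILL.md."""
    problems = []
    lines = text.splitlines()
    if len(lines) >= MAX_LINES:
        problems.append(f"{len(lines)} lines; must be under {MAX_LINES}")
    for number, line in enumerate(lines, start=1):
        for match in VAGUE_VERB.finditer(line):
            verb = match.group(1).lower()
            problems.append(f"line {number}: vague verb '{verb}'")
    return problems


report = lint_distilled_skill("Analyze the code.\nCheck for pattern X.")
clean = lint_distilled_skill("Output a deterministic table.")
```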
Phase 5: Target Model Validation
Run the distilled skill on the target model (Haiku or Sonnet):
- Execute the same 5 tasks from Phase 2 with the distilled skill loaded
- Use the `surrogate-verifier` to generate assertions for each task output
- Compare pass rates:
| Metric | Source (Opus + original) | Target (Haiku + distilled) | Delta |
|---|---|---|---|
| Assertions passed | N/M | N/M | ± |
| Weighted score | X.XX | X.XX | ± |
| Output completeness | % | % | ± |
| Format compliance | % | % | ± |
- If target model score < 80% of source model score, iterate:
  - Identify which assertions the target model fails
  - Add more explicit instructions for those specific failure points
  - Re-run validation (max 3 iterations)
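The iteration rule can be sketched as a bounded loop. This is an illustration under stated assumptions: `run_target` and `revise` are hypothetical callables standing in for target-model execution and the manual rewrite step, and "max 3 iterations" is read here as at most three revise-and-rerun cycles after the first validation run.

```python
from typing import Callable


def needs_iteration(source_score: float, target_score: float,
                    ratio: float = 0.8) -> bool:
    """True when the target scores below 80% of the source score."""
    return target_score < ratio * source_score


def validate(
    skill: str,
    source_score: float,
    run_target: Callable[[str], tuple[float, list[str]]],
    revise: Callable[[str, list[str]], str],
    max_iterations: int = 3,
) -> tuple[str, float, int]:
    """Re-run validation until the threshold is met or iterations run out."""
    score, failed = run_target(skill)
    iterations = 0
    while needs_iteration(source_score, score) and iterations < max_iterations:
        # Add more explicit instructions for the failed assertions, then re-run.
        skill = revise(skill, failed)
        score, failed = run_target(skill)
        iterations += 1
    return skill, score, iterations
```

If the loop exits with the score still below the threshold, the skill is a candidate for the ITERATE or MANUAL_REVIEW_NEEDED recommendation rather than SHIP.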
Phase 6: Cross-Model Report
Produce the final comparison:
Skill Distillation Report: <skill-name>
Complexity Reduction
- Sections distilled: N/M (HIGH → LOW)
- Instruction word count: original X → distilled Y (Z% reduction)
- Decision points replaced with lookup tables: N
Cross-Model Performance
| Model | Assertions Passed | Weighted Score | Format Compliance |
|---|---|---|---|
| Opus | 7/7 | 1.00 | 100% |
| Sonnet | 6/7 | 0.92 | 100% |
| Haiku | 5/7 | 0.85 | 85% |
Changes Made
- [Section] "Analyze complexity" → explicit 5-item checklist
- [Section] "Apply formatting" → fixed markdown table template ...
Recommendation
[SHIP | ITERATE | MANUAL_REVIEW_NEEDED]
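The word-count line of the report template can be filled in mechanically. The helper below is a hypothetical sketch that counts whitespace-separated words, which is only one of several reasonable definitions of "instruction word count".

```python
def word_count_line(original: str, distilled: str) -> str:
    """Format the 'Instruction word count' line of the report."""
    x = len(original.split())
    y = len(distilled.split())
    # Guard against an empty source to avoid division by zero.
    z = round(100 * (x - y) / x) if x else 0
    return f"- Instruction word count: original {x} → distilled {y} ({z}% reduction)"


line = word_count_line(
    "Analyze the code and determine which patterns apply to it",  # 10 words
    "Check for patterns A, B",                                     # 5 words
)
```

A negative percentage means the distilled skill grew, which corresponds to the "Distilled skill longer than source" case in the error table.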
Error Handling
| Error | Resolution |
|---|---|
| Source skill scores below 70% | Refuse distillation; recommend evolution via test-engineer |
| No execution traces available | Generate synthetic tasks and collect traces before proceeding |
| Target model fails all assertions | Skill may be too complex for target model; report with detail |
| Distilled skill longer than source | Review distillation; patterns may need consolidation |
Limitations
- Cannot distill skills that rely on multi-turn reasoning or on open-ended adaptive reasoning at many decision points
- Visual/interactive skills (HTML generation, browser automation) may not distill well
- Distillation optimizes for determinism, not creativity — skills requiring open-ended generation (writing, brainstorming) are poor candidates
- Trace collection requires actual model execution, incurring API costs