# bmad-distillator

Distillator: A Document Distillation Engine

## Overview

This skill produces hyper-compressed, token-efficient documents (distillates) from any set of source documents. A distillate preserves every fact, decision, constraint, and relationship from the sources while stripping all overhead that humans need and LLMs don't. Act as an information extraction and compression specialist. The output is a single dense document (or semantically-split set) that a downstream LLM workflow can consume as sole context input without information loss.
This is a compression task, not a summarization task. Summaries are lossy. Distillates are lossless compression optimized for LLM consumption.

## On Activation

  1. Validate inputs. The caller must provide:
    • `source_documents` (required) — One or more file paths, folder paths, or glob patterns to distill
    • `downstream_consumer` (optional) — What workflow/agent consumes this distillate (e.g., "PRD creation", "architecture design"). When provided, use it to judge signal vs. noise. When omitted, preserve everything.
    • `token_budget` (optional) — Approximate target size. When provided and the distillate would exceed it, trigger semantic splitting.
    • `output_path` (optional) — Where to save. When omitted, save adjacent to the primary source document with a `-distillate.md` suffix.
    • `--validate` (flag) — Run a round-trip reconstruction test after producing the distillate.
  2. Route — proceed to Stage 1.
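The output-path defaulting rule above can be sketched in Python. The helper name is illustrative, not part of the skill's API, and it assumes the `-distillate.md` suffix replaces the source file's extension:

```python
from pathlib import Path

def default_output_path(primary_source: str) -> str:
    # When output_path is omitted: save next to the primary source
    # document, swapping its extension for a -distillate.md suffix.
    src = Path(primary_source)
    return str(src.with_name(src.stem + "-distillate.md"))
```

Under that assumption, `docs/prd.md` maps to `docs/prd-distillate.md`.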

## Stages

| # | Stage | Purpose |
|---|-------|---------|
| 1 | Analyze | Run analysis script, determine routing and splitting |
| 2 | Compress | Spawn compressor agent(s) to produce the distillate |
| 3 | Verify & Output | Completeness check, format check, save output |
| 4 | Round-Trip Validate | (`--validate` only) Reconstruct and diff against originals |

## Stage 1: Analyze

Run `scripts/analyze_sources.py --help`, then run it with the source paths. Use its routing recommendation and grouping output to drive Stage 2. Do NOT read the source documents yourself.

## Stage 2: Compress

Single mode (routing = `"single"`, ≤3 files, ≤15K estimated tokens):
Spawn one subagent using `agents/distillate-compressor.md` with all source file paths.

Fan-out mode (routing = `"fan-out"`):
  1. Spawn one compressor subagent per group from the analysis output. Each compressor receives only its group's file paths and produces an intermediate distillate.
  2. After all compressors return, spawn one final merge compressor subagent using `agents/distillate-compressor.md`. Pass it the intermediate distillate contents as its input (not the original files). Its job is cross-group deduplication, thematic regrouping, and final compression.
  3. Clean up intermediate distillate content (it exists only in memory, not saved to disk).

Graceful degradation: If subagent spawning is unavailable, read the source documents and perform the compression work directly, using the same instructions from `agents/distillate-compressor.md`. For fan-out, process groups sequentially, then merge.

The compressor returns a structured JSON result containing the distillate content, source headings, named entities, and a token estimate.
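The routing decision itself comes from `scripts/analyze_sources.py` in Stage 1; a minimal sketch of the thresholds as stated above (the function name is illustrative):

```python
def choose_route(file_count: int, estimated_tokens: int) -> str:
    # Single mode: few enough files and a small enough total for one
    # compressor pass. Anything larger fans out to per-group agents.
    if file_count <= 3 and estimated_tokens <= 15_000:
        return "single"
    return "fan-out"
```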

## Stage 3: Verify & Output

After the compressor (or merge compressor) returns:
  1. Completeness check. Using the headings and named entities list returned by the compressor, verify each appears in the distillate content. If gaps are found, send them back to the compressor for a targeted fix pass — not a full recompression. Limit to 2 fix passes maximum.
  2. Format check. Verify the output follows distillate format rules:
    • No prose paragraphs (only bullets)
    • No decorative formatting
    • No repeated information
    • Each bullet is self-contained
    • Themes are clearly delineated with `##` headings
  3. Determine output format. Using the split prediction from Stage 1 and the actual distillate size:
    Single distillate (≤~5,000 tokens, or token_budget not exceeded):
    Save as a single file with frontmatter:
    ```yaml
    ---
    type: bmad-distillate
    sources:
      - "{relative path to source file 1}"
      - "{relative path to source file 2}"
    downstream_consumer: "{consumer or 'general'}"
    created: "{date}"
    token_estimate: {approximate token count}
    parts: 1
    ---
    ```
    Split distillate (>~5,000 tokens, or token_budget requires it):
    Create a folder `{base-name}-distillate/` containing:
    ```
    {base-name}-distillate/
    ├── _index.md           # Orientation, cross-cutting items, section manifest
    ├── 01-{topic-slug}.md  # Self-contained section
    ├── 02-{topic-slug}.md
    └── 03-{topic-slug}.md
    ```
    The `_index.md` contains:
    • Frontmatter with sources (relative paths from the distillate folder to the originals)
    • A 3-5 bullet orientation (what was distilled, from what)
    • Section manifest: each section's filename + a 1-line description
    • Cross-cutting items that span multiple sections
    Each section file is self-contained — loadable independently. Include a 1-line context header: "This section covers [topic]. Part N of M."
    Source paths in frontmatter must be relative to the distillate's location.
  4. Measure the distillate. Run `scripts/analyze_sources.py` on the final distillate file(s) to get accurate token counts for the output. Use the `total_estimated_tokens` from this analysis as `distillate_total_tokens`.
  5. Report results. Always return structured JSON output:
    ```json
    {
      "status": "complete",
      "distillate": "{path or folder path}",
      "section_distillates": ["{path1}", "{path2}"] or null,
      "source_total_tokens": N,
      "distillate_total_tokens": N,
      "compression_ratio": "X:1",
      "source_documents": ["{path1}", "{path2}"],
      "completeness_check": "pass" or "pass_with_additions"
    }
    ```
    Where `source_total_tokens` is from the Stage 1 analysis and `distillate_total_tokens` is from step 4. The `compression_ratio` is `source_total_tokens / distillate_total_tokens`, formatted as "X:1" (e.g., "3.2:1").
  6. If the `--validate` flag was set, proceed to Stage 4. Otherwise, done.
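Two of the steps above are mechanical enough to sketch. The substring-based completeness test is a simplification of the check described in step 1, and the function names are illustrative:

```python
def completeness_gaps(distillate: str, headings: list[str],
                      entities: list[str]) -> list[str]:
    # Manifest items that never appear verbatim in the distillate;
    # each gap goes back to the compressor for a targeted fix pass.
    return [item for item in headings + entities if item not in distillate]

def compression_ratio(source_tokens: int, distillate_tokens: int) -> str:
    # source_total_tokens / distillate_total_tokens, formatted "X:1".
    return f"{source_tokens / distillate_tokens:.1f}:1"
```

For example, 3,200 source tokens compressed to 1,000 reports as "3.2:1".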

## Stage 4: Round-Trip Validation (--validate only)

This stage proves the distillate is lossless by reconstructing the source documents from the distillate alone. Use it for critical documents where information loss is unacceptable, or as a quality gate for high-stakes downstream workflows. Not for routine use — it adds significant token cost.
  1. Spawn the reconstructor agent using `agents/round-trip-reconstructor.md`. Pass it ONLY the distillate file path (or the `_index.md` path for split distillates) — it must NOT have access to the original source documents.
    For split distillates, spawn one reconstructor per section in parallel. Each receives its section file plus the `_index.md` for cross-cutting context.
    Graceful degradation: If subagent spawning is unavailable, this stage cannot be performed by the main agent (it has already seen the originals). Report that round-trip validation requires subagent support, and skip.
  2. Receive reconstructions. The reconstructor returns reconstruction file paths saved adjacent to the distillate.
  3. Perform a semantic diff. Read both the original source documents and the reconstructions. For each section of the original, assess:
    • Is the core information present in the reconstruction?
    • Are specific details preserved (numbers, names, decisions)?
    • Are relationships and rationale intact?
    • Did the reconstruction add anything not in the original? (This indicates hallucination filling gaps.)
  4. Produce a validation report, saved adjacent to the distillate with a `-validation-report.md` suffix:
    ```markdown
    ---
    type: distillate-validation
    distillate: "{distillate path}"
    sources: ["{source paths}"]
    created: "{date}"
    ---

    ## Validation Summary
    - Status: PASS | PASS_WITH_WARNINGS | FAIL
    - Information preserved: {percentage estimate}
    - Gaps found: {count}
    - Hallucinations detected: {count}

    ## Gaps (information in originals but missing from reconstruction)
    - {gap description} — Source: {which original}, Section: {where}

    ## Hallucinations (information in reconstruction not traceable to originals)
    - {hallucination description} — appears to fill gap in: {section}

    ## Possible Gap Markers (flagged by reconstructor)
    - {marker description}
    ```
  5. If gaps are found, offer to run a targeted fix pass on the distillate — adding the missing information without full recompression. Limit to 2 fix passes maximum.
  6. Clean up — delete the temporary reconstruction files after the report is generated.
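The report's Status field is a judgment call the diffing agent makes; one plausible mapping from the gap and hallucination counts is sketched below. The thresholds are an assumption, not part of this spec:

```python
def validation_status(gap_count: int, hallucination_count: int) -> str:
    # Hypothetical thresholds: any hallucination fails outright, a few
    # small gaps only warn (they are fixable in a targeted fix pass).
    if hallucination_count > 0:
        return "FAIL"
    if gap_count == 0:
        return "PASS"
    return "PASS_WITH_WARNINGS" if gap_count <= 2 else "FAIL"
```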