# Very Long Text Summarization
Processes texts too large for a single context window using hierarchical multi-pass extraction with armies of cheap models. Produces structured knowledge maps, indexed summaries, and skill drafts — not just prose compression.
## When to Use
✅ Use for:
- Professional handbooks and textbooks (100-1000+ pages)
- Career biographies and memoirs (extracting expertise patterns)
- Large codebases (architecture-level understanding)
- Research paper collections (synthesizing findings across papers)
- Any text exceeding a single context window (~100K tokens)
❌ NOT for:
- Short documents (<10 pages) — just read them directly
- Real-time conversation summarization (use auto-compact patterns)
- Code documentation generation (use `technical-writer`)
- Simple TL;DR requests (not worth the multi-pass overhead)
## Architecture: Three-Pass Hierarchical Extraction
```mermaid
flowchart TD
    D[Document] --> C[Chunk into segments]
    C --> P1["Pass 1: Haiku army\n(parallel extraction)"]
    P1 --> I[Intermediate summaries]
    I --> P2["Pass 2: Sonnet synthesis\n(merge + structure)"]
    P2 --> S[Structured knowledge map]
    S --> P3["Pass 3: Opus refinement\n(optional, for skill drafts)"]
    P3 --> O[Final output]
```

### Pass 1: Chunked Extraction (Haiku Army)
Split the document into overlapping chunks (~4K tokens each, with a 500-token overlap). Deploy one Haiku call per chunk in parallel. Each extracts:
```yaml
extraction_template:
  summary: "2-3 sentence summary of this section"
  key_claims: ["list of factual claims or assertions"]
  processes: ["any step-by-step procedures described"]
  decisions: ["any decision points or heuristics mentioned"]
  failures: ["any failures, mistakes, or anti-patterns described"]
  aha_moments: ["any insights, realizations, or conceptual breakthroughs"]
  metaphors: ["any metaphors or mental models used"]
  temporal: ["any 'things changed when...' or 'before X, after Y' patterns"]
  quotes: ["notable direct quotes worth preserving"]
  references: ["any citations, links, or cross-references"]
```

Cost: ~$0.001 per chunk. A 300-page book (~150K tokens) = ~38 chunks = ~$0.04 total for Pass 1.
Parallelism: All chunks run simultaneously. A 300-page book completes Pass 1 in ~3 seconds (wall clock), not 3 minutes.
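The fan-out can be sketched with a thread pool. This is a hedged sketch, not the skill's actual implementation: `extract_chunk` is a hypothetical stand-in for a real Haiku API call that would return the extraction template as a dict.

```python
from concurrent.futures import ThreadPoolExecutor

def extract_chunk(chunk_id: int, chunk: str) -> dict:
    # Hypothetical stand-in: a real version would send the extraction_template
    # prompt plus this chunk to a Haiku model and parse the YAML reply.
    return {"chunk_id": chunk_id, "summary": chunk[:80]}

def pass1(chunks: list[str]) -> list[dict]:
    """Run one extraction call per chunk, all in flight simultaneously."""
    # map() preserves chunk order, so results line up with chunk_ids
    with ThreadPoolExecutor(max_workers=max(1, len(chunks))) as pool:
        return list(pool.map(extract_chunk, range(len(chunks)), chunks))
```

Because every call is independent, wall-clock time stays close to a single call regardless of document size.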
### Pass 2: Synthesis (Sonnet)
Feed all Pass 1 extractions into one or more Sonnet calls. Sonnet merges, deduplicates, and structures the knowledge.
```yaml
synthesis_template:
  document_summary: "1-2 paragraph executive summary"
  knowledge_map:
    core_concepts:
      - concept: "name"
        definition: "what it means in this domain"
        relationships: ["connects to concept X because..."]
    processes:
      - name: "process name"
        steps: ["ordered steps"]
        decision_points: ["where choices are made"]
        common_mistakes: ["what goes wrong"]
    expertise_patterns:
      - pattern: "what experts do differently"
        novice_mistake: "what novices do instead"
        aha_moment: "the insight that bridges the gap"
    temporal_evolution:
      - period: "date range"
        paradigm: "what was believed/practiced"
        change_trigger: "what caused the shift"
    key_metaphors:
      - metaphor: "how practitioners think about X"
        maps_to: "the underlying structure it represents"
  index:
    - topic: "topic name"
      chunk_ids: [3, 7, 12]  # Which original chunks cover this
      summary: "1 sentence"
```

Cost: ~$0.02-0.05 depending on extraction volume. The index preserves traceability back to specific book sections.
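The index block is what makes the output queryable. A minimal lookup, over a hypothetical knowledge map shaped like the template above:

```python
# Hypothetical Pass 2 output, reduced to its index block
knowledge_map = {
    "index": [
        {"topic": "chunking", "chunk_ids": [3, 7, 12], "summary": "How to segment source text"},
        {"topic": "cost model", "chunk_ids": [20], "summary": "Per-pass pricing"},
    ]
}

def chunks_for_topic(km: dict, topic: str) -> list[int]:
    """Resolve a topic back to the source chunks that cover it."""
    for entry in km["index"]:
        if entry["topic"] == topic:
            return entry["chunk_ids"]
    return []
```

With the returned `chunk_ids`, a claim in the synthesis can be checked against the exact source segments it came from.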
### Pass 3: Refinement (Opus, Optional)
For skill-draft output mode: Opus takes the knowledge map and produces a SKILL.md following the skill-architect template. This is the "crystallize skill from handbook" pipeline.
Cost: ~$0.10. Only run when the output is a skill draft.
## Chunking Strategy
### Semantic Chunking (Preferred)
Split on document structure — chapter boundaries, section headings, paragraph breaks. Preserves semantic coherence within each chunk.
```python
import re

def count_tokens(text: str) -> int:
    # Stand-in tokenizer: whitespace-delimited words; swap in a real one in production
    return len(text.split())

def split_on_headings(text: str) -> list[str]:
    # Split before markdown headings (##, ###, etc.) so each heading opens a section
    return [s for s in re.split(r"(?m)^(?=#{2,} )", text) if s]

def get_last_n_tokens(text: str, n: int) -> str:
    return " ".join(text.split()[-n:])

def semantic_chunk(text: str, max_tokens: int = 4000, overlap: int = 500) -> list[str]:
    """Split text on structural boundaries with overlap."""
    # Split on headings, then merge short sections
    sections = split_on_headings(text)
    chunks: list[str] = []
    current = ""
    for section in sections:
        if current and count_tokens(current + section) > max_tokens:
            chunks.append(current)
            # Overlap: keep the last ~500 tokens of the previous chunk
            current = get_last_n_tokens(current, overlap) + "\n" + section
        else:
            current += section
    if current:
        chunks.append(current)
    return chunks
```

### Fixed-Size Chunking (Fallback)
For unstructured text without headings. Split on paragraph boundaries, targeting ~4K tokens with 500-token overlap.
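A minimal sketch of the fallback, again using whitespace-delimited words as a stand-in for real tokens:

```python
def fixed_chunk(text: str, max_tokens: int = 4000, overlap: int = 500) -> list[str]:
    """Greedily pack paragraphs into ~max_tokens-word chunks with an overlap tail."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []  # running chunk as a list of words
    for para in paragraphs:
        words = para.split()
        if current and len(current) + len(words) > max_tokens:
            chunks.append(" ".join(current))
            current = current[-overlap:]  # carry the tail into the next chunk
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each chunk after the first starts with the previous chunk's last `overlap` tokens, which is what keeps boundary-spanning concepts extractable.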
### Why Overlap?
Concepts that span chunk boundaries need to appear in both chunks to be extracted. Without overlap, you lose cross-boundary knowledge.
## Output Modes
### Mode 1: Summary
Produces a structured summary with executive overview, key concepts, and index.
Use for: Quick understanding of a long document. Reading a handbook before a meeting.
### Mode 2: Knowledge Map
Produces the full knowledge map: concepts, processes, expertise patterns, temporal evolution, metaphors. Machine-readable (YAML/JSON) for downstream processing.
Use for: Feeding into skill creation, domain meta-skill development, or cross-document analysis.
### Mode 3: Skill Draft
Produces a SKILL.md following the skill-architect template, with the handbook's expertise encoded as decision trees, anti-patterns, and shibboleths.
Use for: Converting professional handbooks into Claude skills. The KE pipeline.
## Cost Model
| Document Size | Pages | Chunks | Pass 1 (Haiku) | Pass 2 (Sonnet) | Pass 3 (Opus) | Total |
|---|---|---|---|---|---|---|
| Article | 10 | 4 | $0.004 | $0.01 | — | $0.014 |
| Chapter | 30 | 10 | $0.01 | $0.02 | — | $0.03 |
| Handbook | 300 | 38 | $0.04 | $0.05 | $0.10 | $0.19 |
| Textbook | 800 | 100 | $0.10 | $0.10 | $0.10 | $0.30 |
| Encyclopedia | 2000+ | 250+ | $0.25 | $0.20 | $0.10 | $0.55 |
Processing time is dominated by the longest single Haiku call (~2-3s). With full parallelism, even a 2000-page text completes Pass 1 in under 5 seconds.
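The Pass 1 column can be reproduced from the figures above (~4K-token chunks at ~$0.001 each). A small estimator, with chunk count approximated as total tokens / 4,000:

```python
import math

CHUNK_TOKENS = 4000      # chunk size from the chunking strategy
HAIKU_PER_CHUNK = 0.001  # Pass 1 price per chunk quoted above

def pass1_estimate(total_tokens: int) -> tuple[int, float]:
    """Return (chunk count, Pass 1 cost in dollars) for a document."""
    chunks = math.ceil(total_tokens / CHUNK_TOKENS)
    return chunks, chunks * HAIKU_PER_CHUNK
```

For a 300-page handbook (~150K tokens) this gives 38 chunks at roughly $0.04, matching the table row.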
## Anti-Patterns
### Single-Pass Summarization
Wrong: Feed the entire document into one Opus call.
Why: Exceeds context window, or attention dilution produces weak extraction on such long input.
Right: Hierarchical multi-pass. Cheap parallel extraction → expensive synthesis.
### Summarization Without Structure
Wrong: Produce a 2-paragraph prose summary of a 300-page handbook.
Why: The structure IS the knowledge. A flat summary loses the decision trees, failure patterns, and temporal evolution that make skills valuable.
Right: Structured knowledge map with indexed access back to source sections.
### Skipping Overlap
Wrong: Chunk on hard boundaries with no overlap.
Why: Cross-boundary concepts get split and lost.
Right: 500-token overlap between chunks. Each chunk includes the tail of the previous chunk.
### Ignoring Source Traceability
Wrong: Produce extractions without tracking which chunk they came from.
Why: When a claim seems wrong, you need to verify it against the source. Without traceability, you can't.
Right: Every extraction carries a `chunk_id` linking back to the original text segment.