very-long-text-summarization


Very Long Text Summarization


Processes texts too large for a single context window using hierarchical multi-pass extraction with armies of cheap models. Produces structured knowledge maps, indexed summaries, and skill drafts — not just prose compression.


When to Use


Use for:
  • Professional handbooks and textbooks (100-1000+ pages)
  • Career biographies and memoirs (extracting expertise patterns)
  • Large codebases (architecture-level understanding)
  • Research paper collections (synthesizing findings across papers)
  • Any text exceeding a single context window (~100K tokens)
NOT for:
  • Short documents (<10 pages) — just read them directly
  • Real-time conversation summarization (use auto-compact patterns)
  • Code documentation generation (use `technical-writer`)
  • Simple TL;DR requests (not worth the multi-pass overhead)


Architecture: Three-Pass Hierarchical Extraction


```mermaid
flowchart TD
  D[Document] --> C[Chunk into segments]
  C --> P1["Pass 1: Haiku army\n(parallel extraction)"]
  P1 --> I[Intermediate summaries]
  I --> P2["Pass 2: Sonnet synthesis\n(merge + structure)"]
  P2 --> S[Structured knowledge map]
  S --> P3["Pass 3: Opus refinement\n(optional, for skill drafts)"]
  P3 --> O[Final output]
```

Pass 1: Chunked Extraction (Haiku Army)


Split the document into overlapping chunks (~4K tokens each, 500-token overlap). Deploy one Haiku call per chunk in parallel. Each call extracts:
```yaml
extraction_template:
  summary: "2-3 sentence summary of this section"
  key_claims: ["list of factual claims or assertions"]
  processes: ["any step-by-step procedures described"]
  decisions: ["any decision points or heuristics mentioned"]
  failures: ["any failures, mistakes, or anti-patterns described"]
  aha_moments: ["any insights, realizations, or conceptual breakthroughs"]
  metaphors: ["any metaphors or mental models used"]
  temporal: ["any 'things changed when...' or 'before X, after Y' patterns"]
  quotes: ["notable direct quotes worth preserving"]
  references: ["any citations, links, or cross-references"]
```
Cost: ~$0.001 per chunk. A 300-page book (~150K tokens) = ~38 chunks = ~$0.04 total for Pass 1.
Parallelism: All chunks run simultaneously. A 300-page book completes Pass 1 in ~3 seconds (wall clock), not 3 minutes.
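
A minimal sketch of the fan-out using the Anthropic Python SDK. The prompt wording and the model alias are illustrative assumptions, not part of the pipeline spec; `semantic_chunk` is the chunker defined under Chunking Strategy below.
```python
import asyncio
from anthropic import AsyncAnthropic

client = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment

# Assumption: in practice the full extraction_template is embedded in this prompt
EXTRACTION_PROMPT = (
    "Extract the extraction_template fields (summary, key_claims, processes, "
    "decisions, failures, aha_moments, metaphors, temporal, quotes, references) "
    "from this text, as YAML:\n\n{chunk}"
)

async def extract_chunk(chunk: str) -> str:
    """One cheap Haiku call per chunk; returns the raw YAML extraction."""
    msg = await client.messages.create(
        model="claude-3-5-haiku-latest",  # assumed alias; any cheap, fast model works
        max_tokens=1024,
        messages=[{"role": "user", "content": EXTRACTION_PROMPT.format(chunk=chunk)}],
    )
    return msg.content[0].text

async def pass_one(chunks: list[str]) -> list[str]:
    """Run all chunk extractions concurrently; wall clock ~= the slowest single call."""
    return await asyncio.gather(*(extract_chunk(c) for c in chunks))

# usage: extractions = asyncio.run(pass_one(semantic_chunk(document)))
```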

Pass 2: Synthesis (Sonnet)


Feed all Pass 1 extractions into one or more Sonnet calls. Sonnet merges, deduplicates, and structures the knowledge.
```yaml
synthesis_template:
  document_summary: "1-2 paragraph executive summary"

  knowledge_map:
    core_concepts:
      - concept: "name"
        definition: "what it means in this domain"
        relationships: ["connects to concept X because..."]

    processes:
      - name: "process name"
        steps: ["ordered steps"]
        decision_points: ["where choices are made"]
        common_mistakes: ["what goes wrong"]

    expertise_patterns:
      - pattern: "what experts do differently"
        novice_mistake: "what novices do instead"
        aha_moment: "the insight that bridges the gap"

    temporal_evolution:
      - period: "date range"
        paradigm: "what was believed/practiced"
        change_trigger: "what caused the shift"

    key_metaphors:
      - metaphor: "how practitioners think about X"
        maps_to: "the underlying structure it represents"

  index:
    - topic: "topic name"
      chunk_ids: [3, 7, 12]  # Which original chunks cover this
      summary: "1 sentence"
```
Cost: ~$0.02-0.05 depending on extraction volume. The index preserves traceability back to specific book sections.
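
A sketch of the synthesis call under the same assumptions (model alias, prompt wording). Numbering the extractions is what lets the index record `chunk_ids`:
```python
from anthropic import Anthropic

def pass_two(extractions: list[str]) -> str:
    """One Sonnet call that merges, deduplicates, and structures Pass 1 output."""
    client = Anthropic()
    # Tag each extraction with its chunk_id so the index can trace back to sources
    numbered = "\n\n".join(f"--- chunk {i} ---\n{e}" for i, e in enumerate(extractions))
    msg = client.messages.create(
        model="claude-3-5-sonnet-latest",  # assumed alias for the current Sonnet
        max_tokens=4096,
        messages=[{
            "role": "user",
            # Assumption: in practice the synthesis_template itself is embedded here
            "content": "Merge these chunk extractions into the synthesis template. "
                       "Deduplicate claims and record source chunk ids in the index.\n\n"
                       + numbered,
        }],
    )
    return msg.content[0].text
```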

Pass 3: Refinement (Opus, Optional)


For skill-draft output mode: Opus takes the knowledge map and produces a SKILL.md following the skill-architect template. This is the "crystallize skill from handbook" pipeline.
Cost: ~$0.10. Only run when the output is a skill draft.

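A sketch of the optional refinement call. The skill-architect template is not reproduced in this document, so it is passed in as an argument and the prompt wording is a placeholder:
```python
from anthropic import Anthropic

def pass_three(knowledge_map_yaml: str, skill_architect_template: str) -> str:
    """Optional Opus call: knowledge map -> SKILL.md draft."""
    client = Anthropic()
    msg = client.messages.create(
        model="claude-3-opus-latest",  # assumed alias for the current Opus
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": f"Using this template:\n{skill_architect_template}\n\n"
                       f"Draft a SKILL.md from this knowledge map:\n{knowledge_map_yaml}",
        }],
    )
    return msg.content[0].text
```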

Chunking Strategy


Semantic Chunking (Preferred)


Split on document structure — chapter boundaries, section headings, paragraph breaks. Preserves semantic coherence within each chunk.
```python
import re

def split_on_headings(text: str) -> list[str]:
    """Split on markdown headings (##, ###, etc.), keeping each heading with its body."""
    parts = re.split(r"(?m)^(?=#{1,6} )", text)
    return [p for p in parts if p.strip()]

def count_tokens(text: str) -> int:
    """Whitespace-word approximation; swap in a real tokenizer if one is available."""
    return len(text.split())

def get_last_n_tokens(text: str, n: int) -> str:
    """Return roughly the last n tokens of text (same approximation)."""
    return " ".join(text.split()[-n:])

def semantic_chunk(text: str, max_tokens: int = 4000, overlap: int = 500) -> list[str]:
    """Split text on structural boundaries with overlap."""
    # Split on headings, then merge short sections
    sections = split_on_headings(text)

    chunks: list[str] = []
    current = ""

    for section in sections:
        if current and count_tokens(current + section) > max_tokens:
            chunks.append(current)
            # Overlap: keep the last ~500 tokens of the previous chunk
            current = get_last_n_tokens(current, overlap) + "\n" + section
        else:
            current += section

    if current:
        chunks.append(current)

    return chunks
```

Fixed-Size Chunking (Fallback)


For unstructured text without headings. Split on paragraph boundaries, targeting ~4K tokens with 500-token overlap.
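
A minimal fallback sketch, reusing the `count_tokens` and `get_last_n_tokens` helpers from the semantic chunker above:
```python
def fixed_size_chunk(text: str, max_tokens: int = 4000, overlap: int = 500) -> list[str]:
    """Fallback for unstructured text: split on paragraph boundaries with overlap."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]

    chunks: list[str] = []
    current = ""

    for para in paragraphs:
        if current and count_tokens(current + para) > max_tokens:
            chunks.append(current)
            # Carry the tail of the previous chunk into the next one
            current = get_last_n_tokens(current, overlap) + "\n\n" + para
        else:
            current = current + "\n\n" + para if current else para

    if current:
        chunks.append(current)

    return chunks
```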

Why Overlap?


A concept that spans a chunk boundary must appear whole in at least one chunk to be extracted; overlap guarantees that. Without it, cross-boundary knowledge is lost.


Output Modes


Mode 1: Summary


Produces a structured summary with executive overview, key concepts, and index.
Use for: Quick understanding of a long document. Reading a handbook before a meeting.

Mode 2: Knowledge Map


Produces the full knowledge map: concepts, processes, expertise patterns, temporal evolution, metaphors. Machine-readable (YAML/JSON) for downstream processing.
Use for: Feeding into skill creation, domain meta-skill development, or cross-document analysis.

Mode 3: Skill Draft


Produces a SKILL.md following the skill-architect template, with the handbook's expertise encoded as decision trees, anti-patterns, and shibboleths.
Use for: Converting professional handbooks into Claude skills. The KE pipeline.


Cost Model


| Document Size | Pages | Chunks | Pass 1 (Haiku) | Pass 2 (Sonnet) | Pass 3 (Opus) | Total |
|---|---|---|---|---|---|---|
| Article | 10 | 4 | $0.004 | $0.01 | | $0.014 |
| Chapter | 30 | 10 | $0.01 | $0.02 | | $0.03 |
| Handbook | 300 | 38 | $0.04 | $0.05 | $0.10 | $0.19 |
| Textbook | 800 | 100 | $0.10 | $0.10 | $0.10 | $0.30 |
| Encyclopedia | 2000+ | 250+ | $0.25 | $0.20 | $0.10 | $0.55 |
Processing time is dominated by the longest single Haiku call (~2-3s). With full parallelism, even a 2000-page text completes Pass 1 in under 5 seconds.

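The arithmetic is simple enough to sketch. The per-pass figures below are the rough numbers from the table, not authoritative pricing:
```python
def estimate_cost(total_tokens: int, chunk_tokens: int = 4000,
                  haiku_per_chunk: float = 0.001,
                  sonnet_synthesis: float = 0.05,
                  opus_refinement: float = 0.10,
                  skill_draft: bool = False) -> float:
    """Back-of-envelope cost for the three passes (figures from the table above)."""
    chunks = -(-total_tokens // chunk_tokens)  # ceiling division
    cost = chunks * haiku_per_chunk + sonnet_synthesis
    if skill_draft:
        cost += opus_refinement  # Pass 3 runs only for skill drafts
    return round(cost, 3)

# ~150K-token handbook: 38 chunks -> ~$0.04 + $0.05 + $0.10, i.e. ~$0.19
print(estimate_cost(150_000, skill_draft=True))
```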

Anti-Patterns


Single-Pass Summarization


Wrong: Feed the entire document into one Opus call.
Why: The document exceeds the context window, or attention dilution over such a long input produces weak extraction.
Right: Hierarchical multi-pass. Cheap parallel extraction → expensive synthesis.

Summarization Without Structure


Wrong: Produce a 2-paragraph prose summary of a 300-page handbook.
Why: The structure IS the knowledge. A flat summary loses the decision trees, failure patterns, and temporal evolution that make skills valuable.
Right: Structured knowledge map with indexed access back to source sections.

Skipping Overlap


Wrong: Chunk on hard boundaries with no overlap.
Why: Cross-boundary concepts get split and lost.
Right: 500-token overlap between chunks. Each chunk includes the tail of the previous chunk.

Ignoring Source Traceability


Wrong: Produce extractions without tracking which chunk they came from.
Why: When a claim seems wrong, you need to verify it against the source. Without traceability, you can't.
Right: Every extraction carries a `chunk_id` linking back to the original text segment.
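
One minimal way to carry that link, sketched with a hypothetical record type:
```python
from dataclasses import dataclass

@dataclass
class Extraction:
    chunk_id: int     # index into the original chunk list
    chunk_text: str   # source segment, kept for verification
    fields: dict      # parsed extraction_template YAML for this chunk
```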