algo-nlp-summarization

Text Summarization

Overview

Text summarization condenses documents while preserving key information. Extractive methods select and concatenate important sentences from the original; abstractive methods generate new text that paraphrases the content. Extractive is simpler and more faithful; abstractive is more fluent but may hallucinate.

When to Use

Trigger conditions:
  • Condensing long documents, reports, or article collections
  • Building automated summary pipelines for content curation
  • Comparing extractive vs abstractive approaches for a use case
When NOT to use:
  • When full document understanding is needed (summarization loses detail)
  • For structured data extraction (use NER or information extraction)

Algorithm

IRON LAW: Abstractive Summarization Can HALLUCINATE
Abstractive models may generate fluent text containing facts NOT in
the source. Always verify key claims in abstractive summaries against
the original document. For high-stakes use cases (legal, medical),
prefer extractive or use abstractive with factual consistency checking.

Phase 1: Input Validation

Determine: input length, target summary length (ratio or word count), single-doc vs multi-doc, domain. Gate: Input text available, target length defined.
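
As an illustration only, the Phase 1 gate might be expressed as a small check like the sketch below; the function and parameter names (validate_input, target_words) are hypothetical, not from any library.

```python
# Minimal input-validation gate sketch (names are illustrative).
def validate_input(text: str, target_words: int) -> None:
    if not text or not text.strip():
        raise ValueError("input text is empty")
    input_words = len(text.split())
    if target_words <= 0 or target_words >= input_words:
        raise ValueError("target length must be positive and shorter than the input")
```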

Phase 2: Core Algorithm

Extractive (TextRank/LexRank):
  1. Split document into sentences
  2. Build similarity graph (sentence nodes, cosine similarity edges)
  3. Run PageRank on sentence graph
  4. Select top-k sentences by rank, reorder by original position
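
A minimal sketch of these four steps, assuming scikit-learn and networkx are available; the regex sentence splitter and the textrank_summary name are illustrative, and a production pipeline would use nltk or spaCy for sentence splitting:

```python
# TextRank-style extractive summarization sketch.
import re
import numpy as np
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def textrank_summary(text: str, k: int = 5) -> str:
    # 1. Split document into sentences (naive splitter; swap in nltk/spaCy).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if len(sentences) <= k:
        return text  # already concise (see Edge Cases)
    # 2. Build similarity graph: TF-IDF sentence vectors, cosine-similarity edges.
    tfidf = TfidfVectorizer().fit_transform(sentences)
    sim = cosine_similarity(tfidf)
    np.fill_diagonal(sim, 0.0)  # drop self-loops
    graph = nx.from_numpy_array(sim)
    # 3. Run PageRank on the sentence graph (edge weights = similarity).
    scores = nx.pagerank(graph, weight="weight")
    # 4. Select top-k sentences by rank, reorder by original position.
    top = sorted(sorted(scores, key=scores.get, reverse=True)[:k])
    return " ".join(sentences[i] for i in top)
```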
Abstractive (transformer-based):
  1. Use pre-trained model (BART, T5, Pegasus)
  2. Encode input document (handle length limits with chunking if needed)
  3. Generate summary with beam search
  4. Post-process: check for repetition, factual consistency
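
A hedged sketch of the abstractive path using the Hugging Face transformers summarization pipeline; the model choice and generation parameters below are illustrative defaults, not requirements:

```python
# Abstractive summarization sketch with a pre-trained BART model.
from transformers import pipeline

document = "..."  # placeholder: the full input text (must fit the token limit)

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
result = summarizer(
    document,
    max_length=200,           # upper bound on summary tokens
    min_length=50,            # lower bound to avoid degenerate one-liners
    num_beams=4,              # beam search (step 3)
    no_repeat_ngram_size=3,   # suppress repeated phrases (step 4 / Gotchas)
    do_sample=False,
)
print(result[0]["summary_text"])
```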

Phase 3: Verification

Evaluate: ROUGE scores (ROUGE-1, ROUGE-2, ROUGE-L) against reference summaries. Manual check for factual accuracy and coherence. Gate: ROUGE scores reasonable for domain, no hallucinations in spot-check.
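
One way to compute these scores is Google's rouge-score package (pip install rouge-score); the two summary strings below are illustrative placeholders:

```python
# ROUGE evaluation sketch using the rouge-score package.
from rouge_score import rouge_scorer

reference_summary = "The company beat revenue expectations and raised guidance."
generated_summary = "Revenue beat expectations; guidance was raised."

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference_summary, generated_summary)
for name, s in scores.items():
    print(f"{name}: P={s.precision:.3f} R={s.recall:.3f} F1={s.fmeasure:.3f}")
```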

Phase 4: Output

Return summary with metadata.

Output Format

```json
{
  "summary": "The company reported Q4 revenue of...",
  "method": "extractive_textrank",
  "metadata": {"input_words": 2000, "summary_words": 200, "compression_ratio": 0.10, "sentences_selected": 5}
}
```
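
A minimal sketch of assembling this payload on the extractive path, reusing the hypothetical textrank_summary function and document placeholder from the sketches above:

```python
# Assemble the output payload with summary metadata.
summary = textrank_summary(document, k=5)
input_words = len(document.split())
summary_words = len(summary.split())
output = {
    "summary": summary,
    "method": "extractive_textrank",
    "metadata": {
        "input_words": input_words,
        "summary_words": summary_words,
        "compression_ratio": round(summary_words / input_words, 2),
        "sentences_selected": 5,
    },
}
```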

Examples

Sample I/O

Input: 2000-word news article about quarterly earnings.
Expected: 200-word summary covering revenue, profit, guidance, and key highlights. Extractive: 5-6 selected sentences. Abstractive: a coherent paragraph.

Edge Cases

| Input | Expected | Why |
| --- | --- | --- |
| Very short input (< 100 words) | Return as-is or minimal trimming | Already concise |
| Multiple contradicting sections | Summary may miss nuance | Summarization favors dominant theme |
| Technical jargon | Extractive preserves, abstractive may simplify | Domain expertise affects quality |

Gotchas

  • ROUGE ≠ quality: ROUGE measures n-gram overlap with references. A high-ROUGE summary can be incoherent, and a low-ROUGE summary can be excellent with different word choices.
  • Input length limits: Transformer models have max token limits (512-4096). Long documents need chunking strategies (chunk-then-summarize or hierarchical summarization); see the sketch after this list.
  • Repetition: Abstractive models sometimes repeat phrases. Use repetition penalty during generation (no_repeat_ngram_size).
  • Position bias: In news text, important information is front-loaded (inverted pyramid). Simple "take first N sentences" is a strong extractive baseline.
  • Multi-document summarization: Summarizing multiple related documents requires handling redundancy and contradiction across sources.
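
A hedged chunk-then-summarize sketch, assuming a Hugging Face summarization pipeline like the one in Phase 2; word-count chunking is only a rough proxy for token limits, so chunk_words should be tuned to the model's tokenizer:

```python
# Chunk-then-summarize sketch for documents beyond the model's token limit.
def chunk_then_summarize(document: str, summarizer, chunk_words: int = 700) -> str:
    words = document.split()
    chunks = [" ".join(words[i:i + chunk_words])
              for i in range(0, len(words), chunk_words)]
    # Summarize each chunk independently...
    partials = [summarizer(c, max_length=100, min_length=20)[0]["summary_text"]
                for c in chunks]
    # ...then summarize the concatenated partial summaries (hierarchical pass).
    combined = " ".join(partials)
    return summarizer(combined, max_length=200, min_length=50)[0]["summary_text"]
```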

References

  • For TextRank/LexRank implementation details, see
    references/graph-based-extraction.md
  • For factual consistency checking, see
    references/factual-consistency.md