algo-nlp-summarization

Text Summarization

Overview

Text summarization condenses documents while preserving key information. Extractive methods select and concatenate important sentences from the original; abstractive methods generate new text that paraphrases the content. Extractive is simpler and more faithful; abstractive is more fluent but may hallucinate.

When to Use

Trigger conditions:
  • Condensing long documents, reports, or article collections
  • Building automated summary pipelines for content curation
  • Comparing extractive vs abstractive approaches for a use case
When NOT to use:
  • When full document understanding is needed (summarization loses detail)
  • For structured data extraction (use NER or information extraction)

Algorithm

IRON LAW: Abstractive Summarization Can HALLUCINATE
Abstractive models may generate fluent text containing facts NOT in
the source. Always verify key claims in abstractive summaries against
the original document. For high-stakes use cases (legal, medical),
prefer extractive or use abstractive with factual consistency checking.

Phase 1: Input Validation

Determine: input length, target summary length (ratio or word count), single-doc vs multi-doc, domain. Gate: Input text available, target length defined.
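
As an illustration only, the Phase 1 gate might be expressed as a small check like the sketch below; the function and parameter names (validate_input, target_words) are hypothetical, not from any library.

```python
# Minimal input-validation gate sketch (names are illustrative).
def validate_input(text: str, target_words: int) -> None:
    if not text or not text.strip():
        raise ValueError("input text is empty")
    input_words = len(text.split())
    if target_words <= 0 or target_words >= input_words:
        raise ValueError("target length must be positive and shorter than the input")
```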

Phase 2: Core Algorithm

Extractive (TextRank/LexRank):
  1. Split document into sentences
  2. Build similarity graph (sentence nodes, cosine similarity edges)
  3. Run PageRank on sentence graph
  4. Select top-k sentences by rank, reorder by original position
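
A minimal sketch of these four steps, assuming scikit-learn and networkx are available; the regex sentence splitter and the textrank_summary name are illustrative, and a production pipeline would use nltk or spaCy for sentence splitting:

```python
# TextRank-style extractive summarization sketch.
import re
import numpy as np
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def textrank_summary(text: str, k: int = 5) -> str:
    # 1. Split document into sentences (naive splitter; swap in nltk/spaCy).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if len(sentences) <= k:
        return text  # already concise (see Edge Cases)
    # 2. Build similarity graph: TF-IDF sentence vectors, cosine-similarity edges.
    tfidf = TfidfVectorizer().fit_transform(sentences)
    sim = cosine_similarity(tfidf)
    np.fill_diagonal(sim, 0.0)  # drop self-loops
    graph = nx.from_numpy_array(sim)
    # 3. Run PageRank on the sentence graph (edge weights = similarity).
    scores = nx.pagerank(graph, weight="weight")
    # 4. Select top-k sentences by rank, reorder by original position.
    top = sorted(sorted(scores, key=scores.get, reverse=True)[:k])
    return " ".join(sentences[i] for i in top)
```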
Abstractive (transformer-based):
  1. Use pre-trained model (BART, T5, Pegasus)
  2. Encode input document (handle length limits with chunking if needed)
  3. Generate summary with beam search
  4. Post-process: check for repetition, factual consistency
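
A hedged sketch of the abstractive path using the Hugging Face transformers summarization pipeline; the model choice and generation parameters below are illustrative defaults, not requirements:

```python
# Abstractive summarization sketch with a pre-trained BART model.
from transformers import pipeline

document = "..."  # placeholder: the full input text (must fit the token limit)

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
result = summarizer(
    document,
    max_length=200,           # upper bound on summary tokens
    min_length=50,            # lower bound to avoid degenerate one-liners
    num_beams=4,              # beam search (step 3)
    no_repeat_ngram_size=3,   # suppress repeated phrases (step 4 / Gotchas)
    do_sample=False,
)
print(result[0]["summary_text"])
```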

Phase 3: Verification

Evaluate: ROUGE scores (ROUGE-1, ROUGE-2, ROUGE-L) against reference summaries. Manual check for factual accuracy and coherence. Gate: ROUGE scores reasonable for domain, no hallucinations in spot-check.
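
One way to compute these scores is Google's rouge-score package (pip install rouge-score); the two summary strings below are illustrative placeholders:

```python
# ROUGE evaluation sketch using the rouge-score package.
from rouge_score import rouge_scorer

reference_summary = "The company beat revenue expectations and raised guidance."
generated_summary = "Revenue beat expectations; guidance was raised."

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference_summary, generated_summary)
for name, s in scores.items():
    print(f"{name}: P={s.precision:.3f} R={s.recall:.3f} F1={s.fmeasure:.3f}")
```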

Phase 4: Output

Return summary with metadata.

Output Format

```json
{
  "summary": "The company reported Q4 revenue of...",
  "method": "extractive_textrank",
  "metadata": {"input_words": 2000, "summary_words": 200, "compression_ratio": 0.10, "sentences_selected": 5}
}
```
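
A minimal sketch of assembling this payload on the extractive path, reusing the hypothetical textrank_summary function and document placeholder from the sketches above:

```python
# Assemble the output payload with summary metadata.
summary = textrank_summary(document, k=5)
input_words = len(document.split())
summary_words = len(summary.split())
output = {
    "summary": summary,
    "method": "extractive_textrank",
    "metadata": {
        "input_words": input_words,
        "summary_words": summary_words,
        "compression_ratio": round(summary_words / input_words, 2),
        "sentences_selected": 5,
    },
}
```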

Examples

Sample I/O

Input: 2000-word news article about quarterly earnings.
Expected: 200-word summary covering revenue, profit, guidance, and key highlights. Extractive: 5-6 selected sentences. Abstractive: a coherent paragraph.

Edge Cases

| Input | Expected | Why |
| --- | --- | --- |
| Very short input (< 100 words) | Return as-is or minimal trimming | Already concise |
| Multiple contradicting sections | Summary may miss nuance | Summarization favors dominant theme |
| Technical jargon | Extractive preserves, abstractive may simplify | Domain expertise affects quality |

Gotchas

  • ROUGE ≠ quality: ROUGE measures n-gram overlap with references. A high-ROUGE summary can be incoherent, and a low-ROUGE summary can be excellent with different word choices.
  • Input length limits: Transformer models have max token limits (512-4096). Long documents need chunking strategies (chunk-then-summarize or hierarchical summarization); see the sketch after this list.
  • Repetition: Abstractive models sometimes repeat phrases. Use repetition penalty during generation (no_repeat_ngram_size).
  • Position bias: In news text, important information is front-loaded (inverted pyramid). Simple "take first N sentences" is a strong extractive baseline.
  • Multi-document summarization: Summarizing multiple related documents requires handling redundancy and contradiction across sources.
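
A hedged chunk-then-summarize sketch, assuming a Hugging Face summarization pipeline like the one in Phase 2; word-count chunking is only a rough proxy for token limits, so chunk_words should be tuned to the model's tokenizer:

```python
# Chunk-then-summarize sketch for documents beyond the model's token limit.
def chunk_then_summarize(document: str, summarizer, chunk_words: int = 700) -> str:
    words = document.split()
    chunks = [" ".join(words[i:i + chunk_words])
              for i in range(0, len(words), chunk_words)]
    # Summarize each chunk independently...
    partials = [summarizer(c, max_length=100, min_length=20)[0]["summary_text"]
                for c in chunks]
    # ...then summarize the concatenated partial summaries (hierarchical pass).
    combined = " ".join(partials)
    return summarizer(combined, max_length=200, min_length=50)[0]["summary_text"]
```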

References

  • For TextRank/LexRank implementation details, see
    references/graph-based-extraction.md
  • For factual consistency checking, see
    references/factual-consistency.md