algo-nlp-summarization
Text Summarization
Overview
Text summarization condenses documents while preserving key information. Extractive: selects and concatenates important sentences from the original. Abstractive: generates new text that paraphrases the content. Extractive is simpler and more faithful; abstractive is more fluent but may hallucinate.
When to Use
Trigger conditions:
- Condensing long documents, reports, or article collections
- Building automated summary pipelines for content curation
- Comparing extractive vs abstractive approaches for a use case
When NOT to use:
- When full document understanding is needed (summarization loses detail)
- For structured data extraction (use NER or information extraction)
Algorithm
IRON LAW: Abstractive Summarization Can HALLUCINATE
Abstractive models may generate fluent text containing facts NOT in the source. Always verify key claims in abstractive summaries against the original document. For high-stakes use cases (legal, medical), prefer extractive or use abstractive with factual consistency checking.

Phase 1: Input Validation
Determine: input length, target summary length (ratio or word count), single-doc vs multi-doc, domain.
Gate: Input text available, target length defined.
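One way to make this gate concrete, as a minimal Python sketch; the SummarizationRequest fields and defaults here are illustrative assumptions, not a fixed schema.

```python
from dataclasses import dataclass

@dataclass
class SummarizationRequest:
    text: str
    target_ratio: float = 0.10    # target summary length as a fraction of input
    multi_doc: bool = False       # single-doc vs multi-doc pipeline
    domain: str = "news"          # affects model choice and evaluation

    def validate(self) -> None:
        # Gate: input text available, target length defined.
        if not self.text.strip():
            raise ValueError("input text is required")
        if not 0.0 < self.target_ratio < 1.0:
            raise ValueError("target_ratio must be in (0, 1)")
```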
Phase 2: Core Algorithm
Extractive (TextRank/LexRank), sketched in code after these steps:
- Split document into sentences
- Build similarity graph (sentence nodes, cosine similarity edges)
- Run PageRank on sentence graph
- Select top-k sentences by rank, reorder by original position
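A minimal sketch of the extractive pipeline, assuming scikit-learn and networkx are available; the regex sentence splitter is a simplification for illustration, not production tokenization.

```python
import re
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def textrank_summarize(document: str, top_k: int = 5) -> str:
    # Naive sentence splitter; use a real tokenizer (e.g. nltk) in practice.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    if len(sentences) <= top_k:
        return document  # already concise: return as-is (see Edge Cases)

    # Similarity graph: sentence nodes, cosine-similarity edges over TF-IDF vectors.
    tfidf = TfidfVectorizer().fit_transform(sentences)
    graph = nx.from_numpy_array(cosine_similarity(tfidf))

    # PageRank scores sentence centrality in the graph.
    scores = nx.pagerank(graph)

    # Top-k by rank, then reorder by original position.
    top = sorted(sorted(scores, key=scores.get, reverse=True)[:top_k])
    return " ".join(sentences[i] for i in top)
```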
Abstractive (transformer-based), sketched in code after these steps:
- Use pre-trained model (BART, T5, Pegasus)
- Encode input document (handle length limits with chunking if needed)
- Generate summary with beam search
- Post-process: check for repetition, factual consistency
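A corresponding sketch using the Hugging Face transformers pipeline; facebook/bart-large-cnn is one common pre-trained checkpoint, and the generation parameters shown are typical starting points rather than tuned values.

```python
from transformers import pipeline

# BART fine-tuned on CNN/DailyMail; T5 or Pegasus checkpoints work the same way.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def abstractive_summarize(document: str) -> str:
    result = summarizer(
        document,
        max_length=200,           # upper bound on summary tokens
        min_length=60,
        num_beams=4,              # beam search
        no_repeat_ngram_size=3,   # guards against phrase repetition (see Gotchas)
        truncation=True,          # inputs past the model limit need chunking instead
    )
    return result[0]["summary_text"]
```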
Phase 3: Verification
Evaluate: ROUGE scores (ROUGE-1, ROUGE-2, ROUGE-L) against reference summaries. Manual check for factual accuracy and coherence.
Gate: ROUGE scores reasonable for domain, no hallucinations in spot-check.
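A minimal evaluation sketch, assuming Google's rouge-score package; the reference and candidate strings below are hypothetical stand-ins for a gold summary and a system output.

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

reference = "The company reported strong Q4 revenue and raised guidance."
candidate = "Q4 revenue was strong and the company raised its guidance."

# score(target, prediction) returns precision/recall/F1 per ROUGE variant.
scores = scorer.score(reference, candidate)
print({name: round(s.fmeasure, 3) for name, s in scores.items()})
```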
Phase 4: Output
Return summary with metadata.
Output Format
```json
{
  "summary": "The company reported Q4 revenue of...",
  "method": "extractive_textrank",
  "metadata": {"input_words": 2000, "summary_words": 200, "compression_ratio": 0.10, "sentences_selected": 5}
}
```

Examples
Sample I/O
Input: 2000-word news article about quarterly earnings
Expected: 200-word summary covering revenue, profit, guidance, and key highlights. Extractive: 5-6 selected sentences. Abstractive: a coherent paragraph.
Edge Cases
| Input | Expected | Why |
|---|---|---|
| Very short input (< 100 words) | Return as-is or minimal trimming | Already concise |
| Multiple contradicting sections | Summary may miss nuance | Summarization favors dominant theme |
| Technical jargon | Extractive preserves, abstractive may simplify | Domain expertise affects quality |
Gotchas
- ROUGE ≠ quality: ROUGE measures n-gram overlap with references. A high-ROUGE summary can be incoherent, and a low-ROUGE summary can be excellent with different word choices.
- Input length limits: Transformer models have max token limits (512-4096). Long documents need chunking strategies such as chunk-then-summarize or hierarchical summarization; see the sketch after this list.
- Repetition: Abstractive models sometimes repeat phrases. Use repetition penalty during generation (no_repeat_ngram_size).
- Position bias: In news text, important information is front-loaded (inverted pyramid). Simple "take first N sentences" is a strong extractive baseline.
- Multi-document summarization: Summarizing multiple related documents requires handling redundancy and contradiction across sources.
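A minimal chunk-then-summarize sketch for the length-limit gotcha above, reusing the abstractive_summarize helper from the Phase 2 sketch; word counts are only a rough proxy for token counts, so the max_words value is an assumption to tune per model.

```python
def chunk_then_summarize(document: str, max_words: int = 700) -> str:
    # Split on word boundaries so each chunk fits the model's input limit.
    words = document.split()
    chunks = [" ".join(words[i:i + max_words])
              for i in range(0, len(words), max_words)]

    # Summarize each chunk, then summarize the concatenated partial summaries.
    partials = [abstractive_summarize(chunk) for chunk in chunks]
    combined = " ".join(partials)
    return abstractive_summarize(combined) if len(chunks) > 1 else combined
```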
References
- For TextRank/LexRank implementation details, see references/graph-based-extraction.md
- For factual consistency checking, see references/factual-consistency.md