algo-nlp-lda

LDA Topic Modeling


Overview


Latent Dirichlet Allocation models each document as a mixture of topics and each topic as a distribution over words. Discovers K latent topics from a corpus without supervision. Uses Gibbs sampling or variational inference. Complexity: O(N × K × iterations) where N = total word tokens.
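The generative story above can be sketched directly: draw a topic mixture per document from a Dirichlet, then for each token pick a topic and a word. This is a minimal illustration with made-up sizes (V=6 words, K=2 topics), not a fitted model.

```python
import numpy as np

rng = np.random.default_rng(0)

V, K = 6, 2            # vocabulary size, number of topics (illustrative)
alpha = 0.5            # document-topic concentration

# Topic-word distributions: one categorical distribution over V words per topic.
phi = rng.dirichlet([0.1] * V, size=K)           # shape (K, V)

def generate_document(n_words):
    """Sample one document from the LDA generative process."""
    theta = rng.dirichlet([alpha] * K)           # document's topic mixture
    words = []
    for _ in range(n_words):
        z = rng.choice(K, p=theta)               # pick a topic for this token
        w = rng.choice(V, p=phi[z])              # pick a word from that topic
        words.append(w)
    return theta, words

theta, doc = generate_document(20)
```

Inference (Gibbs sampling or variational inference) runs this story in reverse: given only the words, it recovers `theta` and `phi`.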

When to Use


Trigger conditions:
  • Discovering latent themes in a large document collection
  • Organizing/categorizing documents by automatically discovered topics
  • Exploratory text analysis when categories are unknown
When NOT to use:
  • When categories are known (use supervised classification)
  • For short texts (tweets, titles) — too few words per document for reliable topic assignment
  • When you need semantic understanding (use embeddings)

Algorithm


IRON LAW: The Number of Topics K Must Be Chosen, Not Discovered
LDA does NOT tell you how many topics exist. K is a hyperparameter.
Too few topics: overly broad, mixed themes. Too many: fragmented,
redundant topics. Use coherence score (C_v) to compare K values,
but the final choice requires human judgment on topic interpretability.
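A small helper can turn this rule into a candidate grid to sweep. One reading of the √(N/2) heuristic (assumed here: N = number of documents) gives a starting point; each candidate K must still be fit and compared by coherence plus human inspection.

```python
import math

def candidate_topic_counts(n_docs, lo=5, hi=50, step=5):
    """Candidate K grid around the sqrt(N/2) starting heuristic.

    LDA cannot discover K itself; every candidate must be fit
    separately and judged by coherence and interpretability.
    """
    k0 = max(lo, round(math.sqrt(n_docs / 2)))
    grid = sorted(set(list(range(lo, hi + 1, step)) + [k0]))
    return k0, grid

k0, grid = candidate_topic_counts(5000)
# sqrt(5000 / 2) = 50, so the heuristic start is K = 50
```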

Phase 1: Input Validation


Preprocess: tokenize, remove stop words, apply lemmatization. Build the document-term matrix (DTM). Filter: remove terms appearing in fewer than 5 documents or in more than 50% of documents. Gate: clean DTM, vocabulary size reasonable (1K-50K terms).
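The filtering step above can be sketched with stdlib counters. This is a minimal version that skips lemmatization and uses a tiny illustrative stop list; the toy corpus below relaxes `min_docs` so the example has survivors.

```python
from collections import Counter

STOP = {"the", "is", "and", "a", "of"}    # tiny illustrative stop list

def build_dtm(docs, min_docs=5, max_doc_frac=0.5):
    """Tokenize, drop stop words, and filter the vocabulary by document
    frequency: keep terms in >= min_docs docs and <= max_doc_frac of docs."""
    tokenized = [[w for w in d.lower().split() if w not in STOP] for d in docs]
    df = Counter(w for toks in tokenized for w in set(toks))   # document freq
    n = len(docs)
    vocab = sorted(w for w, c in df.items()
                   if c >= min_docs and c / n <= max_doc_frac)
    index = {w: i for i, w in enumerate(vocab)}
    dtm = [Counter(index[w] for w in toks if w in index) for toks in tokenized]
    return vocab, dtm

docs = ["stocks rise", "stocks fall", "rain today", "rain later", "stocks rain"]
vocab, dtm = build_dtm(docs, min_docs=2, max_doc_frac=0.8)
```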

Phase 2: Core Algorithm


  1. Choose K (start with √(N/2), try range K=5,10,15,20,...)
  2. Set hyperparameters: α = 50/K (document-topic density), β = 0.01 (topic-word density)
  3. Run LDA (Gibbs sampling: 1000+ iterations, or variational inference)
  4. Extract: topic-word distributions (top 10-20 words per topic) and document-topic distributions

Phase 3: Verification


Evaluate: topic coherence (C_v score, higher is better), manual inspection of top words per topic, check for "junk" topics (mixed incoherent words). Gate: Coherence score acceptable, topics are humanly interpretable.
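Coherence can be illustrated with the simpler UMass variant: it scores a topic's top words by how often they co-occur in the same documents. C_v (recommended above) is more involved, adding sliding windows and NPMI; in practice use a library implementation such as gensim's `CoherenceModel` rather than this sketch.

```python
import math
from itertools import combinations

def umass_coherence(top_words, docs):
    """UMass coherence for one topic: sum over ordered word pairs of
    log((D(wi, wj) + 1) / D(wj)), where D counts documents containing
    the word(s). Higher is better. Assumes each top word appears in
    at least one document (true when top words come from the corpus)."""
    doc_sets = [set(d) for d in docs]
    def df(*ws):
        return sum(all(w in s for w in ws) for s in doc_sets)
    score = 0.0
    for wi, wj in combinations(top_words, 2):
        score += math.log((df(wi, wj) + 1) / df(wj))
    return score

docs = [["cat", "dog"], ["cat", "dog", "fish"], ["fish", "chip"]]
coherent = umass_coherence(["cat", "dog"], docs)      # words co-occur
incoherent = umass_coherence(["cat", "chip"], docs)   # words never co-occur
```

A "junk" topic mixes words that rarely co-occur, so it scores lower than a coherent one.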

Phase 4: Output


Return topics with top words and document assignments.
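Extracting this output from the fitted count matrices is a pair of argmax/argsort operations. The counts and vocabulary below are made up for illustration; real values come from the fitted model.

```python
import numpy as np

vocab = ["revenue", "profit", "match", "goal"]
n_kw = np.array([[9.0, 7.0, 1.0, 0.0],     # illustrative topic-word counts
                 [0.0, 1.0, 8.0, 6.0]])
n_dk = np.array([[7.0, 1.0],               # illustrative doc-topic counts
                 [1.0, 5.0]])

def top_words(n_kw, vocab, n=2):
    """Top-n words per topic, ranked by topic-word count."""
    return [[vocab[i] for i in np.argsort(row)[::-1][:n]] for row in n_kw]

def dominant_topics(n_dk):
    """Each document's dominant topic from its normalized topic mixture."""
    theta = n_dk / n_dk.sum(axis=1, keepdims=True)
    return theta.argmax(axis=1).tolist()

words = top_words(n_kw, vocab)
dominant = dominant_topics(n_dk)
```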

Output Format


```json
{
  "topics": [{"id": 0, "label": "finance", "top_words": ["revenue", "profit", "quarter", "growth"], "coherence": 0.55}],
  "doc_topics": [{"doc_id": "d1", "dominant_topic": 0, "topic_distribution": [0.7, 0.1, 0.2]}],
  "metadata": {"K": 10, "coherence_avg": 0.48, "documents": 5000, "vocabulary": 8000}
}
```

Examples


Sample I/O


Input: 1000 news articles, K=5. Expected: topics such as {politics, sports, technology, business, entertainment}, with coherent top words per topic.

Edge Cases


| Input | Expected | Why |
|---|---|---|
| Very short documents | Poor topic assignment | Too few words for reliable mixture estimation |
| Homogeneous corpus | 1-2 topics dominate | All documents are similar, limited topic diversity |
| K=1 | Single topic = corpus vocabulary | Degenerate case, no discrimination |

Gotchas


  • Stop words MUST be removed: LDA will create "junk" topics dominated by common words ("the", "is", "and") if stop words remain.
  • Topic labeling is manual: LDA gives word distributions, NOT topic names. You must interpret and label topics based on top words.
  • Reproducibility: Gibbs sampling is stochastic. Different random seeds give different topics. Run multiple times and check stability.
  • Dynamic topics: Standard LDA assumes topics are static. For evolving corpora (news over years), use Dynamic Topic Models.
  • Hyperparameter sensitivity: Low α produces documents with fewer, more distinct topics. Low β produces topics with fewer, more specific words. Tune or use automatic methods.
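The hyperparameter-sensitivity point is easy to see empirically: low α concentrates a document's probability mass on few topics, high α spreads it out. The sketch below compares the average mass on each draw's single largest topic (α values are illustrative).

```python
import numpy as np

rng = np.random.default_rng(42)
K = 10

# Average probability mass on the single largest topic per draw,
# across many simulated document mixtures.
sparse = rng.dirichlet([0.1] * K, size=500).max(axis=1).mean()   # low alpha
dense = rng.dirichlet([10.0] * K, size=500).max(axis=1).mean()   # high alpha
```

With low α the mixtures are peaked (`sparse` is close to 1), so each document is dominated by a handful of topics; with high α they are near-uniform (`dense` is close to 1/K). The same logic applies to β and topic-word distributions.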

References


  • For coherence metrics and K selection, see
    references/topic-evaluation.md
  • For dynamic and correlated topic models, see
    references/advanced-lda.md