algo-nlp-lda

LDA Topic Modeling


Overview


Latent Dirichlet Allocation models each document as a mixture of topics and each topic as a distribution over words. Discovers K latent topics from a corpus without supervision. Uses Gibbs sampling or variational inference. Complexity: O(N × K × iterations) where N = total word tokens.
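The generative story above can be sketched directly: draw a topic mixture per document from a Dirichlet, then for each token pick a topic and a word. This is a minimal illustration with made-up sizes (V=6 words, K=2 topics), not a fitted model.

```python
import numpy as np

rng = np.random.default_rng(0)

V, K = 6, 2            # vocabulary size, number of topics (illustrative)
alpha = 0.5            # document-topic concentration

# Topic-word distributions: one categorical distribution over V words per topic.
phi = rng.dirichlet([0.1] * V, size=K)           # shape (K, V)

def generate_document(n_words):
    """Sample one document from the LDA generative process."""
    theta = rng.dirichlet([alpha] * K)           # document's topic mixture
    words = []
    for _ in range(n_words):
        z = rng.choice(K, p=theta)               # pick a topic for this token
        w = rng.choice(V, p=phi[z])              # pick a word from that topic
        words.append(w)
    return theta, words

theta, doc = generate_document(20)
```

Inference (Gibbs sampling or variational inference) runs this story in reverse: given only the words, it recovers `theta` and `phi`.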

When to Use


Trigger conditions:
  • Discovering latent themes in a large document collection
  • Organizing/categorizing documents by automatically discovered topics
  • Exploratory text analysis when categories are unknown
When NOT to use:
  • When categories are known (use supervised classification)
  • For short texts (tweets, titles) — too few words per document for reliable topic assignment
  • When you need semantic understanding (use embeddings)

Algorithm


IRON LAW: The Number of Topics K Must Be Chosen, Not Discovered
LDA does NOT tell you how many topics exist. K is a hyperparameter.
Too few topics: overly broad, mixed themes. Too many: fragmented,
redundant topics. Use coherence score (C_v) to compare K values,
but the final choice requires human judgment on topic interpretability.
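A small helper can turn this rule into a candidate grid to sweep. One reading of the √(N/2) heuristic (assumed here: N = number of documents) gives a starting point; each candidate K must still be fit and compared by coherence plus human inspection.

```python
import math

def candidate_topic_counts(n_docs, lo=5, hi=50, step=5):
    """Candidate K grid around the sqrt(N/2) starting heuristic.

    LDA cannot discover K itself; every candidate must be fit
    separately and judged by coherence and interpretability.
    """
    k0 = max(lo, round(math.sqrt(n_docs / 2)))
    grid = sorted(set(list(range(lo, hi + 1, step)) + [k0]))
    return k0, grid

k0, grid = candidate_topic_counts(5000)
# sqrt(5000 / 2) = 50, so the heuristic start is K = 50
```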

Phase 1: Input Validation


Preprocess: tokenize, remove stop words, apply lemmatization. Build the document-term matrix (DTM). Filter: remove terms appearing in fewer than 5 documents or in more than 50% of documents. Gate: clean DTM, vocabulary size reasonable (1K-50K terms).
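The filtering step above can be sketched with stdlib counters. This is a minimal version that skips lemmatization and uses a tiny illustrative stop list; the toy corpus below relaxes `min_docs` so the example has survivors.

```python
from collections import Counter

STOP = {"the", "is", "and", "a", "of"}    # tiny illustrative stop list

def build_dtm(docs, min_docs=5, max_doc_frac=0.5):
    """Tokenize, drop stop words, and filter the vocabulary by document
    frequency: keep terms in >= min_docs docs and <= max_doc_frac of docs."""
    tokenized = [[w for w in d.lower().split() if w not in STOP] for d in docs]
    df = Counter(w for toks in tokenized for w in set(toks))   # document freq
    n = len(docs)
    vocab = sorted(w for w, c in df.items()
                   if c >= min_docs and c / n <= max_doc_frac)
    index = {w: i for i, w in enumerate(vocab)}
    dtm = [Counter(index[w] for w in toks if w in index) for toks in tokenized]
    return vocab, dtm

docs = ["stocks rise", "stocks fall", "rain today", "rain later", "stocks rain"]
vocab, dtm = build_dtm(docs, min_docs=2, max_doc_frac=0.8)
```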

Phase 2: Core Algorithm


  1. Choose K (start with √(N/2), try range K=5,10,15,20,...)
  2. Set hyperparameters: α = 50/K (document-topic density), β = 0.01 (topic-word density)
  3. Run LDA (Gibbs sampling: 1000+ iterations, or variational inference)
  4. Extract: topic-word distributions (top 10-20 words per topic) and document-topic distributions

Phase 3: Verification


Evaluate: topic coherence (C_v score, higher is better), manual inspection of top words per topic, check for "junk" topics (mixed incoherent words). Gate: Coherence score acceptable, topics are humanly interpretable.
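Coherence can be illustrated with the simpler UMass variant: it scores a topic's top words by how often they co-occur in the same documents. C_v (recommended above) is more involved, adding sliding windows and NPMI; in practice use a library implementation such as gensim's `CoherenceModel` rather than this sketch.

```python
import math
from itertools import combinations

def umass_coherence(top_words, docs):
    """UMass coherence for one topic: sum over ordered word pairs of
    log((D(wi, wj) + 1) / D(wj)), where D counts documents containing
    the word(s). Higher is better. Assumes each top word appears in
    at least one document (true when top words come from the corpus)."""
    doc_sets = [set(d) for d in docs]
    def df(*ws):
        return sum(all(w in s for w in ws) for s in doc_sets)
    score = 0.0
    for wi, wj in combinations(top_words, 2):
        score += math.log((df(wi, wj) + 1) / df(wj))
    return score

docs = [["cat", "dog"], ["cat", "dog", "fish"], ["fish", "chip"]]
coherent = umass_coherence(["cat", "dog"], docs)      # words co-occur
incoherent = umass_coherence(["cat", "chip"], docs)   # words never co-occur
```

A "junk" topic mixes words that rarely co-occur, so it scores lower than a coherent one.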

Phase 4: Output


Return topics with top words and document assignments.
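Extracting this output from the fitted count matrices is a pair of argmax/argsort operations. The counts and vocabulary below are made up for illustration; real values come from the fitted model.

```python
import numpy as np

vocab = ["revenue", "profit", "match", "goal"]
n_kw = np.array([[9.0, 7.0, 1.0, 0.0],     # illustrative topic-word counts
                 [0.0, 1.0, 8.0, 6.0]])
n_dk = np.array([[7.0, 1.0],               # illustrative doc-topic counts
                 [1.0, 5.0]])

def top_words(n_kw, vocab, n=2):
    """Top-n words per topic, ranked by topic-word count."""
    return [[vocab[i] for i in np.argsort(row)[::-1][:n]] for row in n_kw]

def dominant_topics(n_dk):
    """Each document's dominant topic from its normalized topic mixture."""
    theta = n_dk / n_dk.sum(axis=1, keepdims=True)
    return theta.argmax(axis=1).tolist()

words = top_words(n_kw, vocab)
dominant = dominant_topics(n_dk)
```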

Output Format


```json
{
  "topics": [{"id": 0, "label": "finance", "top_words": ["revenue", "profit", "quarter", "growth"], "coherence": 0.55}],
  "doc_topics": [{"doc_id": "d1", "dominant_topic": 0, "topic_distribution": [0.7, 0.1, 0.2]}],
  "metadata": {"K": 10, "coherence_avg": 0.48, "documents": 5000, "vocabulary": 8000}
}
```

Examples


Sample I/O


Input: 1000 news articles, K=5. Expected: topics such as {politics, sports, technology, business, entertainment}, with coherent top words per topic.

Edge Cases


| Input | Expected | Why |
|---|---|---|
| Very short documents | Poor topic assignment | Too few words for reliable mixture estimation |
| Homogeneous corpus | 1-2 topics dominate | All documents are similar, limited topic diversity |
| K=1 | Single topic = corpus vocabulary | Degenerate case, no discrimination |

Gotchas


  • Stop words MUST be removed: LDA will create "junk" topics dominated by common words ("the", "is", "and") if stop words remain.
  • Topic labeling is manual: LDA gives word distributions, NOT topic names. You must interpret and label topics based on top words.
  • Reproducibility: Gibbs sampling is stochastic. Different random seeds give different topics. Run multiple times and check stability.
  • Dynamic topics: Standard LDA assumes topics are static. For evolving corpora (news over years), use Dynamic Topic Models.
  • Hyperparameter sensitivity: Low α produces documents with fewer, more distinct topics. Low β produces topics with fewer, more specific words. Tune or use automatic methods.
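The hyperparameter-sensitivity point is easy to see empirically: low α concentrates a document's probability mass on few topics, high α spreads it out. The sketch below compares the average mass on each draw's single largest topic (α values are illustrative).

```python
import numpy as np

rng = np.random.default_rng(42)
K = 10

# Average probability mass on the single largest topic per draw,
# across many simulated document mixtures.
sparse = rng.dirichlet([0.1] * K, size=500).max(axis=1).mean()   # low alpha
dense = rng.dirichlet([10.0] * K, size=500).max(axis=1).mean()   # high alpha
```

With low α the mixtures are peaked (`sparse` is close to 1), so each document is dominated by a handful of topics; with high α they are near-uniform (`dense` is close to 1/K). The same logic applies to β and topic-word distributions.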

References


  • For coherence metrics and K selection, see
    references/topic-evaluation.md
  • For dynamic and correlated topic models, see
    references/advanced-lda.md