algo-nlp-similarity

Text Similarity

Overview

Text similarity measures how close two texts are in meaning or surface form. Lexical methods (Jaccard, cosine on TF-IDF) compare word overlap; semantic methods (sentence embeddings) capture meaning even when the wording differs. The choice depends on whether you need surface-form matching or meaning matching.

When to Use

Trigger conditions:
  • Finding similar or duplicate documents in a collection
  • Matching queries to FAQ answers or knowledge base entries
  • Detecting plagiarism or content reuse
When NOT to use:
  • For topic-level grouping (use topic modeling / LDA)
  • For entity extraction from text (use NER)

Algorithm

IRON LAW: Lexical Similarity ≠ Semantic Similarity
"The car is fast" and "The automobile is speedy" have LOW lexical
similarity (different words) but HIGH semantic similarity (same meaning).
"Bank of the river" and "Bank account" have HIGH lexical similarity
but LOW semantic similarity. Choose the method that matches your
definition of "similar."

Phase 1: Input Validation

Determine: the similarity type needed (lexical or semantic), text preprocessing requirements, and the scale of computation (pairwise, all-pairs, or query-to-corpus). Gate: texts preprocessed, method selected.
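
A minimal preprocessing sketch in Python; the lowercasing, punctuation stripping, and whitespace tokenization here are illustrative choices, not requirements:

```python
import re

def preprocess(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, split on whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)
    return text.split()

print(preprocess("How to reset my password?"))
# ['how', 'to', 'reset', 'my', 'password']
```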

Phase 2: Core Algorithm

Lexical methods:
  • Jaccard: |A∩B| / |A∪B| on word sets
  • Cosine on TF-IDF vectors: cos(θ) = (A·B) / (|A|×|B|)
Semantic methods:
  • Sentence embeddings: encode texts with sentence-transformers (all-MiniLM-L6-v2)
  • Cosine similarity on embedding vectors
  • For large-scale: use FAISS or Annoy for approximate nearest neighbor search
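
A minimal sketch of both families, assuming scikit-learn and sentence-transformers are installed; the example pair is taken from Sample I/O below, and exact scores depend on tokenization and model:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

def jaccard(a: str, b: str) -> float:
    """|A ∩ B| / |A ∪ B| on word sets."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a or not set_b:
        return 0.0  # treat empty input as similarity 0; see Edge Cases
    return len(set_a & set_b) / len(set_a | set_b)

def tfidf_cosine(a: str, b: str) -> float:
    """Cosine similarity on TF-IDF vectors (in practice, fit on the full corpus)."""
    vectors = TfidfVectorizer().fit_transform([a, b])
    return float(cosine_similarity(vectors[0], vectors[1])[0, 0])

# Load the model once and reuse it; encoding is the expensive step.
model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_cosine(a: str, b: str) -> float:
    """Cosine similarity on normalized sentence embeddings."""
    emb = model.encode([a, b], normalize_embeddings=True)
    return float(emb[0] @ emb[1])  # dot product of unit vectors = cosine

a, b = "How to reset my password", "I forgot my login credentials"
print(jaccard(a, b), tfidf_cosine(a, b), semantic_cosine(a, b))
# lexical scores are near zero; the semantic score is high
```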

Phase 3: Verification

Spot-check: highly similar pairs should be genuinely similar. Low-similarity pairs should be genuinely different. Check threshold calibration. Gate: Similarity scores align with human judgment on sample pairs.
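
A sketch of one way to check threshold calibration, assuming a small set of human-labeled pairs; the labeled_pairs data below is hypothetical:

```python
# (similarity score, human judged "similar") pairs -- hypothetical labels
labeled_pairs = [(0.92, True), (0.81, True), (0.77, True), (0.40, False), (0.35, False)]

def f1_at(pairs, threshold):
    """F1 of predicting 'similar' when score >= threshold."""
    tp = sum(1 for s, y in pairs if s >= threshold and y)
    fp = sum(1 for s, y in pairs if s >= threshold and not y)
    fn = sum(1 for s, y in pairs if s < threshold and y)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

best = max((t / 100 for t in range(101)), key=lambda t: f1_at(labeled_pairs, t))
print(f"best threshold ~= {best:.2f}")
```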

Phase 4: Output

Return similarity scores or nearest neighbors.

Output Format

```json
{
  "similarities": [{"text_a": "doc1", "text_b": "doc5", "score": 0.92, "method": "semantic_cosine"}],
  "metadata": {"method": "sentence-transformers", "model": "all-MiniLM-L6-v2", "pairs_computed": 500}
}
```
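
A minimal sketch of assembling this output in Python; the field values are the illustrative ones from the schema above:

```python
import json

result = {
    "similarities": [
        {"text_a": "doc1", "text_b": "doc5", "score": 0.92, "method": "semantic_cosine"},
    ],
    "metadata": {
        "method": "sentence-transformers",
        "model": "all-MiniLM-L6-v2",
        "pairs_computed": 500,
    },
}
print(json.dumps(result, indent=2))
```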

Examples

Sample I/O

Input: Text A: "How to reset my password"; Text B: "I forgot my login credentials"
Expected: Lexical (Jaccard) ≈ 0.07 (almost no word overlap); semantic ≈ 0.82 (same intent).

Edge Cases

| Input | Expected | Why |
|---|---|---|
| Identical texts | Score = 1.0 | Exact match |
| Empty text | Undefined or 0 | Handle gracefully |
| Different languages | Lexical = 0; semantic depends on model | Multilingual models can match cross-language |
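
A small guard for the edge cases above, sketched as a wrapper around any pairwise scorer (such as the jaccard function in Phase 2); returning 0 for empty input is one reasonable convention:

```python
def guarded_similarity(a: str, b: str, scorer) -> float:
    """Apply edge-case rules before delegating to a similarity function."""
    if not a.strip() or not b.strip():
        return 0.0  # empty text: define as 0 rather than undefined
    if a.strip() == b.strip():
        return 1.0  # identical texts are an exact match
    return scorer(a, b)
```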

Gotchas

  • Threshold is use-case specific: 0.8 similarity might mean "duplicate" for deduplication but "somewhat related" for recommendation. Calibrate threshold on labeled examples.
  • Text length effects: Cosine on TF-IDF is sensitive to document length. Very short texts have sparse vectors with unreliable similarity. Use embeddings for short texts.
  • Embedding model choice: Different models have different strengths. all-MiniLM-L6-v2 is fast but less accurate than larger models. Match model to performance needs.
  • Computational scaling: All-pairs similarity on N documents is O(N²). For large corpora, use approximate methods (locality-sensitive hashing, FAISS); see the sketch after this list.
  • Domain adaptation: General-purpose embedding models may not capture domain-specific similarity (legal, medical). Fine-tune on domain data for best results.
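
A minimal sketch of scaling beyond brute-force O(N²) with FAISS, assuming faiss-cpu and sentence-transformers are installed; the corpus here is illustrative. IndexFlatIP is exact, and the commented line shows an approximate (HNSW) swap-in for genuinely large corpora:

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

corpus = ["how do I reset my password", "forgot my login credentials", "best pizza recipe"]
model = SentenceTransformer("all-MiniLM-L6-v2")

# Normalized embeddings make inner product equal to cosine similarity.
embeddings = model.encode(corpus, normalize_embeddings=True).astype(np.float32)

index = faiss.IndexFlatIP(embeddings.shape[1])  # exact inner-product search
# index = faiss.IndexHNSWFlat(embeddings.shape[1], 32)  # approximate, for large corpora
index.add(embeddings)

query = model.encode(["password reset help"], normalize_embeddings=True).astype(np.float32)
scores, ids = index.search(query, 2)  # top-2 nearest neighbors
print([(corpus[i], round(float(s), 3)) for i, s in zip(ids[0], scores[0])])
```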

References

  • For embedding model comparison and benchmarks, see
    references/model-benchmarks.md
  • For approximate nearest neighbor search at scale, see
    references/ann-search.md