Content Similarity Checker

Compare documents and text for similarity using multiple algorithms.

Features

  • Cosine Similarity: TF-IDF based comparison
  • Jaccard Similarity: Set-based comparison
  • Levenshtein Distance: Edit distance for short texts
  • Batch Comparison: Compare multiple documents
  • Similarity Matrix: Pairwise comparison of all documents
  • Reports: Detailed similarity reports

Quick Start

```python
from similarity_checker import SimilarityChecker

checker = SimilarityChecker()

# Compare two texts
score = checker.compare(
    "The quick brown fox jumps over the lazy dog",
    "A fast brown fox leaps over a sleepy dog"
)
print(f"Similarity: {score:.2%}")

# Compare documents
score = checker.compare_files("doc1.txt", "doc2.txt")
```

CLI Usage

```bash
# Compare two texts
python similarity_checker.py --text1 "Hello world" --text2 "Hello there world"

# Compare two files
python similarity_checker.py --file1 doc1.txt --file2 doc2.txt

# Compare all files in folder
python similarity_checker.py --folder ./documents/ --output matrix.csv

# Use specific algorithm
python similarity_checker.py --file1 doc1.txt --file2 doc2.txt --method jaccard

# Find similar documents (threshold)
python similarity_checker.py --folder ./documents/ --threshold 0.7

# JSON output
python similarity_checker.py --file1 doc1.txt --file2 doc2.txt --json
```
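To make the flag set above concrete, here is a minimal `argparse` sketch that accepts the same options. It is an illustration only, not the tool's actual parser: the `build_parser` name, the help strings, the `choices` list for `--method`, and the `cosine` default (borrowed from the class default) are all assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags shown in the CLI examples."""
    parser = argparse.ArgumentParser(prog="similarity_checker.py")
    parser.add_argument("--text1", help="first text to compare")
    parser.add_argument("--text2", help="second text to compare")
    parser.add_argument("--file1", help="first file to compare")
    parser.add_argument("--file2", help="second file to compare")
    parser.add_argument("--folder", help="compare every file in this folder")
    parser.add_argument("--output", help="where to write the similarity matrix")
    # Assumed: the method names match those accepted by SimilarityChecker.
    parser.add_argument(
        "--method",
        choices=["cosine", "jaccard", "levenshtein", "tfidf"],
        default="cosine",
    )
    parser.add_argument("--threshold", type=float, help="minimum similarity to report")
    parser.add_argument("--json", action="store_true", help="emit JSON output")
    return parser


args = build_parser().parse_args(
    ["--file1", "doc1.txt", "--file2", "doc2.txt", "--method", "jaccard", "--json"]
)
print(args.method, args.json)
```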

API Reference

SimilarityChecker Class

```python
# Signature summary; method bodies are omitted.
from typing import Optional

import pandas as pd


class SimilarityChecker:
    def __init__(self, method: str = "cosine"): ...

    # Text comparison
    def compare(self, text1: str, text2: str) -> float: ...
    def compare_files(self, file1: str, file2: str) -> float: ...

    # Multiple algorithms
    def compare_all_methods(self, text1: str, text2: str) -> dict: ...

    # Batch comparison
    def compare_to_corpus(self, text: str, corpus: list) -> list: ...
    def similarity_matrix(self, documents: list) -> pd.DataFrame: ...
    def find_duplicates(self, documents: list, threshold: float = 0.8) -> list: ...

    # Folder operations
    def compare_folder(self, folder: str, threshold: Optional[float] = None) -> dict: ...
    def find_most_similar(self, text: str, folder: str, top_n: int = 5) -> list: ...

    # Report
    def generate_report(self, output: str) -> str: ...
```

Similarity Methods

Cosine Similarity (Default)

Best for comparing documents of different lengths:

```python
checker = SimilarityChecker(method="cosine")
score = checker.compare(text1, text2)
# Returns: 0.0 to 1.0
```
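As a rough illustration of why cosine similarity tolerates length differences, the sketch below compares raw term-count vectors using only the standard library. It is not the library's implementation: the `tokenize` and `cosine_similarity` helpers are hypothetical, and it deliberately skips TF-IDF weighting to keep the idea visible.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def cosine_similarity(text1: str, text2: str) -> float:
    """Cosine of the angle between two raw term-count vectors."""
    a, b = Counter(tokenize(text1)), Counter(tokenize(text2))
    # Only terms present in both texts contribute to the dot product.
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty text is similar to nothing
    return dot / (norm_a * norm_b)


# Repeating a text doubles its length but leaves the vector's direction unchanged.
print(cosine_similarity("fox jumps", "fox jumps fox jumps"))
```

Because the score depends on the direction of the vectors rather than their magnitude, a short text and a longer text with the same word proportions score close to 1.0.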

Jaccard Similarity

Good for comparing sets of words/tokens:

```python
checker = SimilarityChecker(method="jaccard")
score = checker.compare(text1, text2)
# Returns: 0.0 to 1.0
```
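The Jaccard score is the size of the intersection of the two token sets divided by the size of their union. A minimal sketch, assuming simple lowercase word tokenization (the `jaccard_similarity` helper is hypothetical and the library may tokenize differently):

```python
import re


def jaccard_similarity(text1: str, text2: str) -> float:
    """|A ∩ B| / |A ∪ B| over the sets of lowercase word tokens."""
    a = set(re.findall(r"\w+", text1.lower()))
    b = set(re.findall(r"\w+", text2.lower()))
    if not a and not b:
        return 1.0  # assumption: two empty texts count as identical
    return len(a & b) / len(a | b)


# Two shared words out of three distinct words overall.
print(f"{jaccard_similarity('Hello world', 'Hello there world'):.2f}")
```

Word order and repetition are ignored, which is why this measure suits set-like comparisons better than free-flowing prose.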

Levenshtein (Edit Distance)

Best for short texts and typo detection:

```python
checker = SimilarityChecker(method="levenshtein")
score = checker.compare(text1, text2)
# Returns: 0.0 to 1.0 (normalized)
```
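One common way to turn an edit distance into a 0.0 to 1.0 score is to divide by the length of the longer string and subtract from 1. The sketch below assumes that normalization; the document does not state which one the library uses, and both helper names are hypothetical.

```python
def levenshtein_distance(s: str, t: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(s) < len(t):
        s, t = t, s  # keep the shorter string in the inner loop
    previous = list(range(len(t) + 1))
    for i, char_s in enumerate(s, start=1):
        current = [i]
        for j, char_t in enumerate(t, start=1):
            cost = 0 if char_s == char_t else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(s: str, t: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(s, t) / longest


print(levenshtein_distance("kitten", "sitting"))
```

The dynamic-programming table costs time proportional to the product of the two lengths, which is one reason the method is recommended for short texts.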

TF-IDF + Cosine

Advanced: weights each term by its importance before comparing:

```python
checker = SimilarityChecker(method="tfidf")
score = checker.compare(text1, text2)
```
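"Term importance" here means that words appearing in many documents get a lower weight than words that are rare. The following standard-library sketch shows the idea with a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`. The helper names are hypothetical and the library's exact weighting is not documented here, so treat this as an assumption-laden illustration.

```python
import math
import re
from collections import Counter


def tfidf_vectors(docs: list[str]) -> list[dict[str, float]]:
    """Build one sparse TF-IDF vector (term -> weight) per document."""
    tokenized = [re.findall(r"\w+", doc.lower()) for doc in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}
    return [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


vectors = tfidf_vectors(["the cat", "the dog"])
# "the" occurs in both documents, so it is weighted below "cat".
print(vectors[0]["the"] < vectors[0]["cat"])
print(round(cosine(vectors[0], vectors[1]), 3))
```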

Batch Comparison

Compare to Corpus

```python
checker = SimilarityChecker()

target = "Machine learning is a subset of artificial intelligence."
corpus = [
    "AI includes machine learning and deep learning.",
    "Python is a programming language.",
    "Neural networks power deep learning systems."
]

results = checker.compare_to_corpus(target, corpus)

# Returns:
# [
#     {"index": 0, "similarity": 0.65, "text": "AI includes..."},
#     {"index": 2, "similarity": 0.42, "text": "Neural networks..."},
#     {"index": 1, "similarity": 0.12, "text": "Python is..."}
# ]
```
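The result above is sorted by similarity while each entry keeps its original corpus index. A minimal sketch of that ranking step, using a word-set Jaccard score as a stand-in for whichever method the checker is configured with (`word_jaccard` and `rank_against_corpus` are hypothetical names):

```python
import re


def word_jaccard(a: str, b: str) -> float:
    """Stand-in scorer: Jaccard similarity over lowercase word sets."""
    x = set(re.findall(r"\w+", a.lower()))
    y = set(re.findall(r"\w+", b.lower()))
    return len(x & y) / len(x | y) if x | y else 1.0


def rank_against_corpus(text: str, corpus: list[str]) -> list[dict]:
    """Score every corpus entry against `text`, most similar first."""
    scored = [
        {"index": i, "similarity": word_jaccard(text, doc), "text": doc}
        for i, doc in enumerate(corpus)
    ]
    return sorted(scored, key=lambda item: item["similarity"], reverse=True)


ranked = rank_against_corpus(
    "machine learning is fun",
    ["python is a language", "machine learning is hard", "cooking pasta"],
)
print([item["index"] for item in ranked])
```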

Similarity Matrix

```python
documents = [
    "Document one content...",
    "Document two content...",
    "Document three content..."
]

matrix = checker.similarity_matrix(documents)

# Returns DataFrame:
#        doc_0  doc_1  doc_2
# doc_0  1.000  0.750  0.320
# doc_1  0.750  1.000  0.410
# doc_2  0.320  0.410  1.000
```
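The matrix is symmetric with ones on the diagonal, so only the upper triangle needs to be computed. The sketch below builds it as nested lists with a pluggable scorer; the real method returns a pandas DataFrame, and both function names here are hypothetical.

```python
from typing import Callable


def word_jaccard(a: str, b: str) -> float:
    """Stand-in scorer: Jaccard similarity over whitespace-separated words."""
    x, y = set(a.lower().split()), set(b.lower().split())
    return len(x & y) / len(x | y) if x | y else 1.0


def pairwise_matrix(
    documents: list[str], score: Callable[[str, str], float]
) -> list[list[float]]:
    """Fill a symmetric n x n matrix by scoring each unordered pair once."""
    n = len(documents)
    matrix = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = score(documents[i], documents[j])
    return matrix


for row in pairwise_matrix(["a b", "a b c", "x y"], word_jaccard):
    print(" ".join(f"{value:.3f}" for value in row))
```

For n documents this performs n(n-1)/2 comparisons, which matches the `"comparisons": 3` figure reported for three files elsewhere in this document.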

Find Duplicates

```python
documents = [...]  # List of texts

duplicates = checker.find_duplicates(documents, threshold=0.85)

# Returns:
# [
#     {"doc1_index": 0, "doc2_index": 3, "similarity": 0.92},
#     {"doc1_index": 2, "doc2_index": 7, "similarity": 0.88}
# ]
```
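Duplicate detection amounts to scoring every unordered pair and keeping those at or above the threshold. A minimal sketch with `itertools.combinations`; the function names, the inclusive `>=` comparison, and the word-set scorer are assumptions rather than the library's documented behavior.

```python
from itertools import combinations


def word_jaccard(a: str, b: str) -> float:
    """Stand-in scorer: Jaccard similarity over whitespace-separated words."""
    x, y = set(a.lower().split()), set(b.lower().split())
    return len(x & y) / len(x | y) if x | y else 1.0


def find_duplicate_pairs(documents: list[str], threshold: float = 0.8) -> list[dict]:
    """Return index pairs whose similarity meets the threshold, best match first."""
    pairs = []
    for (i, doc_i), (j, doc_j) in combinations(enumerate(documents), 2):
        similarity = word_jaccard(doc_i, doc_j)
        if similarity >= threshold:  # assumption: the threshold is inclusive
            pairs.append({"doc1_index": i, "doc2_index": j, "similarity": similarity})
    return sorted(pairs, key=lambda pair: pair["similarity"], reverse=True)


docs = ["a b c d", "x y z", "a b c d", "a b c e"]
print(find_duplicate_pairs(docs, threshold=0.9))
```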

Compare All Methods

Get similarity scores from all algorithms:

```python
checker = SimilarityChecker()
results = checker.compare_all_methods(text1, text2)

# Returns:
# {
#     "cosine": 0.82,
#     "jaccard": 0.65,
#     "levenshtein": 0.71,
#     "tfidf": 0.78,
#     "average": 0.74
# }
```

Folder Operations

Compare All Files in Folder

```python
checker = SimilarityChecker()
results = checker.compare_folder("./documents/")

# Returns:
# {
#     "files": ["doc1.txt", "doc2.txt", "doc3.txt"],
#     "comparisons": 3,
#     "similar_pairs": [
#         {"file1": "doc1.txt", "file2": "doc3.txt", "similarity": 0.87}
#     ],
#     "matrix": <DataFrame>
# }
```

Find Most Similar to Query

```python
query = "Your search text here..."
results = checker.find_most_similar(query, "./documents/", top_n=5)

# Returns:
# [
#     {"file": "doc3.txt", "similarity": 0.89},
#     {"file": "doc1.txt", "similarity": 0.72},
#     ...
# ]
```

Output Format

Comparison Result

```python
result = checker.compare_with_details(text1, text2)

# Returns:
# {
#     "similarity": 0.82,
#     "method": "cosine",
#     "text1_length": 150,
#     "text2_length": 180,
#     "common_words": 25,
#     "unique_words_text1": 10,
#     "unique_words_text2": 15,
#     "interpretation": "High similarity - likely related content"
# }
```
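The word-level fields in this result can be derived from simple set operations. The sketch below shows one plausible derivation; it assumes lengths are counted in characters and words are lowercase tokens, neither of which the document confirms, and `overlap_details` is a hypothetical helper.

```python
import re


def overlap_details(text1: str, text2: str) -> dict:
    """Word-overlap statistics for two texts (assumed field semantics)."""
    words1 = set(re.findall(r"\w+", text1.lower()))
    words2 = set(re.findall(r"\w+", text2.lower()))
    return {
        "text1_length": len(text1),  # assumption: length in characters
        "text2_length": len(text2),
        "common_words": len(words1 & words2),
        "unique_words_text1": len(words1 - words2),
        "unique_words_text2": len(words2 - words1),
    }


print(overlap_details("Hello world", "Hello there world"))
```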

Example Workflows

Plagiarism Check

```python
from pathlib import Path

checker = SimilarityChecker()

# Read the submission to check
submission = Path("student_paper.txt").read_text()

# Rank the source files by similarity to the submission
results = checker.find_most_similar(submission, "./source_materials/", top_n=10)

# Flag sources above a 60% similarity threshold
suspicious = [r for r in results if r["similarity"] > 0.6]

if suspicious:
    print(f"Warning: Found {len(suspicious)} potentially similar sources")
    for r in suspicious:
        print(f"  {r['file']}: {r['similarity']:.0%}")
```

Document Deduplication

```python
from pathlib import Path

checker = SimilarityChecker()

# Load all documents
docs = {}
for file in Path("./articles/").glob("*.txt"):
    docs[file.name] = file.read_text()

# Find near-duplicates
duplicates = checker.find_duplicates(list(docs.values()), threshold=0.9)
print(f"Found {len(duplicates)} duplicate pairs")
```

Content Matching

```python
checker = SimilarityChecker()

query = "Best practices for Python web development"
results = checker.find_most_similar(query, "./blog_posts/", top_n=10)

print("Most relevant articles:")
for r in results:
    print(f"  {r['file']}: {r['similarity']:.0%} match")
```

Dependencies

  • scikit-learn>=1.3.0
  • nltk>=3.8.0
  • numpy>=1.24.0
  • pandas>=2.0.0