content-similarity-checker
Content Similarity Checker
Compare documents and text for similarity using multiple algorithms.
Features
- Cosine Similarity: TF-IDF based comparison
- Jaccard Similarity: Set-based comparison
- Levenshtein Distance: Edit distance for short texts
- Batch Comparison: Compare multiple documents
- Similarity Matrix: Pairwise comparison of all documents
- Reports: Detailed similarity reports
Quick Start
python
from similarity_checker import SimilarityChecker

checker = SimilarityChecker()

# Compare two texts
score = checker.compare(
    "The quick brown fox jumps over the lazy dog",
    "A fast brown fox leaps over a sleepy dog"
)
print(f"Similarity: {score:.2%}")

# Compare documents
score = checker.compare_files("doc1.txt", "doc2.txt")
CLI Usage
bash
# Compare two texts
python similarity_checker.py --text1 "Hello world" --text2 "Hello there world"

# Compare two files
python similarity_checker.py --file1 doc1.txt --file2 doc2.txt

# Compare all files in folder
python similarity_checker.py --folder ./documents/ --output matrix.csv

# Use specific algorithm
python similarity_checker.py --file1 doc1.txt --file2 doc2.txt --method jaccard

# Find similar documents (threshold)
python similarity_checker.py --folder ./documents/ --threshold 0.7

# JSON output
python similarity_checker.py --file1 doc1.txt --file2 doc2.txt --json
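The flags used in the commands above could be declared with `argparse` roughly as follows. This is only a sketch of one possible parser, not the tool's actual implementation: the function name `build_parser`, the help strings, the `choices` list, and the defaults are all assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser covering the flags shown in the CLI examples."""
    parser = argparse.ArgumentParser(prog="similarity_checker.py")
    parser.add_argument("--text1", help="first text to compare")
    parser.add_argument("--text2", help="second text to compare")
    parser.add_argument("--file1", help="first file to compare")
    parser.add_argument("--file2", help="second file to compare")
    parser.add_argument("--folder", help="folder whose files are compared pairwise")
    parser.add_argument("--output", help="path for the similarity matrix (e.g. CSV)")
    # Assumption: the accepted method names match the API's `method` values.
    parser.add_argument(
        "--method",
        choices=["cosine", "jaccard", "levenshtein", "tfidf"],
        default="cosine",
    )
    parser.add_argument("--threshold", type=float, help="minimum similarity to report")
    parser.add_argument("--json", action="store_true", help="print results as JSON")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```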
API Reference
SimilarityChecker Class
python
import pandas as pd

class SimilarityChecker:
    def __init__(self, method: str = "cosine"): ...

    # Text comparison
    def compare(self, text1: str, text2: str) -> float: ...
    def compare_files(self, file1: str, file2: str) -> float: ...
    def compare_with_details(self, text1: str, text2: str) -> dict: ...

    # Multiple algorithms
    def compare_all_methods(self, text1: str, text2: str) -> dict: ...

    # Batch comparison
    def compare_to_corpus(self, text: str, corpus: list) -> list: ...
    def similarity_matrix(self, documents: list) -> pd.DataFrame: ...
    def find_duplicates(self, documents: list, threshold: float = 0.8) -> list: ...

    # Folder operations
    def compare_folder(self, folder: str, threshold: float = None) -> dict: ...
    def find_most_similar(self, text: str, folder: str, top_n: int = 5) -> list: ...

    # Report
    def generate_report(self, output: str) -> str: ...
Similarity Methods
Cosine Similarity (Default)
Best for comparing documents of different lengths:
python
checker = SimilarityChecker(method="cosine")
score = checker.compare(text1, text2)
# Returns: 0.0 to 1.0
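To illustrate why cosine similarity copes well with documents of different lengths, here is a minimal standard-library sketch. It assumes plain term-count vectors and a simple regex tokenizer; the library itself may tokenize and weight terms differently (for example with TF-IDF), and the names `tokenize` and `cosine_similarity` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text1: str, text2: str) -> float:
    """Cosine of the angle between the two term-count vectors (assumed weighting)."""
    a, b = Counter(tokenize(text1)), Counter(tokenize(text2))
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare against
    return dot / (norm_a * norm_b)
```

Because the score depends only on the angle between the vectors, repeating a text (`"fox dog"` versus `"fox dog fox dog"`) leaves the similarity at 1.0 up to floating-point rounding.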
Jaccard Similarity
Good for comparing sets of words/tokens:
python
checker = SimilarityChecker(method="jaccard")
score = checker.compare(text1, text2)
# Returns: 0.0 to 1.0
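The set-based idea can be sketched in a few lines: the score is the number of shared words divided by the number of distinct words across both texts. The tokenizer and the function name `jaccard_similarity` are assumptions for illustration, as is the choice to score two empty texts as 0.0.

```python
import re


def jaccard_similarity(text1: str, text2: str) -> float:
    """Size of the word-set intersection divided by the size of the union."""
    set1 = set(re.findall(r"[a-z0-9]+", text1.lower()))
    set2 = set(re.findall(r"[a-z0-9]+", text2.lower()))
    union = set1 | set2
    if not union:
        return 0.0  # assumption: two empty texts have no measurable overlap
    return len(set1 & set2) / len(union)


score = jaccard_similarity(
    "The quick brown fox jumps over the lazy dog",
    "A fast brown fox leaps over a sleepy dog",
)
# 4 shared words (brown, fox, over, dog) out of 12 distinct words
print(f"{score:.3f}")
```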
Levenshtein (Edit Distance)
Best for short texts, typo detection:
python
checker = SimilarityChecker(method="levenshtein")
score = checker.compare(text1, text2)
# Returns: 0.0 to 1.0 (normalized)
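One common way to turn an edit distance into a 0.0 to 1.0 score is to divide by the length of the longer string and subtract from 1. The sketch below assumes that normalization; the source does not say which one the library uses, and both function names are hypothetical.

```python
def levenshtein_distance(s1: str, s2: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(s1) < len(s2):
        s1, s2 = s2, s1  # keep the shorter string in the inner loop
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1,               # delete c1
                current[j - 1] + 1,            # insert c2
                previous[j - 1] + (c1 != c2),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(s1: str, s2: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1.0 - levenshtein_distance(s1, s2) / longest
```

The dynamic-programming table costs time proportional to the product of the two lengths, which is why this method suits short texts better than whole documents.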
TF-IDF + Cosine
Advanced: considers term importance:
python
checker = SimilarityChecker(method="tfidf")
score = checker.compare(text1, text2)
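"Term importance" here means that words appearing in many documents count for less than rare ones. A minimal standard-library sketch follows, assuming the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1` (the default formula in scikit-learn's `TfidfVectorizer`). The function names and the optional `background` argument are hypothetical and not part of the documented API.

```python
import math
import re
from collections import Counter


def tfidf_vectors(documents: list[str]) -> list[dict[str, float]]:
    """Weight each term count by a smoothed inverse document frequency."""
    tokenized = [re.findall(r"[a-z0-9]+", doc.lower()) for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def tfidf_similarity(text1: str, text2: str, background: tuple[str, ...] = ()) -> float:
    """Compare two texts; extra background documents sharpen the IDF weights."""
    vectors = tfidf_vectors([text1, text2, *background])
    return cosine(vectors[0], vectors[1])
```

With this weighting, `"the cat"` and `"the dog"` score below the 0.5 that raw term counts would give, because the shared word "the" occurs in both documents and is down-weighted relative to "cat" and "dog".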
Batch Comparison
Compare to Corpus
python
checker = SimilarityChecker()
target = "Machine learning is a subset of artificial intelligence."
corpus = [
    "AI includes machine learning and deep learning.",
    "Python is a programming language.",
    "Neural networks power deep learning systems."
]
results = checker.compare_to_corpus(target, corpus)

Returns:
[
    {"index": 0, "similarity": 0.65, "text": "AI includes..."},
    {"index": 2, "similarity": 0.42, "text": "Neural networks..."},
    {"index": 1, "similarity": 0.12, "text": "Python is..."}
]
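The result above is sorted by descending similarity while keeping each entry's original corpus index. A rough sketch of that ranking step is shown here, using a word-set Jaccard score as a stand-in scorer. The helper names are hypothetical, and the scores it produces will not match the illustrative values above, which come from the library's own method.

```python
import re
from typing import Callable


def jaccard(text1: str, text2: str) -> float:
    """Stand-in scorer: shared words divided by distinct words."""
    a = set(re.findall(r"[a-z0-9]+", text1.lower()))
    b = set(re.findall(r"[a-z0-9]+", text2.lower()))
    return len(a & b) / len(a | b) if a | b else 0.0


def rank_against_corpus(
    text: str,
    corpus: list[str],
    score: Callable[[str, str], float] = jaccard,
) -> list[dict]:
    """Score `text` against every corpus entry and sort best match first."""
    results = [
        {"index": i, "similarity": score(text, doc), "text": doc}
        for i, doc in enumerate(corpus)
    ]
    return sorted(results, key=lambda r: r["similarity"], reverse=True)


ranked = rank_against_corpus("red apple", ["green pear", "red apple pie", "red car"])
print([r["index"] for r in ranked])
```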
Similarity Matrix
python
documents = [
    "Document one content...",
    "Document two content...",
    "Document three content..."
]
matrix = checker.similarity_matrix(documents)

Returns DataFrame:
       doc_0  doc_1  doc_2
doc_0  1.000  0.750  0.320
doc_1  0.750  1.000  0.410
doc_2  0.320  0.410  1.000
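The matrix is symmetric with 1.0 on the diagonal, so only the pairs above the diagonal need to be scored. The sketch below builds it as nested lists to stay within the standard library; the scorer and function names are hypothetical, and the real method returns a pandas DataFrame instead.

```python
import re
from itertools import combinations
from typing import Callable


def jaccard(text1: str, text2: str) -> float:
    """Stand-in scorer: shared words divided by distinct words."""
    a = set(re.findall(r"[a-z0-9]+", text1.lower()))
    b = set(re.findall(r"[a-z0-9]+", text2.lower()))
    return len(a & b) / len(a | b) if a | b else 0.0


def similarity_matrix(
    documents: list[str],
    score: Callable[[str, str], float] = jaccard,
) -> list[list[float]]:
    """Pairwise scores; each pair is computed once and mirrored."""
    n = len(documents)
    matrix = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i, j in combinations(range(n), 2):
        matrix[i][j] = matrix[j][i] = score(documents[i], documents[j])
    return matrix


matrix = similarity_matrix(["a b", "a c", "d"])
labels = [f"doc_{i}" for i in range(len(matrix))]
for label, row in zip(labels, matrix):
    print(label, " ".join(f"{value:.3f}" for value in row))
```

If pandas is available, `pd.DataFrame(matrix, index=labels, columns=labels)` would give the labeled layout shown above.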
Find Duplicates
python
documents = [...]  # List of texts
duplicates = checker.find_duplicates(documents, threshold=0.85)

Returns:
[
    {"doc1_index": 0, "doc2_index": 3, "similarity": 0.92},
    {"doc1_index": 2, "doc2_index": 7, "similarity": 0.88}
]
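Duplicate detection of this kind can be thought of as scoring every unordered pair and keeping those at or above the threshold. The following is a sketch under that assumption; whether the library treats the threshold as inclusive is not stated in the source, and the scorer is a hypothetical stand-in.

```python
import re
from itertools import combinations
from typing import Callable


def jaccard(text1: str, text2: str) -> float:
    """Stand-in scorer: shared words divided by distinct words."""
    a = set(re.findall(r"[a-z0-9]+", text1.lower()))
    b = set(re.findall(r"[a-z0-9]+", text2.lower()))
    return len(a & b) / len(a | b) if a | b else 0.0


def find_duplicates(
    documents: list[str],
    threshold: float = 0.8,
    score: Callable[[str, str], float] = jaccard,
) -> list[dict]:
    """Return every pair scoring at least `threshold`, best match first."""
    pairs = []
    for i, j in combinations(range(len(documents)), 2):
        similarity = score(documents[i], documents[j])
        if similarity >= threshold:  # assumption: the threshold is inclusive
            pairs.append({"doc1_index": i, "doc2_index": j, "similarity": similarity})
    return sorted(pairs, key=lambda p: p["similarity"], reverse=True)
```

Scoring all pairs grows quadratically with the number of documents, so large collections may need a cheaper pre-filter before the pairwise pass.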
Compare All Methods
Get similarity scores from all algorithms:
python
checker = SimilarityChecker()
results = checker.compare_all_methods(text1, text2)

Returns:
{
    "cosine": 0.82,
    "jaccard": 0.65,
    "levenshtein": 0.71,
    "tfidf": 0.78,
    "average": 0.74
}
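The `average` value in the example is consistent with an unweighted mean of the four method scores: (0.82 + 0.65 + 0.71 + 0.78) / 4 = 0.74. A small check of that arithmetic is below; that the library averages exactly this way is an assumption, and `with_average` is a hypothetical helper.

```python
from statistics import mean


def with_average(scores: dict[str, float]) -> dict[str, float]:
    """Copy the per-method scores and append their unweighted mean."""
    result = dict(scores)
    result["average"] = round(mean(scores.values()), 2)
    return result


scores = {"cosine": 0.82, "jaccard": 0.65, "levenshtein": 0.71, "tfidf": 0.78}
combined = with_average(scores)
print(combined["average"])
```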
Folder Operations
Compare All Files in Folder
python
checker = SimilarityChecker()
results = checker.compare_folder("./documents/")

Returns:
{
    "files": ["doc1.txt", "doc2.txt", "doc3.txt"],
    "comparisons": 3,
    "similar_pairs": [
        {"file1": "doc1.txt", "file2": "doc3.txt", "similarity": 0.87}
    ],
    "matrix": <DataFrame>
}
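The `comparisons` count follows from the number of unordered file pairs: n files give n(n-1)/2 comparisons, so three files give three. The sketch below enumerates those pairs with `pathlib` and `itertools`; the `*.txt` pattern and the name `folder_pairs` are assumptions, since the source does not say which file types the tool reads.

```python
from itertools import combinations
from pathlib import Path


def folder_pairs(folder: str, pattern: str = "*.txt") -> tuple[list[str], list[tuple[str, str]]]:
    """List matching file names and every unordered pair of them."""
    files = sorted(p.name for p in Path(folder).glob(pattern) if p.is_file())
    return files, list(combinations(files, 2))


if __name__ == "__main__":
    files, pairs = folder_pairs("./documents/")
    print(f"{len(files)} files, {len(pairs)} comparisons")
```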
Find Most Similar to Query
python
query = "Your search text here..."
results = checker.find_most_similar(query, "./documents/", top_n=5)

Returns:
[
    {"file": "doc3.txt", "similarity": 0.89},
    {"file": "doc1.txt", "similarity": 0.72},
    ...
]
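Selecting the `top_n` best matches does not require sorting every result; `heapq.nlargest` keeps only the highest scores. The sketch below works on an in-memory mapping of file name to text rather than a real folder, and its scorer and function names are hypothetical stand-ins for the library's internals.

```python
import heapq
import re
from typing import Callable


def jaccard(text1: str, text2: str) -> float:
    """Stand-in scorer: shared words divided by distinct words."""
    a = set(re.findall(r"[a-z0-9]+", text1.lower()))
    b = set(re.findall(r"[a-z0-9]+", text2.lower()))
    return len(a & b) / len(a | b) if a | b else 0.0


def most_similar(
    query: str,
    documents: dict[str, str],
    top_n: int = 5,
    score: Callable[[str, str], float] = jaccard,
) -> list[dict]:
    """Return the `top_n` documents most similar to `query`, best first."""
    scored = (
        {"file": name, "similarity": score(query, text)}
        for name, text in documents.items()
    )
    return heapq.nlargest(top_n, scored, key=lambda r: r["similarity"])


docs = {
    "a.txt": "python web development",
    "b.txt": "cooking recipes",
    "c.txt": "python tips",
}
top = most_similar("python web", docs, top_n=2)
print([r["file"] for r in top])
```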
Output Format
Comparison Result
python
result = checker.compare_with_details(text1, text2)

Returns:
{
    "similarity": 0.82,
    "method": "cosine",
    "text1_length": 150,
    "text2_length": 180,
    "common_words": 25,
    "unique_words_text1": 10,
    "unique_words_text2": 15,
    "interpretation": "High similarity - likely related content"
}
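The word-level fields in this result can be understood as set operations on the two vocabularies. Below is a rough sketch of how such counts might be derived, assuming lowercase word sets from a simple regex tokenizer. The helper name `word_overlap_stats` is hypothetical, and the library's own tokenization may give different numbers.

```python
import re


def word_overlap_stats(text1: str, text2: str) -> dict[str, int]:
    """Count shared words and words that appear in only one of the texts."""
    words1 = set(re.findall(r"[a-z0-9]+", text1.lower()))
    words2 = set(re.findall(r"[a-z0-9]+", text2.lower()))
    return {
        "common_words": len(words1 & words2),
        "unique_words_text1": len(words1 - words2),
        "unique_words_text2": len(words2 - words1),
    }


stats = word_overlap_stats(
    "The quick brown fox jumps over the lazy dog",
    "A fast brown fox leaps over a sleepy dog",
)
print(stats)
```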
Example Workflows
Plagiarism Check
python
checker = SimilarityChecker()

# Read the submission and rank the source materials against it
with open("student_paper.txt") as f:
    submission = f.read()
results = checker.find_most_similar(submission, "./source_materials/", top_n=10)

# Flag sources whose similarity to the submission exceeds 60%
suspicious = [r for r in results if r["similarity"] > 0.6]
if suspicious:
    print(f"Warning: Found {len(suspicious)} potentially similar sources")
    for r in suspicious:
        print(f"  {r['file']}: {r['similarity']:.0%}")
Document Deduplication
python
from pathlib import Path

checker = SimilarityChecker()

# Load all documents
docs = {}
for file in Path("./articles/").glob("*.txt"):
    docs[file.name] = file.read_text()

# Find near-duplicates
duplicates = checker.find_duplicates(list(docs.values()), threshold=0.9)
print(f"Found {len(duplicates)} duplicate pairs")
Content Matching
python
checker = SimilarityChecker()
query = "Best practices for Python web development"
results = checker.find_most_similar(query, "./blog_posts/", top_n=10)
print("Most relevant articles:")
for r in results:
    print(f"  {r['file']}: {r['similarity']:.0%} match")
Dependencies
- scikit-learn>=1.3.0
- nltk>=3.8.0
- numpy>=1.24.0
- pandas>=2.0.0