keyword-extractor
Keyword Extractor
Extract important keywords and key phrases from text documents using multiple algorithms. Supports TF-IDF, RAKE, and simple frequency analysis with word cloud visualization.
Quick Start
```python
from scripts.keyword_extractor import KeywordExtractor

# Extract keywords
extractor = KeywordExtractor()
keywords = extractor.extract("Your long text document here...")
print(keywords[:10])  # Top 10 keywords

# From file
keywords = extractor.extract_from_file("document.txt")
extractor.to_wordcloud("keywords.png")
```
Features
- Multiple Algorithms: TF-IDF, RAKE, frequency-based
- Key Phrases: Extract multi-word phrases, not just single words
- Scoring: Relevance scores for ranking
- Stopword Filtering: Built-in + custom stopwords
- N-gram Support: Unigrams, bigrams, trigrams
- Word Cloud: Visualize keyword importance
- Batch Processing: Process multiple documents
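To make the frequency, stopword, and n-gram features concrete, here is a minimal standard-library sketch. It is not the `KeywordExtractor` implementation: the `tokenize`, `ngrams`, and `frequency_keywords` helpers and the tiny `STOPWORDS` set are hypothetical names chosen for illustration, and only the `min_word_length`, `ngram_range`, and `max_keywords` parameter names are borrowed from the API described in this document.

```python
import re
from collections import Counter

# A deliberately tiny stopword set for illustration; real lists are much longer.
STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}


def tokenize(text, min_word_length=3):
    """Lowercase the text and keep alphabetic tokens that pass both filters."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if len(w) >= min_word_length and w not in STOPWORDS]


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def frequency_keywords(text, ngram_range=(1, 2), max_keywords=5):
    """Count n-grams across the requested range and return the most common."""
    tokens = tokenize(text)
    counts = Counter()
    for n in range(ngram_range[0], ngram_range[1] + 1):
        counts.update(ngrams(tokens, n))
    return counts.most_common(max_keywords)


sample = "Machine learning models learn from data. Machine learning needs data."
top_keywords = frequency_keywords(sample)
print(top_keywords)
```

Because stopwords are removed before n-grams are built, a phrase in this sketch can span a removed word or a sentence boundary. A stricter version would break phrases at those points.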
API Reference
Initialization
```python
extractor = KeywordExtractor(
    method="tfidf",        # tfidf, rake, frequency
    max_keywords=20,       # Maximum keywords to return
    min_word_length=3,     # Minimum word length
    ngram_range=(1, 3),    # Unigrams to trigrams
)
```
Extraction Methods
```python
# TF-IDF (best for comparing documents)
keywords = extractor.extract(text, method="tfidf")

# RAKE (best for key phrases)
keywords = extractor.extract(text, method="rake")

# Frequency (simple word counts)
keywords = extractor.extract(text, method="frequency")
```
Results Format
```python
keywords = extractor.extract(text)
# Returns list of tuples: [(keyword, score), ...]
# [('machine learning', 0.85), ('data science', 0.72), ...]

# Get just keywords
keyword_list = extractor.get_keywords(text)
# ['machine learning', 'data science', ...]
```
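Since results are plain `(keyword, score)` tuples, ordinary Python is enough to post-process them. The sketch below uses a hand-written sample list in that shape (the third entry and the `0.5` cutoff are made up for illustration) and does not call the extractor at all.

```python
# Hypothetical sample in the documented shape: [(keyword, score), ...]
keywords = [
    ("machine learning", 0.85),
    ("data science", 0.72),
    ("neural networks", 0.41),
]

# Rank by score, highest first (a no-op here, but safe if order is not guaranteed).
ranked = sorted(keywords, key=lambda pair: pair[1], reverse=True)

# Keep only the keyword strings, which is what get_keywords() is described as returning.
keyword_list = [keyword for keyword, _ in ranked]

# Keep the pairs whose score clears an arbitrary threshold.
strong = [(keyword, score) for keyword, score in ranked if score >= 0.5]

# Build a keyword-to-score lookup table.
score_by_keyword = dict(ranked)

print(keyword_list)
print(strong)
```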
Customization
```python
# Add custom stopwords
extractor.add_stopwords(['company', 'product', 'service'])

# Set minimum frequency
extractor.min_frequency = 2

# Filter by part of speech (nouns only)
extractor.pos_filter = ['NN', 'NNS', 'NNP']
```
Visualization
```python
# Generate word cloud
extractor.to_wordcloud("wordcloud.png", colormap="viridis")

# Bar chart of top keywords
extractor.plot_keywords("keywords.png", top_n=15)
```
Export
```python
# To JSON
extractor.to_json("keywords.json")

# To CSV
extractor.to_csv("keywords.csv")

# To plain text
extractor.to_text("keywords.txt")
```
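If you need a different file layout, the result tuples can be written with the standard library alone. This is a sketch under assumptions: the `write_json` and `write_csv` helpers, the `keyword`/`score` field names, and the temporary output directory are all hypothetical, and the layout actually produced by `to_json` and `to_csv` is not specified in this document.

```python
import csv
import json
import tempfile
from pathlib import Path

# Sample results in the documented (keyword, score) shape.
keywords = [("machine learning", 0.85), ("data science", 0.72)]


def write_json(pairs, path):
    """Write the pairs as a JSON array of {"keyword", "score"} objects."""
    records = [{"keyword": keyword, "score": score} for keyword, score in pairs]
    Path(path).write_text(json.dumps(records, indent=2), encoding="utf-8")


def write_csv(pairs, path):
    """Write the pairs as CSV rows under a header line."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["keyword", "score"])
        writer.writerows(pairs)


# Write into a throwaway directory so the example leaves the working tree untouched.
out_dir = Path(tempfile.mkdtemp())
json_path = out_dir / "keywords.json"
csv_path = out_dir / "keywords.csv"
write_json(keywords, json_path)
write_csv(keywords, csv_path)
print(json_path.read_text(encoding="utf-8"))
```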
CLI Usage
```bash
# Extract from text
python keyword_extractor.py --text "Your text here" --top 10

# Extract from file
python keyword_extractor.py --input document.txt --method tfidf --output keywords.json

# Generate word cloud
python keyword_extractor.py --input document.txt --wordcloud cloud.png

# Batch process directory
python keyword_extractor.py --input-dir ./docs --output keywords_all.csv
```
CLI Arguments
| Argument | Description | Default |
|---|---|---|
| `--text` | Text to analyze | - |
| `--input` | Input file path | - |
| `--input-dir` | Directory of files | - |
| `--output` | Output file | - |
| `--method` | Algorithm (tfidf, rake, frequency) | |
| `--top` | Number of keywords | 20 |
| | N-gram range (e.g., "1,2") | |
| `--wordcloud` | Generate word cloud | - |
| | Custom stopwords file | - |
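For readers wiring up a similar command line, the flags that appear in the usage examples could be declared with `argparse` roughly as follows. This is a hypothetical sketch, not the script's actual parser: the `build_parser` helper, the mutually exclusive input group, and the `tfidf` default for `--method` are assumptions, and the n-gram and stopwords options are left out because their flag names do not appear in this document.

```python
import argparse


def build_parser():
    """Declare the flags shown in the CLI usage examples."""
    parser = argparse.ArgumentParser(description="Extract keywords from text.")

    # Assumption: exactly one input source is given per run.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--text", help="Text to analyze")
    source.add_argument("--input", help="Input file path")
    source.add_argument("--input-dir", help="Directory of files")

    parser.add_argument("--output", help="Output file")
    parser.add_argument(
        "--method",
        choices=["tfidf", "rake", "frequency"],
        default="tfidf",  # assumed default; the documented default is not shown
        help="Algorithm",
    )
    parser.add_argument("--top", type=int, default=20, help="Number of keywords")
    parser.add_argument("--wordcloud", metavar="PATH", help="Generate word cloud")
    return parser


# Parse an explicit argument list instead of sys.argv so the example is repeatable.
args = build_parser().parse_args(
    ["--input", "document.txt", "--method", "rake", "--top", "10"]
)
print(args)
```

`argparse` converts `--input-dir` to the attribute `args.input_dir`, and options that are not supplied come back as `None` or as their declared default.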
Examples
Article Keyword Extraction
```python
extractor = KeywordExtractor(method="tfidf")

article = """
Machine learning is transforming data science. Deep learning models
are achieving state-of-the-art results in natural language processing
and computer vision. Neural networks continue to advance...
"""

keywords = extractor.extract(article, top_n=10)
for keyword, score in keywords:
    print(f"{score:.3f}: {keyword}")
```
Compare Multiple Documents
```python
from pathlib import Path

extractor = KeywordExtractor(method="tfidf")

# Read each file with pathlib so the file handles are closed promptly
paths = ["doc1.txt", "doc2.txt", "doc3.txt"]
docs = [Path(path).read_text() for path in paths]

# Extract keywords from each
for i, doc in enumerate(docs, start=1):
    keywords = extractor.extract(doc, top_n=5)
    print(f"\nDocument {i}:")
    for kw, score in keywords:
        print(f"  {kw}: {score:.3f}")
```
SEO Keyword Research
```python
extractor = KeywordExtractor(
    method="rake",
    ngram_range=(2, 4),  # Focus on phrases
    max_keywords=30,
)

# Read the page inside a context manager so the file is closed afterwards
with open("page.html") as page:
    webpage_content = page.read()
keywords = extractor.extract(webpage_content)

# Filter by score threshold
high_value = [(kw, s) for kw, s in keywords if s > 0.5]
print("High-value keywords for SEO:")
for kw, score in high_value:
    print(f"  {kw}")
```
Algorithm Comparison
| Algorithm | Best For | Strengths |
|---|---|---|
| TF-IDF | Document comparison | Finds unique terms, good for search |
| RAKE | Key phrases | Extracts multi-word concepts |
| Frequency | Quick overview | Simple, fast, interpretable |
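To illustrate why RAKE favors multi-word concepts, here is a compact sketch of its core idea: split the text into candidate phrases at stopwords and punctuation, score each word as degree divided by frequency, and sum the word scores per phrase. It is an illustrative re-implementation, not the code behind `method="rake"`; the `candidate_phrases` and `rake` helpers and the short `STOPWORDS` set are hypothetical.

```python
import re
from collections import defaultdict

# A deliberately tiny stopword set for illustration; real lists are much longer.
STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}


def candidate_phrases(text):
    """Split text into runs of consecutive non-stopwords, broken at punctuation."""
    phrases = []
    for fragment in re.split(r"[^a-z\s-]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOPWORDS:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    return phrases


def rake(text):
    """Return (phrase, score) pairs sorted from highest to lowest score."""
    phrases = candidate_phrases(text)
    frequency = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            frequency[word] += 1
            degree[word] += len(phrase)  # co-occurrences, counting the word itself
    word_score = {word: degree[word] / frequency[word] for word in frequency}
    scores = {" ".join(p): sum(word_score[w] for w in p) for p in phrases}
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


text = (
    "Deep learning models are achieving strong results in "
    "natural language processing and computer vision."
)
ranked_phrases = rake(text)
for phrase, score in ranked_phrases:
    print(f"{score:.1f}  {phrase}")
```

Scores in this sketch grow with phrase length and are not normalized to the 0 to 1 range. Whether `KeywordExtractor` rescales its RAKE scores is not stated here, so check the actual score range before choosing a threshold.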
Dependencies
```
scikit-learn>=1.2.0
nltk>=3.8.0
pandas>=2.0.0
matplotlib>=3.7.0
wordcloud>=1.9.0
```
Limitations
- Optimized for English (other languages need language-specific stopwords)
- Very short texts may not have enough data for TF-IDF
- Domain-specific jargon may need custom stopword handling
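The TF-IDF limitation can be seen with a small pure-Python sketch. It uses one common smoothed variant, tf = count / document length and idf = ln((1 + N) / (1 + df)) + 1, which may differ from what `KeywordExtractor` computes; the `tokenize` and `tfidf` helpers and the sample sentences are hypothetical. With a single short document every term receives the same idf, so the ranking collapses to raw term frequency; only a corpus lets rarer terms stand out.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return weights


# One short document: df == N == 1 for every term, so all idf values are equal.
single = tfidf(["neural networks learn features"])[0]

# Three documents: "networks" occurs in all of them, "pruning" in only one.
corpus = tfidf([
    "neural networks learn features",
    "pruning shrinks neural networks",
    "road networks carry traffic",
])

print(single)
print(corpus[1])
```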