keyword-extractor


Keyword Extractor


Extract important keywords and key phrases from text documents using multiple algorithms. Supports TF-IDF, RAKE, and simple frequency analysis with word cloud visualization.

Quick Start

python
from scripts.keyword_extractor import KeywordExtractor

# Extract keywords
extractor = KeywordExtractor()
keywords = extractor.extract("Your long text document here...")
print(keywords[:10])  # Top 10 keywords

# From file
keywords = extractor.extract_from_file("document.txt")
extractor.to_wordcloud("keywords.png")

Features


  • Multiple Algorithms: TF-IDF, RAKE, frequency-based
  • Key Phrases: Extract multi-word phrases, not just single words
  • Scoring: Relevance scores for ranking
  • Stopword Filtering: Built-in + custom stopwords
  • N-gram Support: Unigrams, bigrams, trigrams
  • Word Cloud: Visualize keyword importance
  • Batch Processing: Process multiple documents
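To make the n-gram and stopword features concrete, here is a minimal standard-library sketch of how candidate unigrams through trigrams might be generated. The function name `candidate_ngrams` and the tiny stopword set are assumptions for illustration only; the library's own tokenizer and stopword list are not documented here and may behave differently.

```python
import re

# Tiny illustrative stopword set (an assumption); real lists are much longer.
STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}


def candidate_ngrams(text, ngram_range=(1, 3), min_word_length=3, stopwords=STOPWORDS):
    """Return n-grams whose words are all long enough and not stopwords."""
    # Lowercase and keep alphanumeric words, allowing internal hyphens.
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    low, high = ngram_range
    candidates = []
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if all(len(w) >= min_word_length and w not in stopwords for w in gram):
                candidates.append(" ".join(gram))
    return candidates


print(candidate_ngrams("Machine learning is transforming data science", (1, 2)))
```

This sketch slides its window over the whole text, so an n-gram can span a sentence boundary; a fuller implementation would likely split on sentences first.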

API Reference


Initialization


python
extractor = KeywordExtractor(
    method="tfidf",      # tfidf, rake, frequency
    max_keywords=20,     # Maximum keywords to return
    min_word_length=3,   # Minimum word length
    ngram_range=(1, 3)   # Unigrams to trigrams
)

Extraction Methods

python
# TF-IDF (best for comparing documents)
keywords = extractor.extract(text, method="tfidf")

# RAKE (best for key phrases)
keywords = extractor.extract(text, method="rake")

# Frequency (simple word counts)
keywords = extractor.extract(text, method="frequency")
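For intuition about why RAKE favors multi-word phrases, the following is a rough sketch of RAKE-style scoring using only the standard library. It is not the implementation behind `method="rake"`; the function name `rake_sketch` and the small stopword set are assumptions. Candidate phrases are runs of words between stopwords or punctuation, each word scores degree divided by frequency, and a phrase scores the sum of its word scores.

```python
import re
from collections import defaultdict

# Small illustrative stopword set (an assumption).
STOPWORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "the", "to"}


def rake_sketch(text, stopwords=STOPWORDS):
    """Return (phrase, score) pairs ranked by a RAKE-style score."""
    phrases = []
    # Punctuation ends a candidate phrase, and so does any stopword.
    for fragment in re.split(r"[^\w\s-]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in stopwords:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)

    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences, counting the word itself

    scores = {" ".join(p): sum(degree[w] / freq[w] for w in p) for p in phrases}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


text = ("Deep learning models are achieving results in natural language "
        "processing and computer vision.")
for phrase, score in rake_sketch(text):
    print(f"{score:.1f}: {phrase}")
```

Because longer phrases accumulate more degree, three-word phrases outrank two-word ones here, which matches the "best for key phrases" description above.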

Results Format

python
keywords = extractor.extract(text)
# Returns list of tuples: [(keyword, score), ...]
# [('machine learning', 0.85), ('data science', 0.72), ...]

# Get just keywords
keyword_list = extractor.get_keywords(text)
# ['machine learning', 'data science', ...]
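Since the result is a plain list of `(keyword, score)` tuples, ordinary Python is enough to post-process it. The sketch below uses made-up sample data in the documented shape; the helper `top_keywords` is hypothetical and not part of the library.

```python
# Sample data in the documented (keyword, score) shape; the values are made up.
keywords = [("machine learning", 0.85), ("data science", 0.72), ("neural networks", 0.41)]


def top_keywords(pairs, n=10, min_score=0.0):
    """Return up to n keyword strings scoring at least min_score, best first."""
    ranked = sorted(pairs, key=lambda pair: pair[1], reverse=True)
    return [kw for kw, score in ranked if score >= min_score][:n]


scores_by_keyword = dict(keywords)  # keyword -> score lookup

print(top_keywords(keywords, n=2))
print(scores_by_keyword["data science"])
```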

Customization

python
# Add custom stopwords
extractor.add_stopwords(['company', 'product', 'service'])

# Set minimum frequency
extractor.min_frequency = 2

# Filter by part of speech (nouns only)
extractor.pos_filter = ['NN', 'NNS', 'NNP']
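As an illustration of how custom stopwords and a minimum frequency interact, here is a small frequency-based sketch built on `collections.Counter`. The function `frequency_keywords` and its parameters are assumptions that mirror the settings above; they are not the library's internals.

```python
import re
from collections import Counter


def frequency_keywords(text, stopwords=(), min_frequency=2, min_word_length=3):
    """Count words, then drop stopwords, short words, and rare words."""
    stop = {w.lower() for w in stopwords}
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if len(w) >= min_word_length and w not in stop)
    # most_common() sorts by count, highest first.
    return [(w, c) for w, c in counts.most_common() if c >= min_frequency]


text = ("Our company ships a product. The product team and the company blog "
        "cover data, data, and more data.")
print(frequency_keywords(text, stopwords=["company", "product", "the", "and"]))
```

Without the custom stopwords, "company" and "product" would pass the frequency threshold and crowd out the more informative term.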

Visualization

python
# Generate word cloud
extractor.to_wordcloud("wordcloud.png", colormap="viridis")

# Bar chart of top keywords
extractor.plot_keywords("keywords.png", top_n=15)

Export

python
# To JSON
extractor.to_json("keywords.json")

# To CSV
extractor.to_csv("keywords.csv")

# To plain text
extractor.to_text("keywords.txt")
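The exact layout of the exported files is not specified above. If you need control over it, the `(keyword, score)` list can be written out with the standard library, as in this sketch. The helper `export_keywords` and the `keyword`/`score` field names are assumptions, not the library's output format.

```python
import csv
import json
import tempfile
from pathlib import Path


def export_keywords(pairs, json_path, csv_path):
    """Write (keyword, score) pairs to a JSON file and a CSV file."""
    records = [{"keyword": kw, "score": score} for kw, score in pairs]
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
    # newline="" lets the csv module control line endings itself.
    with open(csv_path, "w", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["keyword", "score"])
        writer.writeheader()
        writer.writerows(records)


pairs = [("machine learning", 0.85), ("data science", 0.72)]
with tempfile.TemporaryDirectory() as tmp:
    json_path = Path(tmp) / "keywords.json"
    csv_path = Path(tmp) / "keywords.csv"
    export_keywords(pairs, json_path, csv_path)
    loaded = json.loads(json_path.read_text(encoding="utf-8"))
    csv_text = csv_path.read_text(encoding="utf-8")

print(loaded)
print(csv_text)
```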

CLI Usage

bash
# Extract from text
python keyword_extractor.py --text "Your text here" --top 10

# Extract from file
python keyword_extractor.py --input document.txt --method tfidf --output keywords.json

# Generate word cloud
python keyword_extractor.py --input document.txt --wordcloud cloud.png

# Batch process directory
python keyword_extractor.py --input-dir ./docs --output keywords_all.csv

CLI Arguments

| Argument | Description | Default |
|---|---|---|
| --text | Text to analyze | - |
| --input | Input file path | - |
| --input-dir | Directory of files | - |
| --output | Output file | - |
| --method | Algorithm (tfidf, rake, frequency) | tfidf |
| --top | Number of keywords | 20 |
| --ngrams | N-gram range (e.g., "1,2") | 1,3 |
| --wordcloud | Generate word cloud | - |
| --stopwords | Custom stopwords file | - |
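For readers wiring up a similar interface, the argument table could be expressed with `argparse` roughly as follows. This is a sketch, not the script's actual parser: treating `--text`, `--input`, and `--input-dir` as mutually exclusive and required is an assumption, as is the `parse_ngrams` helper.

```python
import argparse


def parse_ngrams(value):
    """Turn a string like "1,3" into the tuple (1, 3)."""
    low, high = (int(part) for part in value.split(","))
    if low < 1 or high < low:
        raise argparse.ArgumentTypeError("expected MIN,MAX with 1 <= MIN <= MAX")
    return (low, high)


def build_parser():
    parser = argparse.ArgumentParser(description="Extract keywords from text")
    # Assumption: exactly one input source must be given.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--text", help="Text to analyze")
    source.add_argument("--input", help="Input file path")
    source.add_argument("--input-dir", help="Directory of files")
    parser.add_argument("--output", help="Output file")
    parser.add_argument("--method", choices=["tfidf", "rake", "frequency"],
                        default="tfidf", help="Algorithm")
    parser.add_argument("--top", type=int, default=20, help="Number of keywords")
    parser.add_argument("--ngrams", type=parse_ngrams, default=(1, 3),
                        help='N-gram range (e.g., "1,2")')
    parser.add_argument("--wordcloud", help="Path for a generated word cloud image")
    parser.add_argument("--stopwords", help="Custom stopwords file")
    return parser


args = build_parser().parse_args(["--input", "document.txt", "--ngrams", "1,2"])
print(args.method, args.top, args.ngrams)
```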

Examples


Article Keyword Extraction


python
extractor = KeywordExtractor(method="tfidf")

article = """
Machine learning is transforming data science. Deep learning models
are achieving state-of-the-art results in natural language processing
and computer vision. Neural networks continue to advance...
"""

keywords = extractor.extract(article, top_n=10)
for keyword, score in keywords:
    print(f"{score:.3f}: {keyword}")

Compare Multiple Documents

python
extractor = KeywordExtractor(method="tfidf")

docs = [
    open("doc1.txt").read(),
    open("doc2.txt").read(),
    open("doc3.txt").read()
]

# Extract keywords from each
for i, doc in enumerate(docs):
    keywords = extractor.extract(doc, top_n=5)
    print(f"\nDocument {i+1}:")
    for kw, score in keywords:
        print(f" {kw}: {score:.3f}")
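To see why TF-IDF surfaces terms that distinguish one document from the others, here is a plain-Python sketch of the textbook formula (term frequency times the log of inverse document frequency). The function `tfidf_top_terms` is hypothetical. scikit-learn's `TfidfVectorizer` applies IDF smoothing and L2 normalization by default, so its scores will differ from these.

```python
import math
import re
from collections import Counter


def tfidf_top_terms(docs, top_n=3):
    """Return the top_n (term, score) pairs for each document."""
    tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in docs]
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(docs)
    results = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Highest score first; ties broken alphabetically for stable output.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        results.append(ranked[:top_n])
    return results


docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the bird flew over the fog",
]
for i, top in enumerate(tfidf_top_terms(docs, top_n=2), start=1):
    print(f"Document {i}: {[term for term, _ in top]}")
```

A term that appears in every document, such as "the", gets an IDF of zero and drops out of the ranking. With a single document every term scores zero under this formula, which is one way to see why very short or isolated inputs give TF-IDF little to work with.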

SEO Keyword Research

python
extractor = KeywordExtractor(
    method="rake",
    ngram_range=(2, 4),  # Focus on phrases
    max_keywords=30
)

webpage_content = open("page.html").read()
keywords = extractor.extract(webpage_content)

# Filter by score threshold
high_value = [(kw, s) for kw, s in keywords if s > 0.5]
print("High-value keywords for SEO:")
for kw, score in high_value:
    print(f" {kw}")
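The example above reads `page.html` as raw text. It is not stated whether the extractor strips markup, so tag names and script contents may end up among the keywords. One possible preprocessing step, sketched with the standard library's `html.parser`, reduces the page to its visible text first. The `TextOnly` class and `html_to_text` function are illustrative names, not part of the library.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


page = ("<html><head><style>p{color:red}</style></head><body>"
        "<h1>Keyword research</h1><p>Long-tail <b>search terms</b></p>"
        "<script>var x = 1;</script></body></html>")
print(html_to_text(page))
```

The cleaned string can then be passed to `extractor.extract(...)` in place of the raw file contents.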

Algorithm Comparison

| Algorithm | Best For | Strengths |
|---|---|---|
| TF-IDF | Document comparison | Finds unique terms, good for search |
| RAKE | Key phrases | Extracts multi-word concepts |
| Frequency | Quick overview | Simple, fast, interpretable |

Dependencies


scikit-learn>=1.2.0
nltk>=3.8.0
pandas>=2.0.0
matplotlib>=3.7.0
wordcloud>=1.9.0

Limitations


  • Optimized for English (other languages need language-specific stopwords)
  • Very short texts may not have enough data for TF-IDF
  • Domain-specific jargon may need custom stopword handling