Keyword Extractor

Extract important keywords and key phrases from text documents using one of several algorithms. Supports TF-IDF, RAKE, and simple frequency analysis, plus word cloud visualization.

Quick Start

python

from scripts.keyword_extractor import KeywordExtractor

# Extract keywords
extractor = KeywordExtractor()
keywords = extractor.extract("Your long text document here...")
print(keywords[:10])  # Top 10 keywords

# From file
keywords = extractor.extract_from_file("document.txt")
extractor.to_wordcloud("keywords.png")

Features

Multiple Algorithms: TF-IDF, RAKE, frequency-based
Key Phrases: Extract multi-word phrases, not just single words
Scoring: Relevance scores for ranking
Stopword Filtering: Built-in + custom stopwords
N-gram Support: Unigrams, bigrams, trigrams
Word Cloud: Visualize keyword importance
Batch Processing: Process multiple documents

API Reference

Initialization

python

extractor = KeywordExtractor(
    method="tfidf",      # tfidf, rake, frequency
    max_keywords=20,     # Maximum keywords to return
    min_word_length=3,   # Minimum word length
    ngram_range=(1, 3)   # Unigrams to trigrams
)
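
To illustrate what an `ngram_range` of `(1, 3)` means, here is a minimal standard-library sketch. The helper name `ngrams` is hypothetical and is not taken from the tool's source; it only shows how unigrams through trigrams could be enumerated from a token list.

```python
def ngrams(tokens, ngram_range=(1, 3)):
    """Return every n-gram of the token list for each n in the inclusive range."""
    low, high = ngram_range
    return [
        " ".join(tokens[i:i + n])
        for n in range(low, high + 1)
        for i in range(len(tokens) - n + 1)
    ]


tokens = "deep learning models".split()
print(ngrams(tokens))
# ['deep', 'learning', 'models', 'deep learning', 'learning models', 'deep learning models']
```

A wider range produces more candidates, so it tends to trade speed for richer phrases.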

Extraction Methods

python

# TF-IDF (best for comparing documents)
keywords = extractor.extract(text, method="tfidf")

# RAKE (best for key phrases)
keywords = extractor.extract(text, method="rake")

# Frequency (simple word counts)
keywords = extractor.extract(text, method="frequency")
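
To show why TF-IDF suits document comparison, here is a simplified pure-Python sketch. It is not the tool's implementation: the function names `tokenize` and `tfidf_scores` are hypothetical, and the smoothed IDF formula used here is one common variant, chosen as an assumption. Terms that appear in fewer documents receive a higher weight.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_scores(docs):
    """Return one {term: score} dict per document in docs."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    results = []
    for tokens in tokenized:
        tf = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero on empty documents
        results.append({
            # Relative term frequency times a smoothed inverse document frequency.
            term: (count / total) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return results


docs = [
    "neural networks learn features",
    "decision trees split features",
    "neural networks need data",
]
scores = tfidf_scores(docs)
top = sorted(scores[1].items(), key=lambda kv: kv[1], reverse=True)[:3]
print(top)
```

In the second document, "features" also occurs in the first document, so it scores lower than "decision", "trees", and "split", which are unique to that document.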

Results Format

python

keywords = extractor.extract(text)
# Returns a list of tuples: [(keyword, score), ...]
# e.g. [('machine learning', 0.85), ('data science', 0.72), ...]

# Get just keywords
keyword_list = extractor.get_keywords(text)
# e.g. ['machine learning', 'data science', ...]
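
Because the result is a plain list of `(keyword, score)` tuples, ordinary Python is enough to reshape it. The sketch below uses a hard-coded sample in the documented shape; the values are illustrative and the 0.5 threshold is an arbitrary example.

```python
# Hypothetical sample in the documented [(keyword, score), ...] shape.
keywords = [
    ("machine learning", 0.85),
    ("data science", 0.72),
    ("neural networks", 0.41),
]

# Keep only the keyword strings, preserving rank order.
keyword_list = [kw for kw, _ in keywords]

# Build a lookup table from keyword to score.
score_by_keyword = dict(keywords)

# Keep entries above a threshold (0.5 is an arbitrary example value).
strong = [(kw, score) for kw, score in keywords if score > 0.5]

print(keyword_list)
print(score_by_keyword["data science"])
print(strong)
```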

Customization

python

# Add custom stopwords
extractor.add_stopwords(['company', 'product', 'service'])

# Set minimum frequency
extractor.min_frequency = 2

# Filter by part of speech (nouns only)
extractor.pos_filter = ['NN', 'NNS', 'NNP']
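
As a rough picture of how custom stopwords and a minimum frequency could interact, consider the standard-library sketch below. The function `filter_keywords` and the small stopword set are hypothetical and stand in for whatever the tool does internally.

```python
import re
from collections import Counter


def filter_keywords(text, stopwords, min_frequency=2):
    """Count words, then drop stopwords and words seen fewer than min_frequency times."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in stopwords)
    return {w: c for w, c in counts.items() if c >= min_frequency}


# Tiny illustrative base list (an assumption; real lists are much longer).
base_stopwords = {"the", "a", "our", "and", "is"}
custom_stopwords = base_stopwords | {"company", "product", "service"}

text = (
    "Our company ships the product fast. "
    "Fast shipping is the company promise. Shipping matters."
)
print(filter_keywords(text, custom_stopwords, min_frequency=2))
# {'fast': 2, 'shipping': 2}
```

Without the custom entries, "company" would also pass the frequency filter, which is the kind of domain noise custom stopwords are meant to remove.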

Visualization

python

# Generate word cloud
extractor.to_wordcloud("wordcloud.png", colormap="viridis")

# Bar chart of top keywords
extractor.plot_keywords("keywords.png", top_n=15)

Export

python

# To JSON
extractor.to_json("keywords.json")

# To CSV
extractor.to_csv("keywords.csv")

# To plain text
extractor.to_text("keywords.txt")
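
The exact file layouts these methods produce are not specified here. As one plausible shape, the sketch below writes a list of `(keyword, score)` tuples with the standard library. The standalone functions, the `keyword` and `score` column names, and the sample data are all assumptions for illustration.

```python
import csv
import json
import tempfile
from pathlib import Path

# Hypothetical sample in the [(keyword, score), ...] shape.
keywords = [("machine learning", 0.85), ("data science", 0.72)]


def to_json(keywords, path):
    """Write keywords as a JSON array of {"keyword", "score"} objects."""
    records = [{"keyword": kw, "score": score} for kw, score in keywords]
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)


def to_csv(keywords, path):
    """Write a header row followed by one keyword,score row per entry."""
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["keyword", "score"])
        writer.writerows(keywords)


def to_text(keywords, path):
    """Write one keyword per line, without scores."""
    with open(path, "w", encoding="utf-8") as f:
        f.writelines(f"{kw}\n" for kw, _ in keywords)


# Write into a temporary directory so the demo leaves the working directory untouched.
out_dir = Path(tempfile.mkdtemp())
to_json(keywords, out_dir / "keywords.json")
to_csv(keywords, out_dir / "keywords.csv")
to_text(keywords, out_dir / "keywords.txt")
print((out_dir / "keywords.txt").read_text(encoding="utf-8"))
```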

CLI Usage

bash

# Extract from text
python keyword_extractor.py --text "Your text here" --top 10

# Extract from file
python keyword_extractor.py --input document.txt --method tfidf --output keywords.json

# Generate word cloud
python keyword_extractor.py --input document.txt --wordcloud cloud.png

# Batch process directory
python keyword_extractor.py --input-dir ./docs --output keywords_all.csv

CLI Arguments

Argument	Description	Default
`--text`	Text to analyze	-
`--input`	Input file path	-
`--input-dir`	Directory of files	-
`--output`	Output file	-
`--method`	Algorithm (tfidf, rake, frequency)	`tfidf`
`--top`	Number of keywords	20
`--ngrams`	N-gram range (e.g., "1,2")	`1,3`
`--wordcloud`	Generate word cloud	-
`--stopwords`	Custom stopwords file	-
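
For readers who want to see how such an interface might be wired up, here is a sketch of the argument table using `argparse`. It mirrors the flags and defaults listed above, but the parser itself, the helper `parse_ngrams`, and the help strings are assumptions rather than the script's actual code.

```python
import argparse


def parse_ngrams(value):
    """Turn a string such as "1,3" into the tuple (1, 3)."""
    low, high = (int(part) for part in value.split(","))
    return (low, high)


def build_parser():
    parser = argparse.ArgumentParser(description="Extract keywords from text.")
    parser.add_argument("--text", help="Text to analyze")
    parser.add_argument("--input", help="Input file path")
    parser.add_argument("--input-dir", help="Directory of files")
    parser.add_argument("--output", help="Output file")
    parser.add_argument(
        "--method",
        choices=["tfidf", "rake", "frequency"],
        default="tfidf",
        help="Algorithm to use",
    )
    parser.add_argument("--top", type=int, default=20, help="Number of keywords")
    parser.add_argument(
        "--ngrams",
        type=parse_ngrams,
        default=(1, 3),
        help='N-gram range, e.g. "1,2"',
    )
    parser.add_argument("--wordcloud", help="Path for the generated word cloud image")
    parser.add_argument("--stopwords", help="Custom stopwords file")
    return parser


# Parse an explicit argument list instead of sys.argv so the demo runs anywhere.
args = build_parser().parse_args(
    ["--text", "Your text here", "--top", "10", "--ngrams", "1,2"]
)
print(args.method, args.top, args.ngrams)
# tfidf 10 (1, 2)
```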

Examples

Article Keyword Extraction

python

extractor = KeywordExtractor(method="tfidf")

article = """
Machine learning is transforming data science. Deep learning models
are achieving state-of-the-art results in natural language processing
and computer vision. Neural networks continue to advance...
"""

keywords = extractor.extract(article, top_n=10)
for keyword, score in keywords:
    print(f"{score:.3f}: {keyword}")

Compare Multiple Documents

python

from pathlib import Path

extractor = KeywordExtractor(method="tfidf")

# read_text opens and closes each file; UTF-8 encoding is assumed
docs = [
    Path("doc1.txt").read_text(encoding="utf-8"),
    Path("doc2.txt").read_text(encoding="utf-8"),
    Path("doc3.txt").read_text(encoding="utf-8"),
]

# Extract keywords from each
for i, doc in enumerate(docs, start=1):
    keywords = extractor.extract(doc, top_n=5)
    print(f"\nDocument {i}:")
    for kw, score in keywords:
        print(f"  {kw}: {score:.3f}")

SEO Keyword Research

python

from pathlib import Path

extractor = KeywordExtractor(
    method="rake",
    ngram_range=(2, 4),  # Focus on phrases
    max_keywords=30
)

# Assumes page.html is UTF-8 encoded
webpage_content = Path("page.html").read_text(encoding="utf-8")
keywords = extractor.extract(webpage_content)

# Filter by score threshold
high_value = [(kw, s) for kw, s in keywords if s > 0.5]
print("High-value keywords for SEO:")
for kw, _ in high_value:
    print(f"  {kw}")

Algorithm Comparison

Algorithm	Best For	Strengths
TF-IDF	Document comparison	Finds unique terms, good for search
RAKE	Key phrases	Extracts multi-word concepts
Frequency	Quick overview	Simple, fast, interpretable
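
To make the RAKE row more concrete, the following is a compact sketch of the general RAKE idea: split text into candidate phrases at stopwords and punctuation, score each word as degree divided by frequency, and sum word scores per phrase. The function name `rake` and the tiny stopword set are hypothetical, and the tool's own implementation may differ in tokenization and scoring details.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword set (an assumption; real lists are much longer).
STOPWORDS = {"is", "are", "in", "and", "the", "of", "a", "to"}


def rake(text, stopwords=STOPWORDS):
    """Return (phrase, score) pairs sorted by descending score."""
    # Candidate phrases are runs of non-stopwords between punctuation marks.
    phrases = []
    for fragment in re.split(r"[.,;:!?()\n]+", text.lower()):
        current = []
        for word in re.findall(r"[a-z][a-z\-]*", fragment):
            if word in stopwords:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)

    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            # Degree counts co-occurrences within the phrase, the word itself included.
            degree[word] += len(phrase)

    scores = {
        " ".join(phrase): sum(degree[w] / freq[w] for w in phrase)
        for phrase in phrases
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


text = (
    "Deep learning models are achieving results in "
    "natural language processing and computer vision."
)
for phrase, score in rake(text)[:4]:
    print(f"{score:.1f}: {phrase}")
```

In this sketch longer phrases accumulate higher scores, which is why the approach favors multi-word concepts. The scores are not normalized to a 0 to 1 range.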

Dependencies

scikit-learn>=1.2.0
nltk>=3.8.0
pandas>=2.0.0
matplotlib>=3.7.0
wordcloud>=1.9.0

Limitations

Optimized for English (other languages need language-specific stopwords)
Very short texts may not have enough data for TF-IDF
Domain-specific jargon may need custom stopword handling
