article-extractor
This skill extracts the main content from web articles and blog posts, removing navigation, ads, newsletter signups, and other clutter, and saves the clean, readable text to a file.
When to Use This Skill
Activate when the user:
- Provides an article/blog URL and wants the text content
- Asks to "download this article"
- Wants to "extract the content from [URL]"
- Asks to "save this blog post as text"
- Needs clean article text without distractions
How It Works
Priority Order:
1. Check if tools are installed (reader or trafilatura)
2. Download and extract the article using the best available tool
3. Clean up the content (remove extra whitespace, format properly)
4. Save to a file named after the article title
5. Confirm the location and show a preview
Installation Check
Check for article extraction tools in this order:

Option 1: reader (Recommended - Mozilla's Readability)
```bash
command -v reader
```
If not installed:
```bash
npm install -g @mozilla/readability-cli
```
or
```bash
npm install -g reader-cli
```

Option 2: trafilatura (Python-based, very good)
```bash
command -v trafilatura
```
If not installed:
```bash
pip3 install trafilatura
```

Option 3: Fallback (curl + simple parsing)
If no tools are available, use basic curl + text extraction (less reliable, but works without dependencies).
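The three checks above can be collapsed into a single helper that returns the first available tool. This is a sketch; the `pick_tool` name is ours, not part of any CLI:

```bash
#!/usr/bin/env sh
# Return the name of the first available extraction tool,
# falling back to "fallback" when neither CLI is installed.
pick_tool() {
    for t in reader trafilatura; do
        if command -v "$t" >/dev/null 2>&1; then
            echo "$t"
            return 0
        fi
    done
    echo "fallback"
}

pick_tool
```

This mirrors the if/elif chain used in the Complete Workflow section below.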
Extraction Methods

Method 1: Using reader (Best for most articles)
```bash
# Extract article
reader "URL" > article.txt
```
**Pros:**
- Based on Mozilla's Readability algorithm
- Excellent at removing clutter
- Preserves article structure

Method 2: Using trafilatura (Best for blogs/news)
```bash
# Extract article
trafilatura --URL "URL" --output-format txt > article.txt

# Or with more options
trafilatura --URL "URL" --output-format txt --no-comments --no-tables > article.txt
```
**Pros:**
- Very accurate extraction
- Good with various site structures
- Handles multiple languages

**Options:**
- `--no-comments`: Skip comment sections
- `--no-tables`: Skip data tables
- `--precision`: Favor precision over recall
- `--recall`: Extract more content (may include some noise)

Method 3: Fallback (curl + basic parsing)
```bash
# Download and extract basic content
curl -s "URL" | python3 -c "
from html.parser import HTMLParser
import sys

class ArticleExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_content = False
        self.content = []
        self.skip_tags = {'script', 'style', 'nav', 'header', 'footer', 'aside'}
        self.current_tag = None
    def handle_starttag(self, tag, attrs):
        if tag not in self.skip_tags:
            if tag in {'p', 'article', 'main', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'}:
                self.in_content = True
            self.current_tag = tag
    def handle_data(self, data):
        if self.in_content and data.strip():
            self.content.append(data.strip())
    def get_content(self):
        return '\\n\\n'.join(self.content)

parser = ArticleExtractor()
parser.feed(sys.stdin.read())
print(parser.get_content())
" > article.txt
```
**Note:** This is less reliable but works without dependencies.

Getting Article Title
Extract the title for the filename:

Using reader:
```bash
# reader outputs markdown with the title at the top
TITLE=$(reader "URL" | head -n 1 | sed 's/^# //')
```

Using trafilatura:
```bash
# Get metadata including the title
TITLE=$(trafilatura --URL "URL" --json | python3 -c "import json, sys; print(json.load(sys.stdin)['title'])")
```

Using curl (fallback):
```bash
TITLE=$(curl -s "URL" | grep -oP '<title>\K[^<]+' | sed 's/ - .*//' | sed 's/ | .*//')
```
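The curl fallback above relies on `grep -P`, which BSD/macOS grep lacks. A sed-only variant works the same way on one-line `<title>` tags; this is our portability suggestion, not part of the original skill:

```bash
#!/usr/bin/env sh
# Extract the contents of the first <title> tag from HTML on stdin,
# using only POSIX sed (no grep -P required).
get_title() {
    sed -n 's/.*<title>\([^<]*\)<\/title>.*/\1/p' | head -n 1
}

printf '<html><head><title>My Article - Example Site</title></head></html>' | get_title
# prints: My Article - Example Site
```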
Filename Creation
Clean the title for the filesystem:
```bash
# Get title
TITLE="Article Title from Website"

# Clean for filesystem (replace path separators, drop special chars, limit length)
FILENAME=$(echo "$TITLE" | tr '/:|' '---' | tr -d '?"<>' | cut -c 1-100 | sed 's/ *$//')

# Add extension
FILENAME="${FILENAME}.txt"
```

Complete Workflow
```bash
ARTICLE_URL="https://example.com/article"

# Check for tools
if command -v reader &> /dev/null; then
    TOOL="reader"
    echo "Using reader (Mozilla Readability)"
elif command -v trafilatura &> /dev/null; then
    TOOL="trafilatura"
    echo "Using trafilatura"
else
    TOOL="fallback"
    echo "Using fallback method (may be less accurate)"
fi

# Extract article
case $TOOL in
    reader)
        # Get content
        reader "$ARTICLE_URL" > temp_article.txt
        # Get title (first line after # in markdown)
        TITLE=$(head -n 1 temp_article.txt | sed 's/^# //')
        ;;
    trafilatura)
        # Get title from metadata
        METADATA=$(trafilatura --URL "$ARTICLE_URL" --json)
        TITLE=$(echo "$METADATA" | python3 -c "import json, sys; print(json.load(sys.stdin).get('title', 'Article'))")
        # Get clean content
        trafilatura --URL "$ARTICLE_URL" --output-format txt --no-comments > temp_article.txt
        ;;
    fallback)
        # Get title
        TITLE=$(curl -s "$ARTICLE_URL" | grep -oP '<title>\K[^<]+' | head -n 1)
        TITLE=${TITLE%% - *}  # Remove site name
        TITLE=${TITLE%% | *}  # Remove site name (alternate separator)
        # Get content (basic extraction)
        curl -s "$ARTICLE_URL" | python3 -c "
from html.parser import HTMLParser
import sys

class ArticleExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_content = False
        self.content = []
        self.skip_tags = {'script', 'style', 'nav', 'header', 'footer', 'aside', 'form'}
    def handle_starttag(self, tag, attrs):
        if tag not in self.skip_tags:
            if tag in {'p', 'article', 'main'}:
                self.in_content = True
            if tag in {'h1', 'h2', 'h3'}:
                self.content.append('\\n')
    def handle_data(self, data):
        if self.in_content and data.strip():
            self.content.append(data.strip())
    def get_content(self):
        return '\\n\\n'.join(self.content)

parser = ArticleExtractor()
parser.feed(sys.stdin.read())
print(parser.get_content())
" > temp_article.txt
        ;;
esac

# Clean filename
FILENAME=$(echo "$TITLE" | tr '/:|' '---' | tr -d '?"<>' | cut -c 1-80 | sed 's/ *$//' | sed 's/^ *//')
FILENAME="${FILENAME}.txt"

# Move to final filename
mv temp_article.txt "$FILENAME"

# Show result
echo "✓ Extracted article: $TITLE"
echo "✓ Saved to: $FILENAME"
echo ""
echo "Preview (first 10 lines):"
head -n 10 "$FILENAME"
```

Error Handling
Common Issues
1. Tool not installed
- Try alternate tool (reader → trafilatura → fallback)
- Offer to install: "Install reader with: npm install -g reader-cli"
2. Paywall or login required
- Extraction tools may fail
- Inform user: "This article requires authentication. Cannot extract."
3. Invalid URL
- Check URL format
- Try with and without redirects
4. No content extracted
- Site may use heavy JavaScript
- Try fallback method
- Inform user if extraction fails
5. Special characters in title
- Clean the title for the filesystem
- Remove or replace: `/`, `:`, `?`, `"`, `<`, `>`, `|`
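Issues 2 and 3 above can often be caught before extraction with a pre-flight check. This is a hedged sketch; the `check_url` helper and the HTTP 400 threshold are our assumptions, not part of the skill:

```bash
#!/usr/bin/env sh
# Pre-flight check: reject malformed URLs before calling any tool,
# and surface HTTP errors (paywalls often answer 401/403).
check_url() {
    url="$1"
    case "$url" in
        http://*|https://*) ;;  # well-formed scheme
        *) echo "Invalid URL format: $url" >&2; return 1 ;;
    esac
    # -L follows redirects; -o /dev/null discards the body
    status=$(curl -s -L -o /dev/null -w '%{http_code}' "$url")
    if [ "$status" -ge 400 ]; then
        echo "HTTP $status - may require login or not exist" >&2
        return 1
    fi
}
```

If `check_url` fails, fall through the tool chain or report the problem to the user instead of saving an empty file.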
Output Format
Saved File Contains:
- Article title (if available)
- Author (if available from tool)
- Main article text
- Section headings
- No navigation, ads, or clutter
What Gets Removed:
- Navigation menus
- Ads and promotional content
- Newsletter signup forms
- Related articles sidebars
- Comment sections (optional)
- Social media buttons
- Cookie notices
Tips for Best Results
1. Use reader for most articles
- Best all-around tool
- Based on Firefox Reader View
- Works on most news sites and blogs
2. Use trafilatura for:
- Academic articles
- News sites
- Blogs with complex layouts
- Non-English content
3. Fallback method limitations:
- May include some noise
- Less accurate paragraph detection
- Better than nothing for simple sites
4. Check extraction quality:
- Always show preview to user
- Ask if it looks correct
- Offer to try different tool if needed
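Tip 4 ("check extraction quality") can be partly automated. A sketch, with an assumed 50-word threshold (tune to taste; not part of the skill):

```bash
#!/usr/bin/env sh
# Warn when the extracted file looks too short to be a real article.
# The 50-word threshold is a guess, not part of the skill.
check_quality() {
    file="$1"
    words=$(wc -w < "$file" | tr -d ' ')
    if [ "$words" -lt 50 ]; then
        echo "Warning: only $words words extracted - try another tool" >&2
        return 1
    fi
    echo "Extracted $words words"
}
```

A failing check is a good cue to retry with the next tool in the priority order before showing the preview.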
Example Usage

Simple extraction:
```bash
# User: "Extract https://example.com/article"
reader "https://example.com/article" > temp.txt
TITLE=$(head -n 1 temp.txt | sed 's/^# //')
FILENAME="$(echo "$TITLE" | tr '/' '-').txt"
mv temp.txt "$FILENAME"
echo "✓ Saved to: $FILENAME"
```

**With error handling:**
```bash
if ! reader "$URL" > temp.txt 2>/dev/null; then
    if command -v trafilatura &> /dev/null; then
        trafilatura --URL "$URL" --output-format txt > temp.txt
    else
        echo "Error: Could not extract article. Install reader or trafilatura."
        exit 1
    fi
fi
```

Best Practices
- ✅ Always show preview after extraction (first 10 lines)
- ✅ Verify extraction succeeded before saving
- ✅ Clean filename for filesystem compatibility
- ✅ Try fallback method if primary fails
- ✅ Inform user which tool was used
- ✅ Keep filename length reasonable (< 100 chars)
After Extraction
Display to user:
- "✓ Extracted: [Article Title]"
- "✓ Saved to: [filename]"
- Show preview (first 10-15 lines)
- File size and location
Ask if needed:
- "Would you like me to also create a Ship-Learn-Next plan from this?" (if using ship-learn-next skill)
- "Should I extract another article?"