# Article Extractor
This skill extracts the main content from web articles and blog posts, removing navigation, ads, newsletter signups, and other clutter, and saves the result as clean, readable text.
## When to Use This Skill
Activate when the user:
- Provides an article/blog URL and wants the text content
- Asks to "download this article"
- Wants to "extract the content from [URL]"
- Asks to "save this blog post as text"
- Needs clean article text without distractions
## How It Works
**Priority Order:**
1. Check if tools are installed (reader or trafilatura)
2. Download and extract the article using the best available tool
3. Clean up the content (remove extra whitespace, format properly)
4. Save to a file with the article title as the filename
5. Confirm the location and show a preview
## Installation Check
Check for article extraction tools in this order:
### Option 1: reader (Recommended - Mozilla's Readability)
```bash
command -v reader
```

If not installed:

```bash
npm install -g @mozilla/readability-cli
```

or

```bash
npm install -g reader-cli
```

### Option 2: trafilatura (Python-based, very good)
```bash
command -v trafilatura
```

If not installed:

```bash
pip3 install trafilatura
```

### Option 3: Fallback (curl + simple parsing)
If no tools are available, use basic curl + text extraction (less reliable, but it works).
## Extraction Methods
### Method 1: Using reader (Best for most articles)
```bash
# Extract the article
reader "URL" > article.txt
```

**Pros:**
- Based on Mozilla's Readability algorithm
- Excellent at removing clutter
- Preserves article structure

### Method 2: Using trafilatura (Best for blogs/news)
```bash
# Extract the article
trafilatura --URL "URL" --output-format txt > article.txt

# Or with more options
trafilatura --URL "URL" --output-format txt --no-comments --no-tables > article.txt
```

**Pros:**
- Very accurate extraction
- Good with various site structures
- Handles multiple languages

**Options:**
- `--no-comments`: Skip comment sections
- `--no-tables`: Skip data tables
- `--precision`: Favor precision over recall
- `--recall`: Extract more content (may include some noise)

### Method 3: Fallback (curl + basic parsing)
```bash
# Download and extract basic content
curl -s "URL" | python3 -c "
from html.parser import HTMLParser
import sys

class ArticleExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_content = False
        self.content = []
        self.skip_tags = {'script', 'style', 'nav', 'header', 'footer', 'aside'}
        self.current_tag = None

    def handle_starttag(self, tag, attrs):
        if tag not in self.skip_tags:
            if tag in {'p', 'article', 'main', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'}:
                self.in_content = True
                self.current_tag = tag

    def handle_data(self, data):
        if self.in_content and data.strip():
            self.content.append(data.strip())

    def get_content(self):
        return '\\n\\n'.join(self.content)

parser = ArticleExtractor()
parser.feed(sys.stdin.read())
print(parser.get_content())
" > article.txt
```

**Note:** This is less reliable but works without dependencies.

## Getting Article Title
Extract title for filename:
**Using reader:**

```bash
# reader outputs markdown with the title at the top
TITLE=$(reader "URL" | head -n 1 | sed 's/^# //')
```

**Using trafilatura:**
```bash
# Get metadata including the title
TITLE=$(trafilatura --URL "URL" --json | python3 -c "import json, sys; print(json.load(sys.stdin)['title'])")
```

**Using curl (fallback):**
```bash
TITLE=$(curl -s "URL" | grep -oP '<title>\K[^<]+' | sed 's/ - .*//' | sed 's/ | .*//')
```

## Filename Creation
Clean the title for the filesystem:

```bash
# Get title
TITLE="Article Title from Website"

# Clean for filesystem (replace or drop special chars, limit length)
FILENAME=$(echo "$TITLE" | tr '/:|' '---' | tr -d '?"<>' | cut -c 1-100 | sed 's/ *$//')

# Add extension
FILENAME="${FILENAME}.txt"
```

## Complete Workflow
```bash
ARTICLE_URL="https://example.com/article"

# Check for tools
if command -v reader &> /dev/null; then
    TOOL="reader"
    echo "Using reader (Mozilla Readability)"
elif command -v trafilatura &> /dev/null; then
    TOOL="trafilatura"
    echo "Using trafilatura"
else
    TOOL="fallback"
    echo "Using fallback method (may be less accurate)"
fi

# Extract article
case $TOOL in
    reader)
        # Get content
        reader "$ARTICLE_URL" > temp_article.txt
        # Get title (first line after # in markdown)
        TITLE=$(head -n 1 temp_article.txt | sed 's/^# //')
        ;;
    trafilatura)
        # Get title from metadata
        METADATA=$(trafilatura --URL "$ARTICLE_URL" --json)
        TITLE=$(echo "$METADATA" | python3 -c "import json, sys; print(json.load(sys.stdin).get('title', 'Article'))")
        # Get clean content
        trafilatura --URL "$ARTICLE_URL" --output-format txt --no-comments > temp_article.txt
        ;;
    fallback)
        # Get title
        TITLE=$(curl -s "$ARTICLE_URL" | grep -oP '<title>\K[^<]+' | head -n 1)
        TITLE=${TITLE%% - *}   # Remove site name
        TITLE=${TITLE%% | *}   # Remove site name (alternate)
        # Get content (basic extraction)
        curl -s "$ARTICLE_URL" | python3 -c "
from html.parser import HTMLParser
import sys

class ArticleExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_content = False
        self.content = []
        self.skip_tags = {'script', 'style', 'nav', 'header', 'footer', 'aside', 'form'}

    def handle_starttag(self, tag, attrs):
        if tag not in self.skip_tags:
            if tag in {'p', 'article', 'main'}:
                self.in_content = True
            if tag in {'h1', 'h2', 'h3'}:
                self.content.append('\\n')

    def handle_data(self, data):
        if self.in_content and data.strip():
            self.content.append(data.strip())

    def get_content(self):
        return '\\n\\n'.join(self.content)

parser = ArticleExtractor()
parser.feed(sys.stdin.read())
print(parser.get_content())
" > temp_article.txt
        ;;
esac

# Clean filename
FILENAME=$(echo "$TITLE" | tr '/:|' '---' | tr -d '?"<>' | cut -c 1-80 | sed 's/ *$//' | sed 's/^ *//')
FILENAME="${FILENAME}.txt"

# Move to final filename
mv temp_article.txt "$FILENAME"

# Show result
echo "✓ Extracted article: $TITLE"
echo "✓ Saved to: $FILENAME"
echo ""
echo "Preview (first 10 lines):"
head -n 10 "$FILENAME"
```

## Error Handling
### Common Issues
1. **Tool not installed**
   - Try the alternate tool (reader → trafilatura → fallback)
   - Offer to install: "Install reader with: npm install -g reader-cli"
2. **Paywall or login required**
   - Extraction tools may fail
   - Inform the user: "This article requires authentication. Cannot extract."
3. **Invalid URL**
   - Check the URL format
   - Try with and without redirects
4. **No content extracted**
   - The site may rely heavily on JavaScript
   - Try the fallback method
   - Inform the user if extraction fails
5. **Special characters in title**
   - Clean the title for the filesystem
   - Remove or replace: `/`, `:`, `?`, `"`, `<`, `>`, `|`
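Issue 4 can be caught programmatically before saving. A minimal sketch, assuming the `temp_article.txt` filename from the workflow above; the `has_content` helper name is illustrative, not part of any tool:

```shell
# Hypothetical helper: succeeds only if the file exists and
# contains at least one non-whitespace character
has_content() {
    [ -s "$1" ] && grep -q '[^[:space:]]' "$1"
}

if ! has_content temp_article.txt; then
    echo "Extraction produced no content - the site may require JavaScript or a login."
fi
```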
## Output Format
**Saved File Contains:**
- Article title (if available)
- Author (if available from tool)
- Main article text
- Section headings
- No navigation, ads, or clutter
**What Gets Removed:**
- Navigation menus
- Ads and promotional content
- Newsletter signup forms
- Related articles sidebars
- Comment sections (optional)
- Social media buttons
- Cookie notices
## Tips for Best Results
1. Use reader for most articles
- Best all-around tool
- Based on Firefox Reader View
- Works on most news sites and blogs
2. Use trafilatura for:
- Academic articles
- News sites
- Blogs with complex layouts
- Non-English content
3. Fallback method limitations:
- May include some noise
- Less accurate paragraph detection
- Better than nothing for simple sites
4. Check extraction quality:
- Always show preview to user
- Ask if it looks correct
- Offer to try different tool if needed
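Tip 4 can be partly automated before showing the preview. A hedged sketch: the `check_quality` name and the 50-word threshold are arbitrary choices, not part of reader or trafilatura:

```shell
# Warn when an extraction looks too short to be a real article
check_quality() {
    local words
    words=$(wc -w < "$1")
    if [ "$words" -lt 50 ]; then
        echo "Warning: only $words words extracted - the result may be incomplete."
    fi
}
```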
## Example Usage
**Simple extraction:**

```bash
# User: "Extract https://example.com/article"
reader "https://example.com/article" > temp.txt
TITLE=$(head -n 1 temp.txt | sed 's/^# //')
FILENAME="$(echo "$TITLE" | tr '/' '-').txt"
mv temp.txt "$FILENAME"
echo "✓ Saved to: $FILENAME"
```

**With error handling:**

```bash
if ! reader "$URL" > temp.txt 2>/dev/null; then
    if command -v trafilatura &> /dev/null; then
        trafilatura --URL "$URL" --output-format txt > temp.txt
    else
        echo "Error: Could not extract article. Install reader or trafilatura."
        exit 1
    fi
fi
```

## Best Practices
- ✅ Always show preview after extraction (first 10 lines)
- ✅ Verify extraction succeeded before saving
- ✅ Clean filename for filesystem compatibility
- ✅ Try fallback method if primary fails
- ✅ Inform user which tool was used
- ✅ Keep filename length reasonable (< 100 chars)
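Several of these practices can be combined into one small reporting helper. A sketch under the assumption of a POSIX shell; the `show_result` name is illustrative:

```shell
# Report the saved file, its size in bytes, and a 10-line preview
show_result() {
    echo "✓ Saved to: $1 ($(wc -c < "$1") bytes)"
    echo "Preview (first 10 lines):"
    head -n 10 "$1"
}
```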
## After Extraction
**Display to user:**
- "✓ Extracted: [Article Title]"
- "✓ Saved to: [filename]"
- Show preview (first 10-15 lines)
- File size and location
**Ask if needed:**
- "Would you like me to also create a Ship-Learn-Next plan from this?" (if using ship-learn-next skill)
- "Should I extract another article?"