article-extractor

Article Extractor


This skill extracts the main content from web articles and blog posts, removing navigation, ads, newsletter signups, and other clutter, and saves the result as clean, readable text.

When to Use This Skill


Activate when the user:
  • Provides an article/blog URL and wants the text content
  • Asks to "download this article"
  • Wants to "extract the content from [URL]"
  • Asks to "save this blog post as text"
  • Needs clean article text without distractions

How It Works


Priority Order:


  1. Check if tools are installed (reader or trafilatura)
  2. Download and extract article using best available tool
  3. Clean up the content (remove extra whitespace, format properly)
  4. Save to file with article title as filename
  5. Confirm location and show preview

Installation Check


Check for article extraction tools in this order:

Option 1: reader (Recommended - Mozilla's Readability)


```bash
command -v reader
```

If not installed:

```bash
npm install -g @mozilla/readability-cli
```

or

```bash
npm install -g reader-cli
```

Option 2: trafilatura (Python-based, very good)


```bash
command -v trafilatura
```

If not installed:

```bash
pip3 install trafilatura
```

Option 3: Fallback (curl + simple parsing)


If neither tool is available, fall back to basic `curl` + text extraction (less reliable, but it works without dependencies).

Extraction Methods


Method 1: Using reader (Best for most articles)


```bash
# Extract article
reader "URL" > article.txt
```

**Pros:**
- Based on Mozilla's Readability algorithm
- Excellent at removing clutter
- Preserves article structure

Method 2: Using trafilatura (Best for blogs/news)


```bash
# Extract article
trafilatura --URL "URL" --output-format txt > article.txt

# Or with more options
trafilatura --URL "URL" --output-format txt --no-comments --no-tables > article.txt
```

**Pros:**
- Very accurate extraction
- Good with various site structures
- Handles multiple languages

**Options:**
- `--no-comments`: Skip comment sections
- `--no-tables`: Skip data tables
- `--precision`: Favor precision over recall
- `--recall`: Extract more content (may include some noise)

Method 3: Fallback (curl + basic parsing)


```bash
# Download and extract basic content
curl -s "URL" | python3 -c "
from html.parser import HTMLParser
import sys

class ArticleExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_content = False
        self.content = []
        self.skip_tags = {'script', 'style', 'nav', 'header', 'footer', 'aside'}
        self.current_tag = None

    def handle_starttag(self, tag, attrs):
        if tag not in self.skip_tags:
            if tag in {'p', 'article', 'main', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'}:
                self.in_content = True
                self.current_tag = tag

    def handle_data(self, data):
        if self.in_content and data.strip():
            self.content.append(data.strip())

    def get_content(self):
        return '\n\n'.join(self.content)

parser = ArticleExtractor()
parser.feed(sys.stdin.read())
print(parser.get_content())
" > article.txt
```

**Note:** This is less reliable but works without dependencies.
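To see what this fallback parser keeps and what leaks through, you can pipe a small inline snippet through the same class instead of a live page (the HTML below is made up for illustration). Note that `in_content` is never reset by an end tag, so text appearing after the article, like the footer here, survives extraction — one concrete reason this method is less reliable:

```shell
printf '<nav>Menu</nav><article><p>Hello world.</p></article><footer>Legal</footer>' \
| python3 -c "
from html.parser import HTMLParser
import sys

class ArticleExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_content = False
        self.content = []
        self.skip_tags = {'script', 'style', 'nav', 'header', 'footer', 'aside'}

    def handle_starttag(self, tag, attrs):
        # once any content tag is seen, everything after is captured
        if tag not in self.skip_tags and tag in {'p', 'article', 'main'}:
            self.in_content = True

    def handle_data(self, data):
        if self.in_content and data.strip():
            self.content.append(data.strip())

    def get_content(self):
        return '\n\n'.join(self.content)

p = ArticleExtractor()
p.feed(sys.stdin.read())
print(p.get_content())
"
```

The nav text is correctly dropped (it precedes any content tag), but the footer text is kept.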

Getting Article Title


Extract title for filename:

Using reader:


```bash
# reader outputs markdown with the title at the top
TITLE=$(reader "URL" | head -n 1 | sed 's/^# //')
```

Using trafilatura:


```bash
# Get metadata including the title
TITLE=$(trafilatura --URL "URL" --json | python3 -c "import json, sys; print(json.load(sys.stdin)['title'])")
```

Using curl (fallback):


```bash
TITLE=$(curl -s "URL" | grep -oP '<title>\K[^<]+' | sed 's/ - .*//' | sed 's/ | .*//')
```
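Note that `grep -oP` depends on PCRE support, a GNU extension; BSD grep (the macOS default) has no `-P`. A portable `sed` alternative is sketched below against an inline sample rather than a live `curl` fetch, and it assumes the `<title>` element fits on one line:

```shell
# Stand-in for: curl -s "URL" | sed -n ...
TITLE=$(printf '<head><title>My Post - Site</title></head>' \
  | sed -n 's/.*<title>\([^<]*\)<\/title>.*/\1/p' \
  | sed 's/ - .*//' | sed 's/ | .*//')   # strip trailing site name
echo "$TITLE"
```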

Filename Creation


Clean the title for the filesystem:

```bash
# Get title
TITLE="Article Title from Website"

# Clean for filesystem (remove special chars, limit length)
# Note: tr needs -d to delete characters; an empty second set is an error
FILENAME=$(echo "$TITLE" | tr '/' '-' | tr ':' '-' | tr -d '?"<>' | tr '|' '-' | cut -c 1-100 | sed 's/ *$//')

# Add extension
FILENAME="${FILENAME}.txt"
```
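A quick sanity check of that cleaning pipeline on a made-up title containing every reserved character:

```shell
TITLE='Foo/Bar: How? "Baz" <Qux> | 2024'
# / and : become -, ? " < > are deleted, | becomes -
FILENAME=$(echo "$TITLE" | tr '/' '-' | tr ':' '-' | tr -d '?"<>' | tr '|' '-' | cut -c 1-100 | sed 's/ *$//')
echo "${FILENAME}.txt"   # → Foo-Bar- How Baz Qux - 2024.txt
```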

Complete Workflow


```bash
ARTICLE_URL="https://example.com/article"

# Check for tools
if command -v reader &> /dev/null; then
    TOOL="reader"
    echo "Using reader (Mozilla Readability)"
elif command -v trafilatura &> /dev/null; then
    TOOL="trafilatura"
    echo "Using trafilatura"
else
    TOOL="fallback"
    echo "Using fallback method (may be less accurate)"
fi

# Extract article
case $TOOL in
    reader)
        # Get content
        reader "$ARTICLE_URL" > temp_article.txt
        # Get title (first line after # in markdown)
        TITLE=$(head -n 1 temp_article.txt | sed 's/^# //')
        ;;
    trafilatura)
        # Get title from metadata
        METADATA=$(trafilatura --URL "$ARTICLE_URL" --json)
        TITLE=$(echo "$METADATA" | python3 -c "import json, sys; print(json.load(sys.stdin).get('title', 'Article'))")
        # Get clean content
        trafilatura --URL "$ARTICLE_URL" --output-format txt --no-comments > temp_article.txt
        ;;
    fallback)
        # Get title
        TITLE=$(curl -s "$ARTICLE_URL" | grep -oP '<title>\K[^<]+' | head -n 1)
        TITLE=${TITLE%% - *}   # Remove site name
        TITLE=${TITLE%% | *}   # Remove site name (alternate)
        # Get content (basic extraction)
        curl -s "$ARTICLE_URL" | python3 -c "
from html.parser import HTMLParser
import sys

class ArticleExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_content = False
        self.content = []
        self.skip_tags = {'script', 'style', 'nav', 'header', 'footer', 'aside', 'form'}

    def handle_starttag(self, tag, attrs):
        if tag not in self.skip_tags:
            if tag in {'p', 'article', 'main'}:
                self.in_content = True
            if tag in {'h1', 'h2', 'h3'}:
                self.content.append('\n')

    def handle_data(self, data):
        if self.in_content and data.strip():
            self.content.append(data.strip())

    def get_content(self):
        return '\n\n'.join(self.content)

parser = ArticleExtractor()
parser.feed(sys.stdin.read())
print(parser.get_content())
" > temp_article.txt
        ;;
esac

# Clean filename
FILENAME=$(echo "$TITLE" | tr '/' '-' | tr ':' '-' | tr -d '?"<>' | tr '|' '-' | cut -c 1-80 | sed 's/ *$//' | sed 's/^ *//')
FILENAME="${FILENAME}.txt"

# Move to final filename
mv temp_article.txt "$FILENAME"

# Show result
echo "✓ Extracted article: $TITLE"
echo "✓ Saved to: $FILENAME"
echo ""
echo "Preview (first 10 lines):"
head -n 10 "$FILENAME"
```

Error Handling


Common Issues


1. Tool not installed
  • Try alternate tool (reader → trafilatura → fallback)
  • Offer to install: "Install reader with: npm install -g reader-cli"
2. Paywall or login required
  • Extraction tools may fail
  • Inform user: "This article requires authentication. Cannot extract."
3. Invalid URL
  • Check URL format
  • Try with and without redirects
4. No content extracted
  • Site may use heavy JavaScript
  • Try fallback method
  • Inform user if extraction fails
5. Special characters in title
  • Clean title for filesystem
  • Remove: `/`, `:`, `?`, `"`, `<`, `>`, `|`
  • Replace with `-` or remove
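The URL check from item 3 can be sketched in a few lines of portable shell; `$URL` stands in for whatever the user supplied, and the commented `curl` line (using `-L` to follow redirects and `-I` to fetch headers only) is how you would probe reachability against a live site:

```shell
URL="https://example.com/article"   # stand-in for the user-supplied URL

# Reject anything that is not an http(s) URL before fetching
if printf '%s' "$URL" | grep -qE '^https?://'; then
    echo "URL format looks valid"
else
    echo "Error: not an http(s) URL" >&2
fi

# Probe reachability, following redirects:
# curl -sIL -o /dev/null -w '%{http_code}\n' "$URL"
```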

Output Format


Saved File Contains:


  • Article title (if available)
  • Author (if available from tool)
  • Main article text
  • Section headings
  • No navigation, ads, or clutter

What Gets Removed:


  • Navigation menus
  • Ads and promotional content
  • Newsletter signup forms
  • Related articles sidebars
  • Comment sections (optional)
  • Social media buttons
  • Cookie notices

Tips for Best Results


1. Use reader for most articles
  • Best all-around tool
  • Based on Firefox Reader View
  • Works on most news sites and blogs
2. Use trafilatura for:
  • Academic articles
  • News sites
  • Blogs with complex layouts
  • Non-English content
3. Fallback method limitations:
  • May include some noise
  • Less accurate paragraph detection
  • Better than nothing for simple sites
4. Check extraction quality:
  • Always show preview to user
  • Ask if it looks correct
  • Offer to try different tool if needed

Example Usage


Simple extraction:

```bash
reader "https://example.com/article" > temp.txt
TITLE=$(head -n 1 temp.txt | sed 's/^# //')
FILENAME="$(echo "$TITLE" | tr '/' '-').txt"
mv temp.txt "$FILENAME"
echo "✓ Saved to: $FILENAME"
```

**With error handling:**

```bash
if ! reader "$URL" > temp.txt 2>/dev/null; then
    if command -v trafilatura &> /dev/null; then
        trafilatura --URL "$URL" --output-format txt > temp.txt
    else
        echo "Error: Could not extract article. Install reader or trafilatura."
        exit 1
    fi
fi
```

Best Practices


  • ✅ Always show preview after extraction (first 10 lines)
  • ✅ Verify extraction succeeded before saving
  • ✅ Clean filename for filesystem compatibility
  • ✅ Try fallback method if primary fails
  • ✅ Inform user which tool was used
  • ✅ Keep filename length reasonable (< 100 chars)
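The "verify extraction succeeded before saving" practice amounts to an emptiness check on the temp file before the `mv`. A minimal sketch, using the `temp_article.txt` name from the workflow above and a `printf` stand-in for real extractor output:

```shell
# Stand-in for output produced by reader/trafilatura/fallback
printf 'Title\n\nBody text.\n' > temp_article.txt

# -s is true only if the file exists and is non-empty
if [ -s temp_article.txt ]; then
    echo "Extraction OK ($(wc -c < temp_article.txt | tr -d ' ') bytes)"
else
    echo "Extraction produced no content; trying the next tool" >&2
fi
```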

After Extraction


Display to user:
  1. "✓ Extracted: [Article Title]"
  2. "✓ Saved to: [filename]"
  3. Show preview (first 10-15 lines)
  4. File size and location
Ask if needed:
  • "Would you like me to also create a Ship-Learn-Next plan from this?" (if using ship-learn-next skill)
  • "Should I extract another article?"