# Web Scraper

## Overview

Recursively scrape web pages with concurrent processing, extracting clean text content while following links. The scraper automatically handles URL deduplication, creates proper directory hierarchies based on URL structure, filters out unwanted content, and respects domain boundaries.

## When to Use This Skill

Use this skill when users request:

- Scraping content from websites
- Downloading documentation from online sources
- Extracting text from web pages at scale
- Crawling websites to gather information
- Archiving web content locally
- Following and downloading linked pages
- Collecting research data from web sources
- Building text datasets from websites

## Prerequisites

Install required dependencies:

```bash
pip install aiohttp beautifulsoup4 lxml aiofiles
```

These libraries provide:

- `aiohttp` - async HTTP client for concurrent requests
- `beautifulsoup4` - HTML parsing and content extraction
- `lxml` - fast HTML/XML parser
- `aiofiles` - async file I/O

## Core Capabilities

### 1. Basic Single-Page Scraping

Scrape a single page without following links:

```bash
python scripts/scrape.py <URL> <output-directory> --depth 0
```

Example:

```bash
python scripts/scrape.py https://example.com/article output/
```

This downloads only the specified page, extracts clean text content, and saves it to `output/example.com/article.txt`.

### 2. Recursive Scraping with Link Following

Scrape a page and follow links up to a specified depth:

```bash
python scripts/scrape.py <URL> <output-directory> --depth <N>
```

Example:

```bash
python scripts/scrape.py https://docs.example.com output/ --depth 2
```

Depth levels:

- `--depth 0` - only the start URL(s)
- `--depth 1` - start URLs plus all links on those pages
- `--depth 2` - start URLs, their links, and links found on those linked pages
- `--depth 3` and beyond - continue following links to the specified depth
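The depth levels above amount to a bounded breadth-first traversal of the link graph. The sketch below is an illustrative, synchronous simplification (the real `scrape.py` fetches pages concurrently); `get_links` is a hypothetical stand-in for "fetch a page and extract its links".

```python
from collections import deque

def crawl_to_depth(start_url, get_links, max_depth):
    """Breadth-first crawl: visit start_url, then links, out to max_depth."""
    visited = {start_url}
    order = []
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)            # "fetch" this page
        if depth >= max_depth:
            continue                 # do not enqueue links past the depth limit
        for link in get_links(url):
            if link not in visited:  # deduplicate before queueing
                visited.add(link)
                queue.append((link, depth + 1))
    return order

# Toy link graph: with --depth 1, only the start page and its direct links are fetched.
graph = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
pages = crawl_to_depth("a", lambda u: graph.get(u, []), max_depth=1)
# → ["a", "b", "c"]  ("d" sits at depth 2, so it is not fetched)
```

With `max_depth=2` the same graph yields `["a", "b", "c", "d"]`, matching the `--depth 2` description above.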

### 3. Limiting the Number of Pages

Prevent excessive scraping by setting a maximum page limit:

```bash
python scripts/scrape.py <URL> <output-directory> --depth 3 --max-pages 100
```

Example:

```bash
python scripts/scrape.py https://docs.example.com output/ --depth 3 --max-pages 50
```

Useful for:

- Testing scraper configuration before a full run
- Limiting resource usage
- Sampling content from large sites
- Staying within rate limits

### 4. Concurrent Processing

Control the number of simultaneous requests for faster scraping:

```bash
python scripts/scrape.py <URL> <output-directory> --concurrent <N>
```

Example:

```bash
python scripts/scrape.py https://docs.example.com output/ --depth 2 --concurrent 20
```

The default is 10 concurrent requests. Increase it for faster scraping, or decrease it for more conservative resource usage.

Guidelines:

- Small sites or slow servers: `--concurrent 5`
- Medium sites: `--concurrent 10` (default)
- Large, fast sites: `--concurrent 20` to `30`
- Be respectful of server resources
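Concurrency caps like `--concurrent` are typically enforced with an `asyncio.Semaphore`. The sketch below is illustrative (it may not match `scrape.py` internally) and uses `asyncio.sleep` as a stand-in for a real `aiohttp` GET so it runs without network access; the `in_flight`/`peak` counters exist only to demonstrate that the cap holds.

```python
import asyncio

async def fetch(url, semaphore, in_flight, peak):
    # The semaphore caps how many "requests" run at once, mirroring --concurrent.
    async with semaphore:
        in_flight[0] += 1
        peak[0] = max(peak[0], in_flight[0])
        await asyncio.sleep(0.01)  # stand-in for an HTTP GET
        in_flight[0] -= 1
        return url

async def scrape_all(urls, concurrent=10):
    semaphore = asyncio.Semaphore(concurrent)
    in_flight, peak = [0], [0]
    results = await asyncio.gather(
        *(fetch(u, semaphore, in_flight, peak) for u in urls)
    )
    return results, peak[0]

urls = [f"https://example.com/page{i}" for i in range(10)]
results, peak = asyncio.run(scrape_all(urls, concurrent=3))
# All 10 tasks are scheduled together, yet no more than 3 run at any moment.
```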

### 5. Domain Restrictions

By default, the scraper only follows links on the same domain as the start URL. This behavior can be controlled:

Same domain only (default):

```bash
python scripts/scrape.py https://example.com output/ --depth 2
```

Follow external links:

```bash
python scripts/scrape.py https://example.com output/ --depth 2 --follow-external
```

Specify allowed domains:

```bash
python scripts/scrape.py https://example.com output/ --depth 2 --allowed-domains example.com docs.example.com blog.example.com
```

Use `--allowed-domains` when:

- Documentation is split across multiple subdomains
- Content spans related domains
- You want to limit crawling to specific trusted domains
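The three modes above boil down to a per-link domain check. A minimal sketch mirroring the CLI flags (the function and its signature are illustrative, not the actual `scrape.py` API):

```python
from urllib.parse import urlparse

def domain_allowed(url, start_domain, allowed_domains=None, follow_external=False):
    """Decide whether a discovered link should be followed."""
    if follow_external:
        return True  # --follow-external: no domain restriction
    domain = urlparse(url).netloc.lower()
    if allowed_domains:
        # --allowed-domains: explicit whitelist
        return domain in {d.lower() for d in allowed_domains}
    return domain == start_domain  # default: same domain only

assert domain_allowed("https://example.com/docs", "example.com")
assert not domain_allowed("https://other.com/page", "example.com")
assert domain_allowed("https://docs.example.com/x", "example.com",
                      allowed_domains=["example.com", "docs.example.com"])
```

Note that with same-domain-only checking, subdomains such as `docs.example.com` do not match `example.com`; that is exactly the case `--allowed-domains` is for.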

### 6. Multiple Start URLs

Scrape from multiple starting points simultaneously:

```bash
python scripts/scrape.py <URL1> <URL2> <URL3> <output-directory>
```

Example:

```bash
python scripts/scrape.py https://example.com/docs https://example.com/guides https://example.com/tutorials output/ --depth 2
```

All start URLs are processed with the same configuration (depth, domain restrictions, and so on).

### 7. Request Configuration

Customize HTTP request behavior:

```bash
python scripts/scrape.py <URL> <output-directory> --user-agent "MyBot/1.0" --timeout 60
```

Options:

- `--user-agent` - custom User-Agent header (default: `"Mozilla/5.0 (compatible; WebScraper/1.0)"`)
- `--timeout` - request timeout in seconds (default: 30)

Example:

```bash
python scripts/scrape.py https://example.com output/ --depth 2 --user-agent "MyResearchBot/1.0 (+https://mysite.com/bot)" --timeout 45
```

### 8. Verbose Output

Enable detailed logging to monitor scraping progress:

```bash
python scripts/scrape.py <URL> <output-directory> --verbose
```

Verbose mode shows:

- Each URL being fetched
- Successful saves with file paths
- Errors and timeouts, with detailed error information

## Output Structure

### Directory Hierarchy

The scraper creates a directory hierarchy that mirrors the URL structure:

```
output/
├── example.com/
│   ├── index.txt              # https://example.com/
│   ├── about.txt              # https://example.com/about
│   ├── docs/
│   │   ├── index.txt          # https://example.com/docs/
│   │   ├── getting-started.txt
│   │   └── api/
│   │       └── reference.txt
│   └── blog/
│       ├── post-1.txt
│       └── post-2.txt
├── docs.example.com/
│   └── guide.txt
└── _metadata.json
```
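The URL-to-path mapping shown in the tree can be sketched as follows. This is an illustrative helper, not the actual logic in `scrape.py` (which may sanitize names more aggressively); it captures the two conventions visible above: directory-style URLs become `index.txt`, and other paths get a `.txt` suffix.

```python
from urllib.parse import urlparse
from pathlib import PurePosixPath

def url_to_path(url, output_dir="output"):
    """Map a URL onto the mirrored output layout (illustrative sketch)."""
    parsed = urlparse(url)
    path = parsed.path.strip("/")
    if not path or parsed.path.endswith("/"):
        # https://example.com/docs/ -> output/example.com/docs/index.txt
        name = PurePosixPath(output_dir, parsed.netloc, path, "index.txt")
    else:
        # https://example.com/about -> output/example.com/about.txt
        name = PurePosixPath(output_dir, parsed.netloc, path).with_suffix(".txt")
    return str(name)
```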

### File Format

Each scraped page is saved as a text file with the following structure:

```
URL: https://example.com/docs/guide
Title: Getting Started Guide
Scraped: 2025-10-21T14:30:00

================================================================================

[Clean extracted text content]
```
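Rendering that layout takes only a few lines. A hypothetical formatter (the real one lives inside `scrape.py`):

```python
from datetime import datetime

def format_page(url, title, text):
    """Render the per-page file layout: header lines, a separator, then the body."""
    header = (
        f"URL: {url}\n"
        f"Title: {title}\n"
        f"Scraped: {datetime.now().isoformat(timespec='seconds')}\n"
    )
    return header + "\n" + "=" * 80 + "\n\n" + text
```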

### Metadata File

`_metadata.json` contains scraping session information:

```json
{
  "start_time": "2025-10-21T14:30:00",
  "end_time": "2025-10-21T14:35:30",
  "pages_scraped": 42,
  "total_visited": 45,
  "errors": {
    "https://example.com/broken": "HTTP 404",
    "https://example.com/slow": "Timeout"
  }
}
```

## Content Extraction and Filtering

### What Gets Extracted

The scraper extracts clean text content by:

1. Focusing on main content - prioritizes `<main>`, `<article>`, or `<body>` tags
2. Removing unwanted elements - strips out:
   - Scripts and styles
   - Navigation menus
   - Headers and footers
   - Sidebars (`<aside>` tags)
   - Iframes and embedded content
   - SVG graphics
   - Comments
3. Filtering common patterns - removes:
   - Cookie consent messages
   - Privacy policy links
   - Terms of service boilerplate
   - UI elements (arrows, isolated numbers)
   - Very short lines (likely navigation items)
4. Preserving structure - maintains line breaks between content blocks
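Steps 1-2 above can be sketched with BeautifulSoup (already required under Prerequisites). This is a simplified illustration of the technique, not the exact code in `scrape.py`; the tag list and fallback order follow the description above.

```python
from bs4 import BeautifulSoup

UNWANTED_TAGS = ["script", "style", "nav", "header", "footer", "aside", "iframe", "svg"]

def extract_text(html):
    """Strip unwanted elements, then pull text from the main-content root."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(UNWANTED_TAGS):   # find_all shorthand
        tag.decompose()               # remove the element and its subtree
    # Prefer <main>, then <article>, then fall back to <body>.
    root = soup.find("main") or soup.find("article") or soup.body or soup
    return "\n".join(
        line.strip() for line in root.get_text("\n").splitlines() if line.strip()
    )

html = """
<html><body>
  <nav>Home | Docs</nav>
  <main><h1>Guide</h1><p>Useful content.</p></main>
  <footer>© 2025</footer>
</body></html>
"""
print(extract_text(html))  # prints "Guide" then "Useful content."
```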

### What Gets Filtered Out

Common unwanted patterns automatically removed:

- "Accept cookies" / "Reject all"
- "Cookie settings"
- "Privacy policy"
- "Terms of service"
- Navigation arrows (←, →, ↑, ↓)
- Isolated numbers
- Lines shorter than 3 characters
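A line filter implementing these rules might look like the sketch below. The pattern list is hypothetical, modeled only on the examples above; the real filter in `scrape.py` may differ.

```python
import re

# Hypothetical boilerplate phrases, taken from the examples above.
BOILERPLATE = re.compile(
    r"^(accept cookies|reject all|cookie settings|privacy policy|terms of service)$",
    re.IGNORECASE,
)
ARROWS = set("←→↑↓")

def keep_line(line):
    """Return True if the line looks like real content."""
    line = line.strip()
    if len(line) < 3:                     # very short lines
        return False
    if line.isdigit():                    # isolated numbers
        return False
    if BOILERPLATE.match(line):           # cookie/privacy/terms boilerplate
        return False
    if all(ch in ARROWS for ch in line):  # navigation arrows
        return False
    return True

lines = ["Accept cookies", "→", "123", "A real paragraph of content."]
kept = [l for l in lines if keep_line(l)]
# → ["A real paragraph of content."]
```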

## Common Usage Patterns

### Download Documentation Site

Scrape an entire documentation site with reasonable limits:

```bash
python scripts/scrape.py https://docs.example.com docs-archive/ --depth 3 --max-pages 200 --concurrent 15
```

### Archive a Blog

Download all posts from a blog (following pagination):

```bash
python scripts/scrape.py https://blog.example.com blog-archive/ --depth 2 --max-pages 500
```

### Research Data Collection

Gather text content from multiple related sources:

```bash
python scripts/scrape.py https://research.edu/papers https://research.edu/publications research-data/ --depth 2 --allowed-domains research.edu --concurrent 20
```

### Sample a Large Site

Test the configuration on a small sample before a full scrape:

```bash
python scripts/scrape.py https://large-site.com sample/ --depth 2 --max-pages 20 --verbose
```

Then run the full scrape after confirming the results:

```bash
python scripts/scrape.py https://large-site.com full-archive/ --depth 3 --max-pages 500 --concurrent 15
```

### Multi-Domain Knowledge Base

Scrape across multiple authorized domains:

```bash
python scripts/scrape.py https://main.example.com knowledge-base/ --depth 3 --allowed-domains main.example.com docs.example.com wiki.example.com --max-pages 300
```

## Implementation Approach

When users request web scraping:

1. Identify the scope:
   - What URLs should the scrape start from?
   - Should links be followed? How deep?
   - Are any domain restrictions needed?
   - Is there a reasonable page limit?
2. Configure the scraper:
   - Set an appropriate depth (typically 1-3)
   - Set `--max-pages` to avoid runaway scraping
   - Choose a concurrency level based on site size
   - Determine domain restrictions
3. Run with monitoring:
   - Start with verbose mode or a small sample
   - Monitor output for errors or unexpected content
   - Adjust the configuration if needed
4. Verify the output:
   - Check the output directory structure
   - Review `_metadata.json` for statistics
   - Sample a few text files for quality
   - Check for errors in the metadata
5. Process the content:
   - Text files are ready for loading into context
   - Use the Read tool to examine specific files
   - Use Grep to search across all scraped content
   - Load files as needed for analysis

## Quick Reference

Command structure:

```bash
python scripts/scrape.py <URL> [URL2 ...] <output-dir> [options]
```

Essential options:

- `-d, --depth N` - maximum link depth (default: 2)
- `-m, --max-pages N` - maximum pages to scrape
- `-c, --concurrent N` - concurrent requests (default: 10)
- `-f, --follow-external` - follow external links
- `-a, --allowed-domains` - specify allowed domains
- `-v, --verbose` - detailed output
- `-u, --user-agent` - custom User-Agent
- `-t, --timeout` - request timeout in seconds

Get full help:

```bash
python scripts/scrape.py --help
```

## Best Practices

1. Start small - test with `--depth 1 --max-pages 10` before large scrapes
2. Respect servers - use reasonable concurrency and timeouts
3. Set limits - always use `--max-pages` for initial runs
4. Check robots.txt - manually verify that the site allows scraping
5. Use verbose mode - monitor for errors and unexpected behavior
6. Identify yourself - use a descriptive User-Agent with contact info
7. Monitor output - check `_metadata.json` for errors and statistics
8. Handle errors gracefully - review the error log in the metadata for problematic URLs

## Troubleshooting

Common issues:

- "Missing required dependency": run `pip install aiohttp beautifulsoup4 lxml aiofiles`
- Too many timeouts: increase `--timeout` or reduce `--concurrent`
- Scraping too slow: increase `--concurrent` (e.g., 20-30)
- Memory issues with large scrapes: reduce `--concurrent` or use `--max-pages` to chunk the work
- Following too many links: reduce `--depth` or rely on same-domain-only crawling (the default)
- Missing content: some sites require JavaScript; this scraper only handles static HTML
- HTTP errors: check the `errors` section of `_metadata.json` for specific issues

Limitations:

- Does not execute JavaScript (single-page apps may not work)
- Does not handle authentication or login
- Does not follow links in JavaScript or dynamically loaded content
- No built-in rate limiting (use `--concurrent` to control the request rate)

## Advanced Use Cases

### Loading Scraped Content

After scraping, use the Read tool to load content into context:

```
# Read a specific scraped page
Read file_path: output/docs.example.com/guide.txt

# Search across all scraped content
Grep pattern: "API endpoint" path: output/ -r
```

### Selective Re-scraping

The scraper tracks visited URLs in memory during a session but does not persist them between runs. To avoid re-downloading:

1. Run an initial scrape with limits
2. Check the output directory for what was downloaded
3. Run additional scrapes with different start URLs or configurations
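If you script step 2 yourself, a simple existence check against the mirrored output layout can decide what still needs fetching. Both `needs_fetch` and the `mapping` helper below are hypothetical illustrations, not part of `scrape.py`:

```python
import os
import tempfile
from pathlib import Path

def needs_fetch(url, url_to_path):
    """True if the URL's output file is absent (i.e., not saved by a prior run)."""
    return not Path(url_to_path(url)).exists()

# Demo against a throwaway directory with a toy URL -> path mapping.
tmp = tempfile.mkdtemp()
def mapping(url):
    name = url.split("//", 1)[1].replace("/", "_") + ".txt"
    return os.path.join(tmp, name)

Path(mapping("https://example.com/done")).write_text("cached")
assert not needs_fetch("https://example.com/done", mapping)  # already on disk
assert needs_fetch("https://example.com/new", mapping)       # still to scrape
```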

### Combining with Other Tools

Chain the scraper with other processing:

```bash
# Scrape, then process with a custom script
python scripts/scrape.py https://example.com output/ --depth 2
python your_analysis_script.py output/
```

## Resources

### `scripts/scrape.py`

The main web scraping tool, implementing concurrent crawling, content extraction, and intelligent filtering. Key features:

- Async/concurrent processing - uses `asyncio` and `aiohttp` for high-performance concurrent requests
- URL normalization - removes fragments and trailing slashes for proper deduplication
- Visited tracking - maintains `visited_urls` and `queued_urls` sets to prevent re-downloading
- Smart content extraction - removes scripts, styles, navigation, and common unwanted patterns
- Directory hierarchy - converts URLs to safe filesystem paths that preserve structure
- Error handling - tracks and reports errors in the metadata file
- Metadata generation - creates `_metadata.json` with scraping statistics and errors

The script can be executed directly and includes comprehensive command-line help via `--help`.