# Web Scraper

## Overview

Recursively scrape web pages with concurrent processing, extracting clean text content while following links. The scraper automatically handles URL deduplication, creates proper directory hierarchies based on URL structure, filters out unwanted content, and respects domain boundaries.

## When to Use This Skill

Use this skill when users request:

- Scraping content from websites
- Downloading documentation from online sources
- Extracting text from web pages at scale
- Crawling websites to gather information
- Archiving web content locally
- Following and downloading linked pages
- Collecting research data from web sources
- Building text datasets from websites

## Prerequisites

Install required dependencies:

```bash
pip install aiohttp beautifulsoup4 lxml aiofiles
```

These libraries provide:

- `aiohttp` - async HTTP client for concurrent requests
- `beautifulsoup4` - HTML parsing and content extraction
- `lxml` - fast HTML/XML parser
- `aiofiles` - async file I/O

## Core Capabilities

### 1. Basic Single-Page Scraping

Scrape a single page without following links:

```bash
python scripts/scrape.py <URL> <output-directory> --depth 0
```

Example:

```bash
python scripts/scrape.py https://example.com/article output/
```

This downloads only the specified page, extracts clean text content, and saves it to `output/example.com/article.txt`.

### 2. Recursive Scraping with Link Following

Scrape a page and follow links up to a specified depth:

```bash
python scripts/scrape.py <URL> <output-directory> --depth <N>
```

Example:

```bash
python scripts/scrape.py https://docs.example.com output/ --depth 2
```

Depth levels:

- `--depth 0` - only the start URL(s)
- `--depth 1` - start URLs plus all links on those pages
- `--depth 2` - start URLs, their links, and links found on those linked pages
- `--depth 3` and beyond - continue following links to the specified depth
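The depth levels above amount to a bounded breadth-first traversal of the link graph. The sketch below is an illustrative, synchronous simplification (the real `scrape.py` fetches pages concurrently); `get_links` is a hypothetical stand-in for "fetch a page and extract its links".

```python
from collections import deque

def crawl_to_depth(start_url, get_links, max_depth):
    """Breadth-first crawl: visit start_url, then links, out to max_depth."""
    visited = {start_url}
    order = []
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)            # "fetch" this page
        if depth >= max_depth:
            continue                 # do not enqueue links past the depth limit
        for link in get_links(url):
            if link not in visited:  # deduplicate before queueing
                visited.add(link)
                queue.append((link, depth + 1))
    return order

# Toy link graph: with --depth 1, only the start page and its direct links are fetched.
graph = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
pages = crawl_to_depth("a", lambda u: graph.get(u, []), max_depth=1)
# → ["a", "b", "c"]  ("d" sits at depth 2, so it is not fetched)
```

With `max_depth=2` the same graph yields `["a", "b", "c", "d"]`, matching the `--depth 2` description above.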

### 3. Limiting the Number of Pages

Prevent excessive scraping by setting a maximum page limit:

```bash
python scripts/scrape.py <URL> <output-directory> --depth 3 --max-pages 100
```

Example:

```bash
python scripts/scrape.py https://docs.example.com output/ --depth 3 --max-pages 50
```

Useful for:

- Testing scraper configuration before a full run
- Limiting resource usage
- Sampling content from large sites
- Staying within rate limits

### 4. Concurrent Processing

Control the number of simultaneous requests for faster scraping:

```bash
python scripts/scrape.py <URL> <output-directory> --concurrent <N>
```

Example:

```bash
python scripts/scrape.py https://docs.example.com output/ --depth 2 --concurrent 20
```

The default is 10 concurrent requests. Increase it for faster scraping, or decrease it for more conservative resource usage.

Guidelines:

- Small sites or slow servers: `--concurrent 5`
- Medium sites: `--concurrent 10` (default)
- Large, fast sites: `--concurrent 20` to `30`
- Be respectful of server resources
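Concurrency caps like `--concurrent` are typically enforced with an `asyncio.Semaphore`. The sketch below is illustrative (it may not match `scrape.py` internally) and uses `asyncio.sleep` as a stand-in for a real `aiohttp` GET so it runs without network access; the `in_flight`/`peak` counters exist only to demonstrate that the cap holds.

```python
import asyncio

async def fetch(url, semaphore, in_flight, peak):
    # The semaphore caps how many "requests" run at once, mirroring --concurrent.
    async with semaphore:
        in_flight[0] += 1
        peak[0] = max(peak[0], in_flight[0])
        await asyncio.sleep(0.01)  # stand-in for an HTTP GET
        in_flight[0] -= 1
        return url

async def scrape_all(urls, concurrent=10):
    semaphore = asyncio.Semaphore(concurrent)
    in_flight, peak = [0], [0]
    results = await asyncio.gather(
        *(fetch(u, semaphore, in_flight, peak) for u in urls)
    )
    return results, peak[0]

urls = [f"https://example.com/page{i}" for i in range(10)]
results, peak = asyncio.run(scrape_all(urls, concurrent=3))
# All 10 tasks are scheduled together, yet no more than 3 run at any moment.
```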

### 5. Domain Restrictions

By default, the scraper only follows links on the same domain as the start URL. This behavior can be controlled:

Same domain only (default):

```bash
python scripts/scrape.py https://example.com output/ --depth 2
```

Follow external links:

```bash
python scripts/scrape.py https://example.com output/ --depth 2 --follow-external
```

Specify allowed domains:

```bash
python scripts/scrape.py https://example.com output/ --depth 2 --allowed-domains example.com docs.example.com blog.example.com
```

Use `--allowed-domains` when:

- Documentation is split across multiple subdomains
- Content spans related domains
- You want to limit crawling to specific trusted domains
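The three modes above boil down to a per-link domain check. A minimal sketch mirroring the CLI flags (the function and its signature are illustrative, not the actual `scrape.py` API):

```python
from urllib.parse import urlparse

def domain_allowed(url, start_domain, allowed_domains=None, follow_external=False):
    """Decide whether a discovered link should be followed."""
    if follow_external:
        return True  # --follow-external: no domain restriction
    domain = urlparse(url).netloc.lower()
    if allowed_domains:
        # --allowed-domains: explicit whitelist
        return domain in {d.lower() for d in allowed_domains}
    return domain == start_domain  # default: same domain only

assert domain_allowed("https://example.com/docs", "example.com")
assert not domain_allowed("https://other.com/page", "example.com")
assert domain_allowed("https://docs.example.com/x", "example.com",
                      allowed_domains=["example.com", "docs.example.com"])
```

Note that with same-domain-only checking, subdomains such as `docs.example.com` do not match `example.com`; that is exactly the case `--allowed-domains` is for.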

### 6. Multiple Start URLs

Scrape from multiple starting points simultaneously:

```bash
python scripts/scrape.py <URL1> <URL2> <URL3> <output-directory>
```

Example:

```bash
python scripts/scrape.py https://example.com/docs https://example.com/guides https://example.com/tutorials output/ --depth 2
```

All start URLs are processed with the same configuration (depth, domain restrictions, and so on).

### 7. Request Configuration

Customize HTTP request behavior:

```bash
python scripts/scrape.py <URL> <output-directory> --user-agent "MyBot/1.0" --timeout 60
```

Options:

- `--user-agent` - custom User-Agent header (default: `"Mozilla/5.0 (compatible; WebScraper/1.0)"`)
- `--timeout` - request timeout in seconds (default: 30)

Example:

```bash
python scripts/scrape.py https://example.com output/ --depth 2 --user-agent "MyResearchBot/1.0 (+https://mysite.com/bot)" --timeout 45
```

### 8. Verbose Output

Enable detailed logging to monitor scraping progress:

```bash
python scripts/scrape.py <URL> <output-directory> --verbose
```

Verbose mode shows:

- Each URL being fetched
- Successful saves with file paths
- Errors and timeouts, with detailed error information

## Output Structure

### Directory Hierarchy

The scraper creates a directory hierarchy that mirrors the URL structure:

```
output/
├── example.com/
│   ├── index.txt              # https://example.com/
│   ├── about.txt              # https://example.com/about
│   ├── docs/
│   │   ├── index.txt          # https://example.com/docs/
│   │   ├── getting-started.txt
│   │   └── api/
│   │       └── reference.txt
│   └── blog/
│       ├── post-1.txt
│       └── post-2.txt
├── docs.example.com/
│   └── guide.txt
└── _metadata.json
```
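The URL-to-path mapping shown in the tree can be sketched as follows. This is an illustrative helper, not the actual logic in `scrape.py` (which may sanitize names more aggressively); it captures the two conventions visible above: directory-style URLs become `index.txt`, and other paths get a `.txt` suffix.

```python
from urllib.parse import urlparse
from pathlib import PurePosixPath

def url_to_path(url, output_dir="output"):
    """Map a URL onto the mirrored output layout (illustrative sketch)."""
    parsed = urlparse(url)
    path = parsed.path.strip("/")
    if not path or parsed.path.endswith("/"):
        # https://example.com/docs/ -> output/example.com/docs/index.txt
        name = PurePosixPath(output_dir, parsed.netloc, path, "index.txt")
    else:
        # https://example.com/about -> output/example.com/about.txt
        name = PurePosixPath(output_dir, parsed.netloc, path).with_suffix(".txt")
    return str(name)
```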

### File Format

Each scraped page is saved as a text file with the following structure:

```
URL: https://example.com/docs/guide
Title: Getting Started Guide
Scraped: 2025-10-21T14:30:00

================================================================================

[Clean extracted text content]
```
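Rendering that layout takes only a few lines. A hypothetical formatter (the real one lives inside `scrape.py`):

```python
from datetime import datetime

def format_page(url, title, text):
    """Render the per-page file layout: header lines, a separator, then the body."""
    header = (
        f"URL: {url}\n"
        f"Title: {title}\n"
        f"Scraped: {datetime.now().isoformat(timespec='seconds')}\n"
    )
    return header + "\n" + "=" * 80 + "\n\n" + text
```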

### Metadata File

`_metadata.json` contains scraping session information:

```json
{
  "start_time": "2025-10-21T14:30:00",
  "end_time": "2025-10-21T14:35:30",
  "pages_scraped": 42,
  "total_visited": 45,
  "errors": {
    "https://example.com/broken": "HTTP 404",
    "https://example.com/slow": "Timeout"
  }
}
```

## Content Extraction and Filtering

### What Gets Extracted

The scraper extracts clean text content by:

1. Focusing on main content - prioritizes `<main>`, `<article>`, or `<body>` tags
2. Removing unwanted elements - strips out:
   - Scripts and styles
   - Navigation menus
   - Headers and footers
   - Sidebars (`<aside>` tags)
   - Iframes and embedded content
   - SVG graphics
   - Comments
3. Filtering common patterns - removes:
   - Cookie consent messages
   - Privacy policy links
   - Terms of service boilerplate
   - UI elements (arrows, isolated numbers)
   - Very short lines (likely navigation items)
4. Preserving structure - maintains line breaks between content blocks
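Steps 1-2 above can be sketched with BeautifulSoup (already required under Prerequisites). This is a simplified illustration of the technique, not the exact code in `scrape.py`; the tag list and fallback order follow the description above.

```python
from bs4 import BeautifulSoup

UNWANTED_TAGS = ["script", "style", "nav", "header", "footer", "aside", "iframe", "svg"]

def extract_text(html):
    """Strip unwanted elements, then pull text from the main-content root."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(UNWANTED_TAGS):   # find_all shorthand
        tag.decompose()               # remove the element and its subtree
    # Prefer <main>, then <article>, then fall back to <body>.
    root = soup.find("main") or soup.find("article") or soup.body or soup
    return "\n".join(
        line.strip() for line in root.get_text("\n").splitlines() if line.strip()
    )

html = """
<html><body>
  <nav>Home | Docs</nav>
  <main><h1>Guide</h1><p>Useful content.</p></main>
  <footer>© 2025</footer>
</body></html>
"""
print(extract_text(html))  # prints "Guide" then "Useful content."
```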

### What Gets Filtered Out

Common unwanted patterns automatically removed:

- "Accept cookies" / "Reject all"
- "Cookie settings"
- "Privacy policy"
- "Terms of service"
- Navigation arrows (←, →, ↑, ↓)
- Isolated numbers
- Lines shorter than 3 characters
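A line filter implementing these rules might look like the sketch below. The pattern list is hypothetical, modeled only on the examples above; the real filter in `scrape.py` may differ.

```python
import re

# Hypothetical boilerplate phrases, taken from the examples above.
BOILERPLATE = re.compile(
    r"^(accept cookies|reject all|cookie settings|privacy policy|terms of service)$",
    re.IGNORECASE,
)
ARROWS = set("←→↑↓")

def keep_line(line):
    """Return True if the line looks like real content."""
    line = line.strip()
    if len(line) < 3:                     # very short lines
        return False
    if line.isdigit():                    # isolated numbers
        return False
    if BOILERPLATE.match(line):           # cookie/privacy/terms boilerplate
        return False
    if all(ch in ARROWS for ch in line):  # navigation arrows
        return False
    return True

lines = ["Accept cookies", "→", "123", "A real paragraph of content."]
kept = [l for l in lines if keep_line(l)]
# → ["A real paragraph of content."]
```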

## Common Usage Patterns

### Download Documentation Site

Scrape an entire documentation site with reasonable limits:

```bash
python scripts/scrape.py https://docs.example.com docs-archive/ --depth 3 --max-pages 200 --concurrent 15
```

### Archive a Blog

Download all posts from a blog (following pagination):

```bash
python scripts/scrape.py https://blog.example.com blog-archive/ --depth 2 --max-pages 500
```

### Research Data Collection

Gather text content from multiple related sources:

```bash
python scripts/scrape.py https://research.edu/papers https://research.edu/publications research-data/ --depth 2 --allowed-domains research.edu --concurrent 20
```

### Sample a Large Site

Test the configuration on a small sample before a full scrape:

```bash
python scripts/scrape.py https://large-site.com sample/ --depth 2 --max-pages 20 --verbose
```

Then run the full scrape after confirming the results:

```bash
python scripts/scrape.py https://large-site.com full-archive/ --depth 3 --max-pages 500 --concurrent 15
```

### Multi-Domain Knowledge Base

Scrape across multiple authorized domains:

```bash
python scripts/scrape.py https://main.example.com knowledge-base/ --depth 3 --allowed-domains main.example.com docs.example.com wiki.example.com --max-pages 300
```

## Implementation Approach

When users request web scraping:

1. Identify the scope:
   - What URLs should the scrape start from?
   - Should links be followed? How deep?
   - Are any domain restrictions needed?
   - Is there a reasonable page limit?
2. Configure the scraper:
   - Set an appropriate depth (typically 1-3)
   - Set `--max-pages` to avoid runaway scraping
   - Choose a concurrency level based on site size
   - Determine domain restrictions
3. Run with monitoring:
   - Start with verbose mode or a small sample
   - Monitor output for errors or unexpected content
   - Adjust the configuration if needed
4. Verify the output:
   - Check the output directory structure
   - Review `_metadata.json` for statistics
   - Sample a few text files for quality
   - Check for errors in the metadata
5. Process the content:
   - Text files are ready for loading into context
   - Use the Read tool to examine specific files
   - Use Grep to search across all scraped content
   - Load files as needed for analysis

## Quick Reference

Command structure:

```bash
python scripts/scrape.py <URL> [URL2 ...] <output-dir> [options]
```

Essential options:

- `-d, --depth N` - maximum link depth (default: 2)
- `-m, --max-pages N` - maximum pages to scrape
- `-c, --concurrent N` - concurrent requests (default: 10)
- `-f, --follow-external` - follow external links
- `-a, --allowed-domains` - specify allowed domains
- `-v, --verbose` - detailed output
- `-u, --user-agent` - custom User-Agent
- `-t, --timeout` - request timeout in seconds

Get full help:

```bash
python scripts/scrape.py --help
```

## Best Practices

1. Start small - test with `--depth 1 --max-pages 10` before large scrapes
2. Respect servers - use reasonable concurrency and timeouts
3. Set limits - always use `--max-pages` for initial runs
4. Check robots.txt - manually verify that the site allows scraping
5. Use verbose mode - monitor for errors and unexpected behavior
6. Identify yourself - use a descriptive User-Agent with contact info
7. Monitor output - check `_metadata.json` for errors and statistics
8. Handle errors gracefully - review the error log in the metadata for problematic URLs

## Troubleshooting

Common issues:

- "Missing required dependency": run `pip install aiohttp beautifulsoup4 lxml aiofiles`
- Too many timeouts: increase `--timeout` or reduce `--concurrent`
- Scraping too slow: increase `--concurrent` (e.g., 20-30)
- Memory issues with large scrapes: reduce `--concurrent` or use `--max-pages` to chunk the work
- Following too many links: reduce `--depth` or rely on same-domain-only crawling (the default)
- Missing content: some sites require JavaScript; this scraper only handles static HTML
- HTTP errors: check the `errors` section of `_metadata.json` for specific issues

Limitations:

- Does not execute JavaScript (single-page apps may not work)
- Does not handle authentication or login
- Does not follow links in JavaScript or dynamically loaded content
- No built-in rate limiting (use `--concurrent` to control the request rate)

## Advanced Use Cases

### Loading Scraped Content

After scraping, use the Read tool to load content into context:

```
# Read a specific scraped page
Read file_path: output/docs.example.com/guide.txt

# Search across all scraped content
Grep pattern: "API endpoint" path: output/ -r
```

### Selective Re-scraping

The scraper tracks visited URLs in memory during a session but does not persist them between runs. To avoid re-downloading:

1. Run an initial scrape with limits
2. Check the output directory for what was downloaded
3. Run additional scrapes with different start URLs or configurations
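If you script step 2 yourself, a simple existence check against the mirrored output layout can decide what still needs fetching. Both `needs_fetch` and the `mapping` helper below are hypothetical illustrations, not part of `scrape.py`:

```python
import os
import tempfile
from pathlib import Path

def needs_fetch(url, url_to_path):
    """True if the URL's output file is absent (i.e., not saved by a prior run)."""
    return not Path(url_to_path(url)).exists()

# Demo against a throwaway directory with a toy URL -> path mapping.
tmp = tempfile.mkdtemp()
def mapping(url):
    name = url.split("//", 1)[1].replace("/", "_") + ".txt"
    return os.path.join(tmp, name)

Path(mapping("https://example.com/done")).write_text("cached")
assert not needs_fetch("https://example.com/done", mapping)  # already on disk
assert needs_fetch("https://example.com/new", mapping)       # still to scrape
```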

### Combining with Other Tools

Chain the scraper with other processing:

```bash
# Scrape, then process with a custom script
python scripts/scrape.py https://example.com output/ --depth 2
python your_analysis_script.py output/
```

## Resources

### `scripts/scrape.py`

The main web scraping tool, implementing concurrent crawling, content extraction, and intelligent filtering. Key features:

- Async/concurrent processing - uses `asyncio` and `aiohttp` for high-performance concurrent requests
- URL normalization - removes fragments and trailing slashes for proper deduplication
- Visited tracking - maintains `visited_urls` and `queued_urls` sets to prevent re-downloading
- Smart content extraction - removes scripts, styles, navigation, and common unwanted patterns
- Directory hierarchy - converts URLs to safe filesystem paths that preserve structure
- Error handling - tracks and reports errors in the metadata file
- Metadata generation - creates `_metadata.json` with scraping statistics and errors

The script can be executed directly and includes comprehensive command-line help via `--help`.