# Web Scraper
## Overview
Recursively scrape web pages with concurrent processing, extracting clean text content while following links. The scraper automatically handles URL deduplication, creates proper directory hierarchies based on URL structure, filters out unwanted content, and respects domain boundaries.
## When to Use This Skill
Use this skill when users request:
- Scraping content from websites
- Downloading documentation from online sources
- Extracting text from web pages at scale
- Crawling websites to gather information
- Archiving web content locally
- Following and downloading linked pages
- Research data collection from web sources
- Building text datasets from websites
## Prerequisites
Install required dependencies:
```bash
pip install aiohttp beautifulsoup4 lxml aiofiles
```

These libraries provide:
- `aiohttp` - Async HTTP client for concurrent requests
- `beautifulsoup4` - HTML parsing and content extraction
- `lxml` - Fast HTML/XML parser
- `aiofiles` - Async file I/O
## Core Capabilities
### 1. Basic Single-Page Scraping
Scrape a single page without following links:

```bash
python scripts/scrape.py <URL> <output-directory> --depth 0
```

Example:

```bash
python scripts/scrape.py https://example.com/article output/
```

This downloads only the specified page, extracts clean text content, and saves it to `output/example.com/article.txt`.

### 2. Recursive Scraping with Link Following
Scrape a page and follow links up to a specified depth:

```bash
python scripts/scrape.py <URL> <output-directory> --depth <N>
```

Example:

```bash
python scripts/scrape.py https://docs.example.com output/ --depth 2
```

Depth levels:
- `--depth 0` - Only the start URL(s)
- `--depth 1` - Start URLs + all links on those pages
- `--depth 2` - Start URLs + links + links found on those linked pages
- `--depth 3+` - Continue following links to the specified depth
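If it helps to picture what the depth option does, the crawl can be thought of as a breadth-first walk over `(url, depth)` pairs. The sketch below is illustrative only; the `fetch_links` helper is an assumption, not part of `scripts/scrape.py`:

```python
from collections import deque

def crawl_order(start_urls, max_depth, fetch_links):
    """Yield URLs in the order a depth-limited, breadth-first crawl visits them.

    fetch_links(url) is assumed to return the links discovered on a page.
    """
    queue = deque((url, 0) for url in start_urls)
    seen = set(start_urls)
    while queue:
        url, depth = queue.popleft()
        yield url
        if depth >= max_depth:              # --depth N: stop following links here
            continue
        for link in fetch_links(url):
            if link not in seen:            # deduplicate before queueing
                seen.add(link)
                queue.append((link, depth + 1))
```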
### 3. Limiting the Number of Pages
Prevent excessive scraping by setting a maximum page limit:

```bash
python scripts/scrape.py <URL> <output-directory> --depth 3 --max-pages 100
```

Example:

```bash
python scripts/scrape.py https://docs.example.com output/ --depth 3 --max-pages 50
```

Useful for:
- Testing scraper configuration before a full run
- Limiting resource usage
- Sampling content from large sites
- Staying within rate limits

### 4. Concurrent Processing
Control the number of simultaneous requests for faster scraping:

```bash
python scripts/scrape.py <URL> <output-directory> --concurrent <N>
```

Example:

```bash
python scripts/scrape.py https://docs.example.com output/ --depth 2 --concurrent 20
```

The default is 10 concurrent requests. Increase it for faster scraping, or decrease it for more conservative resource usage.

Guidelines:
- Small sites or slow servers: `--concurrent 5`
- Medium sites: `--concurrent 10` (default)
- Large, fast sites: `--concurrent 20-30`
- Be respectful of server resources
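Conceptually, `--concurrent` caps how many requests are in flight at once. A minimal sketch of that pattern using `asyncio` and `aiohttp` (an illustration of the technique, not the script's actual code):

```python
import asyncio
import aiohttp

async def fetch_all(urls, concurrent=10):
    """Fetch pages with at most `concurrent` requests in flight at once."""
    semaphore = asyncio.Semaphore(concurrent)

    async def fetch(session, url):
        async with semaphore:                      # blocks when the limit is reached
            async with session.get(url) as resp:
                return url, resp.status, await resp.text()

    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url) for url in urls]
        return await asyncio.gather(*tasks, return_exceptions=True)

# results = asyncio.run(fetch_all(["https://example.com"], concurrent=20))
```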
### 5. Domain Restrictions
By default, the scraper only follows links on the same domain as the start URL. This can be controlled:

Same domain only (default):

```bash
python scripts/scrape.py https://example.com output/ --depth 2
```

Follow external links:

```bash
python scripts/scrape.py https://example.com output/ --depth 2 --follow-external
```

Specify allowed domains:

```bash
python scripts/scrape.py https://example.com output/ --depth 2 --allowed-domains example.com docs.example.com blog.example.com
```

Use `--allowed-domains` when:
- Documentation is split across multiple subdomains
- Content spans related domains
- You want to limit scraping to specific trusted domains
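The domain rules amount to a hostname allowlist. A small sketch of the check, assuming hostnames are compared with `urllib.parse` (the `is_allowed` helper is hypothetical):

```python
from urllib.parse import urlparse

def is_allowed(url, start_url, allowed_domains=None, follow_external=False):
    """Decide whether a discovered link should be followed."""
    if follow_external:
        return True
    host = urlparse(url).netloc.lower()
    if allowed_domains:                      # --allowed-domains example.com docs.example.com
        return host in {d.lower() for d in allowed_domains}
    return host == urlparse(start_url).netloc.lower()   # same-domain only (default)

print(is_allowed("https://docs.example.com/x", "https://example.com"))         # False
print(is_allowed("https://docs.example.com/x", "https://example.com",
                 allowed_domains=["example.com", "docs.example.com"]))          # True
```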
### 6. Multiple Start URLs
Scrape from multiple starting points simultaneously:

```bash
python scripts/scrape.py <URL1> <URL2> <URL3> <output-directory>
```

Example:

```bash
python scripts/scrape.py https://example.com/docs https://example.com/guides https://example.com/tutorials output/ --depth 2
```

All start URLs are processed with the same configuration (depth, domain restrictions, etc.).

### 7. Request Configuration
Customize HTTP request behavior:

```bash
python scripts/scrape.py <URL> <output-directory> --user-agent "MyBot/1.0" --timeout 60
```

Options:
- `--user-agent` - Custom User-Agent header (default: "Mozilla/5.0 (compatible; WebScraper/1.0)")
- `--timeout` - Request timeout in seconds (default: 30)

Example:

```bash
python scripts/scrape.py https://example.com output/ --depth 2 --user-agent "MyResearchBot/1.0 (+https://mysite.com/bot)" --timeout 45
```
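If you want to see how such options typically map onto the underlying HTTP client, here is a hedged sketch using `aiohttp` session settings (illustrative only; the script's internals may differ):

```python
import asyncio
import aiohttp

async def fetch(url, user_agent="MyResearchBot/1.0 (+https://mysite.com/bot)", timeout=45):
    """Fetch one page, applying the --user-agent and --timeout settings."""
    async with aiohttp.ClientSession(
        headers={"User-Agent": user_agent},            # custom User-Agent header
        timeout=aiohttp.ClientTimeout(total=timeout),  # request timeout in seconds
    ) as session:
        async with session.get(url) as resp:
            return resp.status, await resp.text()

# status, html = asyncio.run(fetch("https://example.com"))
```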
### 8. Verbose Output
Enable detailed logging to monitor scraping progress:

```bash
python scripts/scrape.py <URL> <output-directory> --verbose
```

Verbose mode shows:
- Each URL being fetched
- Successful saves with file paths
- Errors and timeouts
- Detailed error information

## Output Structure
### Directory Hierarchy

The scraper creates a directory hierarchy that mirrors the URL structure:

```
output/
├── example.com/
│   ├── index.txt               # https://example.com/
│   ├── about.txt               # https://example.com/about
│   ├── docs/
│   │   ├── index.txt           # https://example.com/docs/
│   │   ├── getting-started.txt
│   │   └── api/
│   │       └── reference.txt
│   └── blog/
│       ├── post-1.txt
│       └── post-2.txt
├── docs.example.com/
│   └── guide.txt
└── _metadata.json
```
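The URL-to-path mapping follows directly from the tree above: the hostname becomes the top-level directory, path segments become folders, and a trailing slash becomes `index.txt`. A simplified sketch (the real script also sanitizes unsafe characters; `url_to_path` is an illustrative name):

```python
from pathlib import Path
from urllib.parse import urlparse

def url_to_path(url, output_dir="output"):
    """Map a URL onto the mirrored directory layout shown above."""
    parts = urlparse(url)
    path = parts.path
    if path == "" or path.endswith("/"):
        path = path + "index"               # https://example.com/docs/ -> docs/index.txt
    return Path(output_dir) / parts.netloc / Path(path.lstrip("/")).with_suffix(".txt")

print(url_to_path("https://example.com/docs/"))   # output/example.com/docs/index.txt
print(url_to_path("https://example.com/about"))   # output/example.com/about.txt
```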
### File Format
Each scraped page is saved as a text file with the following structure:
```
URL: https://example.com/docs/guide
Title: Getting Started Guide
Scraped: 2025-10-21T14:30:00
================================================================================

[Clean extracted text content]
```
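Writing that record asynchronously is what `aiofiles` is used for. A hedged sketch of producing the header-plus-body layout shown above (`save_page` is an illustrative name, not necessarily the script's):

```python
from datetime import datetime
import aiofiles

async def save_page(path, url, title, text):
    """Write one scraped page in the 'URL / Title / Scraped / separator / body' layout."""
    header = (
        f"URL: {url}\n"
        f"Title: {title}\n"
        f"Scraped: {datetime.now().isoformat(timespec='seconds')}\n"
        + "=" * 80 + "\n\n"
    )
    async with aiofiles.open(path, "w", encoding="utf-8") as f:
        await f.write(header + text)
```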
### Metadata File
The scraper writes run statistics and errors to `_metadata.json`:

```json
{
  "start_time": "2025-10-21T14:30:00",
  "end_time": "2025-10-21T14:35:30",
  "pages_scraped": 42,
  "total_visited": 45,
  "errors": {
    "https://example.com/broken": "HTTP 404",
    "https://example.com/slow": "Timeout"
  }
}
```

## Content Extraction and Filtering
### What Gets Extracted

The scraper extracts clean text content by:

1. Focusing on main content - Prioritizes `<main>`, `<article>`, or `<body>` tags
2. Removing unwanted elements - Strips out:
   - Scripts and styles
   - Navigation menus
   - Headers and footers
   - Sidebars (aside tags)
   - Iframes and embedded content
   - SVG graphics
   - Comments
3. Filtering common patterns - Removes:
   - Cookie consent messages
   - Privacy policy links
   - Terms of service boilerplate
   - UI elements (arrows, single numbers)
   - Very short lines (likely navigation items)
4. Preserving structure - Maintains line breaks between content blocks
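As a rough illustration of steps 1-2, the extraction can be approximated with BeautifulSoup as below (a condensed sketch that follows the tag list above, not the script's exact logic):

```python
from bs4 import BeautifulSoup

UNWANTED_TAGS = ["script", "style", "nav", "header", "footer", "aside", "iframe", "svg"]

def extract_text(html):
    """Return clean text from the main content area of a page."""
    soup = BeautifulSoup(html, "lxml")
    for tag in soup(UNWANTED_TAGS):          # strip scripts, styles, navigation, etc.
        tag.decompose()
    # Prefer <main>, then <article>, then fall back to <body>
    root = soup.find("main") or soup.find("article") or soup.body or soup
    return root.get_text(separator="\n", strip=True)
```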
### What Gets Filtered Out
Common unwanted patterns automatically removed:
- "Accept cookies" / "Reject all"
- "Cookie settings"
- "Privacy policy"
- "Terms of service"
- Navigation arrows (←, →, ↑, ↓)
- Isolated numbers
- Lines shorter than 3 characters
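A sketch of how the pattern filter above can be approximated with a few regular expressions (illustrative; the script's actual rules may be broader):

```python
import re

BOILERPLATE = re.compile(
    r"^(accept cookies|reject all|cookie settings|privacy policy|terms of service)$",
    re.IGNORECASE,
)
UI_FRAGMENT = re.compile(r"^[←→↑↓]$|^\d+$")    # lone navigation arrows or isolated numbers

def filter_lines(text):
    """Drop cookie/ToS boilerplate, UI fragments, and very short lines."""
    kept = []
    for line in (raw.strip() for raw in text.splitlines()):
        if len(line) < 3:                       # lines shorter than 3 characters
            continue
        if BOILERPLATE.match(line) or UI_FRAGMENT.match(line):
            continue
        kept.append(line)
    return "\n".join(kept)
```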
## Common Usage Patterns
### Download Documentation Site

Scrape an entire documentation site with reasonable limits:

```bash
python scripts/scrape.py https://docs.example.com docs-archive/ --depth 3 --max-pages 200 --concurrent 15
```

### Archive a Blog

Download all blog posts from a blog (following pagination):

```bash
python scripts/scrape.py https://blog.example.com blog-archive/ --depth 2 --max-pages 500
```

### Research Data Collection

Gather text content from multiple related sources:

```bash
python scripts/scrape.py https://research.edu/papers https://research.edu/publications research-data/ --depth 2 --allowed-domains research.edu --concurrent 20
```

### Sample a Large Site

Test the configuration on a small sample before a full scrape:

```bash
python scripts/scrape.py https://largeSite.com sample/ --depth 2 --max-pages 20 --verbose
```

Then run the full scrape after confirming the results:

```bash
python scripts/scrape.py https://largeSite.com full-archive/ --depth 3 --max-pages 500 --concurrent 15
```

### Multi-Domain Knowledge Base

Scrape across multiple authorized domains:

```bash
python scripts/scrape.py https://main.example.com knowledge-base/ --depth 3 --allowed-domains main.example.com docs.example.com wiki.example.com --max-pages 300
```

## Implementation Approach
When users request web scraping:

1. Identify the scope:
   - What URLs to start from?
   - Should links be followed? How deep?
   - Any domain restrictions needed?
   - Is there a reasonable page limit?
2. Configure the scraper:
   - Set an appropriate depth (typically 1-3)
   - Set max-pages to avoid runaway scraping
   - Choose the concurrency level based on site size
   - Determine domain restrictions
3. Run with monitoring:
   - Start with verbose mode or a small sample
   - Monitor output for errors or unexpected content
   - Adjust the configuration if needed
4. Verify output:
   - Check the output directory structure
   - Review `_metadata.json` for statistics
   - Sample a few text files for quality
   - Check for errors in the metadata
5. Process the content:
   - Text files are ready for loading into context
   - Use the Read tool to examine specific files
   - Use Grep to search across all scraped content
   - Load files as needed for analysis

## Quick Reference
Command structure:

```bash
python scripts/scrape.py <URL> [URL2 ...] <output-dir> [options]
```

Essential options:
- `-d, --depth N` - Maximum link depth (default: 2)
- `-m, --max-pages N` - Maximum pages to scrape
- `-c, --concurrent N` - Concurrent requests (default: 10)
- `-f, --follow-external` - Follow external links
- `-a, --allowed-domains` - Specify allowed domains
- `-v, --verbose` - Detailed output
- `-u, --user-agent` - Custom User-Agent
- `-t, --timeout` - Request timeout in seconds

Get full help:

```bash
python scripts/scrape.py --help
```

## Best Practices
- Start small - Test with `--depth 1 --max-pages 10` before large scrapes
- Respect servers - Use reasonable concurrency and timeouts
- Set limits - Always use `--max-pages` for initial runs
- Check robots.txt - Manually verify the site allows scraping
- Use verbose mode - Monitor for errors and unexpected behavior
- Identify yourself - Use a descriptive User-Agent with contact info
- Monitor output - Check `_metadata.json` for errors and statistics
- Handle errors gracefully - Review the error log in the metadata for problematic URLs

## Troubleshooting
Common issues:
- "Missing required dependency": Run `pip install aiohttp beautifulsoup4 lxml aiofiles`
- Too many timeouts: Increase `--timeout` or reduce `--concurrent`
- Scraping too slow: Increase `--concurrent` (e.g., 20-30)
- Memory issues with large scrapes: Reduce `--concurrent` or use `--max-pages` to chunk the work
- Following too many links: Reduce `--depth` or rely on same-domain-only mode (the default)
- Missing content: Some sites require JavaScript; this scraper only handles static HTML
- HTTP errors: Check the errors section of `_metadata.json` for specific issues

Limitations:
- Does not execute JavaScript (single-page apps may not work)
- Does not handle authentication or login
- Does not follow links in JavaScript or dynamically loaded content
- No built-in rate limiting (use `--concurrent` to control request rate)

## Advanced Use Cases
### Loading Scraped Content

After scraping, use the Read tool to load content into context:

```
# Read a specific scraped page
Read file_path: output/docs.example.com/guide.txt

# Search across all scraped content
Grep pattern: "API endpoint" path: output/ -r
```
### Selective Re-scraping

The scraper tracks visited URLs in memory during a session but doesn't persist them between runs. To avoid re-downloading:
- Run initial scrape with limits
- Check output directory for what was downloaded
- Run additional scrapes with different start URLs or configurations
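Within a single run, "already visited" is keyed on the normalized URL (fragment and trailing slash removed, per the notes in Resources). A sketch of that in-memory dedup (names are illustrative; nothing is persisted between runs):

```python
from urllib.parse import urldefrag

def normalize_url(url):
    """Dedup key: drop the #fragment and any trailing slash."""
    url, _fragment = urldefrag(url)
    return url.rstrip("/")

visited_urls = set()

def should_fetch(url):
    """True only the first time a normalized URL is seen in this session."""
    key = normalize_url(url)
    if key in visited_urls:
        return False
    visited_urls.add(key)
    return True
```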
### Combining with Other Tools

Chain the scraper with other processing:

```bash
# Scrape, then process with a custom script
python scripts/scrape.py https://example.com output/ --depth 2
python your_analysis_script.py output/
```

## Resources
### scripts/scrape.py
The main web scraping tool implementing concurrent crawling, content extraction, and intelligent filtering. Key features:
- Async/concurrent processing - Uses `asyncio` and `aiohttp` for high-performance concurrent requests
- URL normalization - Removes fragments and trailing slashes for proper deduplication
- Visited tracking - Maintains `visited_urls` and `queued_urls` sets to prevent re-downloading
- Smart content extraction - Removes scripts, styles, navigation, and common unwanted patterns
- Directory hierarchy - Converts URLs to safe filesystem paths maintaining structure
- Error handling - Tracks and reports errors in the metadata file
- Metadata generation - Creates `_metadata.json` with scraping statistics and errors

The script can be executed directly and includes comprehensive command-line help via `--help`.