Web Scraper

Fetch web page content (text + images) and save as HTML or Markdown locally.
Minimal dependencies: only requires `requests` and `beautifulsoup4` - no browser automation.
Default behavior: downloads images to a local `images/` directory automatically.

Quick start

Single page

```bash
{baseDir}/scripts/scrape.py --url "https://example.com" --format html --output /tmp/page.html
{baseDir}/scripts/scrape.py --url "https://example.com" --format md --output /tmp/page.md
```

Recursive (follow links)

```bash
{baseDir}/scripts/scrape.py --url "https://docs.example.com" --format md --recursive --max-depth 2 --output ~/Downloads/docs-archive
```

Setup

Requires Python 3.8+ and minimal dependencies:

```bash
cd {baseDir}
pip install -r requirements.txt
```

Or install manually:

```bash
pip install requests beautifulsoup4
```

Note: No browser or driver needed - uses pure HTTP requests.
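The fetch-and-parse core these two packages provide can be sketched roughly as follows. This is a simplified illustration, not the actual `scrape.py`; the function names are hypothetical:

```python
from bs4 import BeautifulSoup  # provided by the beautifulsoup4 package

def extract_text(html: str) -> str:
    """Strip script/style tags and return the page's visible text."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()  # remove non-visible content
    return soup.get_text(separator="\n", strip=True)

def fetch(url: str, timeout: float = 10.0) -> str:
    """Fetch a page over plain HTTP(S) -- no browser or driver involved."""
    import requests  # imported lazily; only needed for live fetches
    resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=timeout)
    resp.raise_for_status()
    return resp.text
```

Because the page is retrieved as static HTML, anything a site renders only via JavaScript will not appear in the output (see Troubleshooting below).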

Inputs to collect

Single page mode

  • URL: The web page to scrape (required)
  • Format: `html` or `md` (default: `html`)
  • Output path: Where to save the file (default: current directory with auto-generated name)
  • Images: Downloads images by default (use `--no-download-images` to disable)

Recursive mode (--recursive)

  • URL: Starting point for recursive scraping
  • Format: `html` or `md`
  • Output directory: Where to save all scraped pages
  • Max depth: How many levels deep to follow links (default: 2)
  • Max pages: Maximum total pages to scrape (default: 50)
  • Domain filter: Whether to stay within the same domain (default: yes)
  • Images: Downloads images by default

Conversation Flow

  1. Ask the user for the URL to scrape
  2. Ask for the preferred output format (HTML or Markdown)
    • Note: Both formats include text and images by default
    • HTML: Preserves the original structure with downloaded images
    • Markdown: Clean text format with downloaded images in an `images/` folder
  3. For recursive mode: Ask for max depth and max pages (optional; sensible defaults apply)
  4. Ask where to save (or suggest a default path like `/tmp/` or `~/Downloads/`)
  5. Run the script and confirm success
  6. Show the saved file/directory path

Examples

Single Page Scraping

Save as HTML

```bash
{baseDir}/scripts/scrape.py --url "https://docs.openclaw.ai/start/quickstart" --format html --output ~/Downloads/openclaw-quickstart.html
```

Save as Markdown (with images, default)

```bash
{baseDir}/scripts/scrape.py --url "https://en.wikipedia.org/wiki/Web_scraping" --format md --output ~/Documents/web-scraping.md
```

Result: Creates `web-scraping.md` plus an `images/` folder containing all downloaded images (text + images).

Without downloading images (optional)

```bash
{baseDir}/scripts/scrape.py --url "https://example.com" --format md --no-download-images
```

Result: Text plus the original image URLs only (nothing is downloaded locally).

Auto-generate filename

```bash
{baseDir}/scripts/scrape.py --url "https://example.com" --format html
```

Saves to: `example-com-{timestamp}.html`

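One plausible way to derive such a filename from the URL's host plus a timestamp (a sketch only; the real `scrape.py` may differ in details):

```python
import re
import time
from urllib.parse import urlparse

def auto_filename(url, fmt="html", now=None):
    """Build a filesystem-safe name like example-com-{timestamp}.html from a URL."""
    host = urlparse(url).netloc or "page"
    # Collapse dots and any other unsafe characters into hyphens.
    safe = re.sub(r"[^a-zA-Z0-9]+", "-", host).strip("-").lower()
    ts = int(now if now is not None else time.time())
    return f"{safe}-{ts}.{fmt}"
```

For example, `auto_filename("https://example.com", "html")` yields a name of the form `example-com-1700000000.html`.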

Recursive Scraping

Basic recursive crawl (depth 2, same domain, with images)

```bash
{baseDir}/scripts/scrape.py --url "https://docs.example.com" --format md --recursive --output ~/Downloads/docs-archive
```

Output structure (text + images for all pages):

```
docs-archive/
├── index.md
├── getting-started.md
├── api/
│   ├── authentication.md
│   └── endpoints.md
└── images/              # Shared images from all pages
    ├── logo.png
    └── diagram.svg
```

Deep crawl with custom limits

```bash
{baseDir}/scripts/scrape.py \
  --url "https://blog.example.com" \
  --format html \
  --recursive \
  --max-depth 3 \
  --max-pages 100 \
  --output ~/Archives/blog-backup
```

Ignore robots.txt (use with caution)

```bash
{baseDir}/scripts/scrape.py \
  --url "https://example.com" \
  --format md \
  --recursive \
  --no-respect-robots \
  --rate-limit 1.0
```

Faster scraping (shorter rate-limit delay)

```bash
{baseDir}/scripts/scrape.py \
  --url "https://yoursite.com" \
  --format md \
  --recursive \
  --rate-limit 0.2
```

Features

Single Page Mode

  • HTML output: Preserves original page structure
    • ✅ Clean, readable HTML document
    • ✅ All images downloaded to an `images/` folder
    • ✅ Suitable for offline viewing
  • Markdown output: Extracts clean text content
    • Auto-downloads images to a local `images/` directory (default)
    • ✅ Converts image URLs to relative paths
    • ✅ Clean, readable format for archiving
    • ✅ Falls back to original URLs if a download fails
    • Use the `--no-download-images` flag to keep original URLs only
  • Simple and fast: Pure HTTP requests, no browser needed
  • Auto filename: Generates a safe filename from the URL if none is specified
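The relative-path rewrite with URL fallback can be sketched like this. It is illustrative only: `downloaded` maps each image URL to its saved local path, and any URL whose download failed is simply absent from the map, so its original URL is kept:

```python
import re

def rewrite_image_links(markdown, downloaded):
    """Point Markdown image links at local copies; keep the original URL
    for any image that was not downloaded successfully."""
    def repl(match):
        alt, url = match.group(1), match.group(2)
        local = downloaded.get(url, url)  # fallback: original URL
        return f"![{alt}]({local})"
    # Match Markdown image syntax: ![alt](url)
    return re.sub(r"!\[([^\]]*)\]\(([^)]+)\)", repl, markdown)
```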

Recursive Mode (--recursive)

  • ✅ Intelligent link discovery: Automatically follows all links on crawled pages
  • ✅ Depth control: `--max-depth` limits how many levels deep to crawl (default: 2)
  • ✅ Page limit: `--max-pages` caps total pages to prevent runaway crawls (default: 50)
  • ✅ Domain filtering: `--same-domain` keeps the crawl within the starting domain (default: on)
  • ✅ robots.txt compliance: Respects the site's crawling rules by default
  • ✅ Rate limiting: `--rate-limit` adds a delay between requests (default: 0.5s)
  • ✅ Smart URL filtering: Skips images, scripts, CSS, and duplicate URLs
  • ✅ Progress tracking: Real-time console output with success/fail/skip counts
  • ✅ Organized output: Preserves URL structure in the directory hierarchy
  • ✅ Efficient crawling: Sequential with rate limiting to respect servers
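Taken together, the depth, page, and domain limits amount to a bounded breadth-first crawl. A rough sketch of that shape (the link fetcher is injected to keep it self-contained; the real script's internals may differ):

```python
import time
from collections import deque
from urllib.parse import urlparse

def crawl(start_url, fetch_links, max_depth=2, max_pages=50,
          same_domain=True, rate_limit=0.0):
    """Bounded BFS crawl. fetch_links(url) -> list of links found on that page."""
    domain = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])  # (url, depth) pairs
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if rate_limit:
            time.sleep(rate_limit)  # be polite between requests
        if depth >= max_depth:
            continue  # reached the depth limit; don't follow links further
        for link in fetch_links(url):
            if link in seen:
                continue  # skip duplicate URLs
            if same_domain and urlparse(link).netloc != domain:
                continue  # stay on the starting domain
            seen.add(link)
            queue.append((link, depth + 1))
    return visited
```

Because pages are visited in breadth-first order, the page cap cuts off the crawl at the shallowest pages first, which is usually what an archive wants.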

Guardrails

Single Page Mode

  • Respect robots.txt and site terms of service
  • Some sites may block automated access; this tool uses standard HTTP requests
  • Large pages with many images may take time to download

Recursive Mode

  • Start small: Test with `--max-depth 1 --max-pages 10` first
  • Respect robots.txt: On by default; only use `--no-respect-robots` on your own sites
  • Rate limiting: The default 0.5s delay is polite; don't go below 0.2s for public sites
  • Same domain: Strongly recommended to keep `--same-domain` enabled
  • Monitor progress: Watch for high failure rates (they may indicate blocking)
  • Storage: Recursive crawls can generate many files; ensure sufficient disk space
  • Legal: Ensure you have permission to crawl and archive the target site
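robots.txt checking is built into Python's standard library, so a compliance check can be as small as the sketch below (how `scrape.py` implements it may differ; the function name is hypothetical):

```python
from urllib.robotparser import RobotFileParser

def allowed_by_robots(rules, url, agent="*"):
    """Check a URL against robots.txt rules (pass the file's text in `rules`)."""
    rp = RobotFileParser()
    rp.parse(rules.splitlines())
    return rp.can_fetch(agent, url)
```

In a live crawler you would instead point the parser at the site's robots.txt with `rp.set_url(...)` followed by `rp.read()`, then call `rp.can_fetch()` before each request.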

Troubleshooting

  • Connection errors: Check your internet connection and URL validity
  • 403/blocked: Some sites block scrapers; the tool uses realistic User-Agent headers
  • Timeout: Increase the `--timeout` value (in seconds) for slow-loading pages
  • Image download fails: Images fall back to their original URLs
  • Missing images: Some sites load images dynamically with JavaScript (not supported)