Web Scraper

Fetch web page content (text + images) and save as HTML or Markdown locally.
Minimal dependencies: only requires `requests` and `beautifulsoup4` - no browser automation.
Default behavior: downloads images to a local `images/` directory automatically.

Quick start

Single page

```bash
{baseDir}/scripts/scrape.py --url "https://example.com" --format html --output /tmp/page.html
{baseDir}/scripts/scrape.py --url "https://example.com" --format md --output /tmp/page.md
```

Recursive (follow links)

```bash
{baseDir}/scripts/scrape.py --url "https://docs.example.com" --format md --recursive --max-depth 2 --output ~/Downloads/docs-archive
```

Setup

Requires Python 3.8+ and minimal dependencies:

```bash
cd {baseDir}
pip install -r requirements.txt
```

Or install manually:

```bash
pip install requests beautifulsoup4
```

Note: No browser or driver needed - uses pure HTTP requests.
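The fetch-and-parse core these two packages provide can be sketched roughly as follows. This is a simplified illustration, not the actual `scrape.py`; the function names are hypothetical:

```python
from bs4 import BeautifulSoup  # provided by the beautifulsoup4 package

def extract_text(html: str) -> str:
    """Strip script/style tags and return the page's visible text."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()  # remove non-visible content
    return soup.get_text(separator="\n", strip=True)

def fetch(url: str, timeout: float = 10.0) -> str:
    """Fetch a page over plain HTTP(S) -- no browser or driver involved."""
    import requests  # imported lazily; only needed for live fetches
    resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=timeout)
    resp.raise_for_status()
    return resp.text
```

Because the page is retrieved as static HTML, anything a site renders only via JavaScript will not appear in the output (see Troubleshooting below).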

Inputs to collect

Single page mode

  • URL: The web page to scrape (required)
  • Format: `html` or `md` (default: `html`)
  • Output path: Where to save the file (default: current directory with auto-generated name)
  • Images: Downloads images by default (use `--no-download-images` to disable)

Recursive mode (--recursive)

  • URL: Starting point for recursive scraping
  • Format: `html` or `md`
  • Output directory: Where to save all scraped pages
  • Max depth: How many levels deep to follow links (default: 2)
  • Max pages: Maximum total pages to scrape (default: 50)
  • Domain filter: Whether to stay within the same domain (default: yes)
  • Images: Downloads images by default

Conversation Flow

  1. Ask the user for the URL to scrape
  2. Ask for the preferred output format (HTML or Markdown)
    • Note: Both formats include text and images by default
    • HTML: Preserves the original structure with downloaded images
    • Markdown: Clean text format with downloaded images in an `images/` folder
  3. For recursive mode: Ask for max depth and max pages (optional; sensible defaults apply)
  4. Ask where to save (or suggest a default path like `/tmp/` or `~/Downloads/`)
  5. Run the script and confirm success
  6. Show the saved file/directory path

Examples

Single Page Scraping

Save as HTML

```bash
{baseDir}/scripts/scrape.py --url "https://docs.openclaw.ai/start/quickstart" --format html --output ~/Downloads/openclaw-quickstart.html
```

Save as Markdown (with images, default)

```bash
{baseDir}/scripts/scrape.py --url "https://en.wikipedia.org/wiki/Web_scraping" --format md --output ~/Documents/web-scraping.md
```

Result: Creates `web-scraping.md` plus an `images/` folder containing all downloaded images (text + images).

Without downloading images (optional)

```bash
{baseDir}/scripts/scrape.py --url "https://example.com" --format md --no-download-images
```

Result: Text plus the original image URLs only (nothing is downloaded locally).

Auto-generate filename

```bash
{baseDir}/scripts/scrape.py --url "https://example.com" --format html
```

Saves to: `example-com-{timestamp}.html`

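One plausible way to derive such a filename from the URL's host plus a timestamp (a sketch only; the real `scrape.py` may differ in details):

```python
import re
import time
from urllib.parse import urlparse

def auto_filename(url, fmt="html", now=None):
    """Build a filesystem-safe name like example-com-{timestamp}.html from a URL."""
    host = urlparse(url).netloc or "page"
    # Collapse dots and any other unsafe characters into hyphens.
    safe = re.sub(r"[^a-zA-Z0-9]+", "-", host).strip("-").lower()
    ts = int(now if now is not None else time.time())
    return f"{safe}-{ts}.{fmt}"
```

For example, `auto_filename("https://example.com", "html")` yields a name of the form `example-com-1700000000.html`.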

Recursive Scraping

Basic recursive crawl (depth 2, same domain, with images)

```bash
{baseDir}/scripts/scrape.py --url "https://docs.example.com" --format md --recursive --output ~/Downloads/docs-archive
```

Output structure (text + images for all pages):

```
docs-archive/
├── index.md
├── getting-started.md
├── api/
│   ├── authentication.md
│   └── endpoints.md
└── images/              # Shared images from all pages
    ├── logo.png
    └── diagram.svg
```

Deep crawl with custom limits

```bash
{baseDir}/scripts/scrape.py \
  --url "https://blog.example.com" \
  --format html \
  --recursive \
  --max-depth 3 \
  --max-pages 100 \
  --output ~/Archives/blog-backup
```

Ignore robots.txt (use with caution)

```bash
{baseDir}/scripts/scrape.py \
  --url "https://example.com" \
  --format md \
  --recursive \
  --no-respect-robots \
  --rate-limit 1.0
```

Faster scraping (shorter rate-limit delay)

```bash
{baseDir}/scripts/scrape.py \
  --url "https://yoursite.com" \
  --format md \
  --recursive \
  --rate-limit 0.2
```

Features

Single Page Mode

  • HTML output: Preserves original page structure
    • ✅ Clean, readable HTML document
    • ✅ All images downloaded to an `images/` folder
    • ✅ Suitable for offline viewing
  • Markdown output: Extracts clean text content
    • Auto-downloads images to a local `images/` directory (default)
    • ✅ Converts image URLs to relative paths
    • ✅ Clean, readable format for archiving
    • ✅ Falls back to original URLs if a download fails
    • Use the `--no-download-images` flag to keep original URLs only
  • Simple and fast: Pure HTTP requests, no browser needed
  • Auto filename: Generates a safe filename from the URL if none is specified
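The relative-path rewrite with URL fallback can be sketched like this. It is illustrative only: `downloaded` maps each image URL to its saved local path, and any URL whose download failed is simply absent from the map, so its original URL is kept:

```python
import re

def rewrite_image_links(markdown, downloaded):
    """Point Markdown image links at local copies; keep the original URL
    for any image that was not downloaded successfully."""
    def repl(match):
        alt, url = match.group(1), match.group(2)
        local = downloaded.get(url, url)  # fallback: original URL
        return f"![{alt}]({local})"
    # Match Markdown image syntax: ![alt](url)
    return re.sub(r"!\[([^\]]*)\]\(([^)]+)\)", repl, markdown)
```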

Recursive Mode (--recursive)

  • ✅ Intelligent link discovery: Automatically follows all links on crawled pages
  • ✅ Depth control: `--max-depth` limits how many levels deep to crawl (default: 2)
  • ✅ Page limit: `--max-pages` caps total pages to prevent runaway crawls (default: 50)
  • ✅ Domain filtering: `--same-domain` keeps the crawl within the starting domain (default: on)
  • ✅ robots.txt compliance: Respects the site's crawling rules by default
  • ✅ Rate limiting: `--rate-limit` adds a delay between requests (default: 0.5s)
  • ✅ Smart URL filtering: Skips images, scripts, CSS, and duplicate URLs
  • ✅ Progress tracking: Real-time console output with success/fail/skip counts
  • ✅ Organized output: Preserves URL structure in the directory hierarchy
  • ✅ Efficient crawling: Sequential with rate limiting to respect servers
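Taken together, the depth, page, and domain limits amount to a bounded breadth-first crawl. A rough sketch of that shape (the link fetcher is injected to keep it self-contained; the real script's internals may differ):

```python
import time
from collections import deque
from urllib.parse import urlparse

def crawl(start_url, fetch_links, max_depth=2, max_pages=50,
          same_domain=True, rate_limit=0.0):
    """Bounded BFS crawl. fetch_links(url) -> list of links found on that page."""
    domain = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])  # (url, depth) pairs
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if rate_limit:
            time.sleep(rate_limit)  # be polite between requests
        if depth >= max_depth:
            continue  # reached the depth limit; don't follow links further
        for link in fetch_links(url):
            if link in seen:
                continue  # skip duplicate URLs
            if same_domain and urlparse(link).netloc != domain:
                continue  # stay on the starting domain
            seen.add(link)
            queue.append((link, depth + 1))
    return visited
```

Because pages are visited in breadth-first order, the page cap cuts off the crawl at the shallowest pages first, which is usually what an archive wants.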

Guardrails

Single Page Mode

  • Respect robots.txt and site terms of service
  • Some sites may block automated access; this tool uses standard HTTP requests
  • Large pages with many images may take time to download

Recursive Mode

  • Start small: Test with `--max-depth 1 --max-pages 10` first
  • Respect robots.txt: On by default; only use `--no-respect-robots` on your own sites
  • Rate limiting: The default 0.5s delay is polite; don't go below 0.2s for public sites
  • Same domain: Strongly recommended to keep `--same-domain` enabled
  • Monitor progress: Watch for high failure rates (they may indicate blocking)
  • Storage: Recursive crawls can generate many files; ensure sufficient disk space
  • Legal: Ensure you have permission to crawl and archive the target site
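robots.txt checking is built into Python's standard library, so a compliance check can be as small as the sketch below (how `scrape.py` implements it may differ; the function name is hypothetical):

```python
from urllib.robotparser import RobotFileParser

def allowed_by_robots(rules, url, agent="*"):
    """Check a URL against robots.txt rules (pass the file's text in `rules`)."""
    rp = RobotFileParser()
    rp.parse(rules.splitlines())
    return rp.can_fetch(agent, url)
```

In a live crawler you would instead point the parser at the site's robots.txt with `rp.set_url(...)` followed by `rp.read()`, then call `rp.can_fetch()` before each request.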

Troubleshooting

  • Connection errors: Check your internet connection and URL validity
  • 403/blocked: Some sites block scrapers; the tool uses realistic User-Agent headers
  • Timeout: Increase the `--timeout` value (in seconds) for slow-loading pages
  • Image download fails: Images fall back to their original URLs
  • Missing images: Some sites load images dynamically with JavaScript (not supported)