web-crawler

Rust Web Crawler (rcrawler)

High-performance web crawler built in pure Rust with production-grade features for fast, reliable site crawling.

When to Use This Skill

Use this skill when the user requests:
  • Web crawling or site mapping
  • Sitemap discovery and analysis
  • Link extraction and validation
  • Site structure visualization
  • robots.txt compliance checking
  • Performance-critical web scraping
  • Generating interactive web reports with graph visualization

Core Capabilities

🚀 Performance

  • 60+ pages/sec throughput with async Tokio runtime
  • <50ms startup time - Near-instant initialization
  • ~50MB memory usage - Efficient resource consumption
  • 5.4 MB binary - Single executable, no dependencies

🤖 Intelligence

  • Sitemap discovery: Automatically finds and parses sitemap.xml (3 standard locations)
  • robots.txt compliance: Respects crawling rules with per-domain caching
  • Smart filtering: Auto-excludes images, CSS, JS, PDFs by default
  • Domain auto-detection: Extracts and restricts to base domain automatically
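
Sitemap discovery can be sanity-checked by hand. The three standard locations are not enumerated in this document, so the candidate list below is an assumption based on common convention (sitemap.xml, sitemap_index.xml, and the Sitemap: directive inside robots.txt):

```shell
# Hypothetical helper: list conventional sitemap locations for a base URL.
# The exact three locations rcrawler probes are an assumption here.
sitemap_candidates() {
  base="${1%/}"                       # drop a trailing slash, if any
  printf '%s\n' \
    "$base/sitemap.xml" \
    "$base/sitemap_index.xml" \
    "$base/robots.txt"                # robots.txt may carry a Sitemap: directive
}

sitemap_candidates https://example.com
```

Piping each candidate through curl -sI shows which ones actually exist before starting a full crawl.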

🔒 Safety

  • Rate limiting: Token bucket algorithm (default 2 req/s)
  • Configurable timeout: 30 second default
  • Memory safe: Rust's ownership system prevents crashes
  • Graceful shutdown: 2-second grace period for pending requests
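
The token bucket itself lives inside the binary (via the governor crate listed under Key Dependencies). As a rough feel for what a 2 req/s budget means, here is a much simpler fixed-interval stand-in, not the actual algorithm:

```shell
# Fixed-interval pacing: at 2 req/s, successive requests are ~0.5 s apart.
# A real token bucket also allows short bursts; this sketch does not.
rate=2
interval=$(awk -v r="$rate" 'BEGIN { print 1 / r }')
for path in /a /b /c; do
  echo "GET https://example.com$path"   # placeholder for the real request
  sleep "$interval"
done
```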

📊 Output

  • Multiple formats: JSON, Markdown, HTML, CSV, Links, Text
  • LLM-ready Markdown: Clean content with YAML frontmatter
  • Interactive HTML report: Dashboard with graph visualization
  • Stealth mode: User-agent rotation and realistic headers
  • Content filtering: Remove nav, ads, scripts for clean data
  • Real-time progress: Updates every 5 seconds during crawl

📝 Monitoring

  • Structured logging: tracing with timestamps and log levels
  • Progress tracking:
    [Progress] Pages: X/Y | Active jobs: Z | Errors: N
  • Detailed statistics: Pages found, crawled, external links, errors, duration

Installation & Setup

Binary Location

bash
~/.claude/skills/web-crawler/bin/rcrawler

Build from Source

bash
# Clone the repository
git clone https://github.com/leobrival/rcrawler
cd rcrawler

# Build release binary
cargo build --release

# Copy to skill directory
cp target/release/rcrawler ~/.claude/skills/web-crawler/bin/

Build time: ~2 minutes
Binary size: 5.4 MB

Command Line Interface

Basic Syntax

bash
~/.claude/skills/web-crawler/bin/rcrawler <URL> [OPTIONS]

Options

Core Options:
  • -w, --workers <N>
    : Number of concurrent workers (default: 20, range: 1-50)
  • -d, --depth <N>
    : Maximum crawl depth (default: 2)
  • -r, --rate <N>
    : Rate limit in requests/second (default: 2.0)
Configuration:
  • -p, --profile <NAME>
    : Use predefined profile (fast/deep/gentle)
  • --domain <DOMAIN>
    : Restrict to specific domain (auto-detected from URL)
  • -o, --output <PATH>
    : Custom output directory (default: ./output)
Features:
  • -s, --sitemap
    : Enable/disable sitemap discovery (default: true)
  • --stealth
    : Enable stealth mode with user-agent rotation
  • --markdown
    : Convert HTML to LLM-ready Markdown with frontmatter
  • --filter-content
    : Enable content filtering (remove nav, ads, scripts)
  • --debug
    : Enable debug logging with detailed trace information
  • --resume
    : Resume from checkpoint if available
Output:
  • -f, --formats <LIST>
    : Output formats (json,markdown,html,csv,links,text)

Profiles

Fast Profile (Quick Mapping)

bash
~/.claude/skills/web-crawler/bin/rcrawler <URL> -p fast
  • Workers: 50
  • Depth: 3
  • Rate: 10 req/s
  • Use case: Quick site structure overview

Deep Profile (Comprehensive Crawl)

bash
~/.claude/skills/web-crawler/bin/rcrawler <URL> -p deep
  • Workers: 20
  • Depth: 10
  • Rate: 3 req/s
  • Use case: Complete site analysis

Gentle Profile (Server-Friendly)

bash
~/.claude/skills/web-crawler/bin/rcrawler <URL> -p gentle
  • Workers: 5
  • Depth: 5
  • Rate: 1 req/s
  • Use case: Respecting server resources

Usage Examples

Example 1: Basic Crawl

bash
~/.claude/skills/web-crawler/bin/rcrawler https://example.com
Output:
console
[2026-01-10T01:17:27Z] INFO Starting crawl of: https://example.com
[2026-01-10T01:17:27Z] INFO Config: 20 workers, depth 2
Fetching sitemap URLs...
[Progress] Pages: 50/120 | Active jobs: 15 | Errors: 0
[Progress] Pages: 100/180 | Active jobs: 8 | Errors: 0

Crawl complete!
Pages crawled: 150
Duration: 8542ms
Results saved to: ./output/results.json
HTML report: ./output/index.html

Example 2: Stealth Mode with Markdown Export

bash
~/.claude/skills/web-crawler/bin/rcrawler https://docs.example.com \
  --stealth --markdown -f markdown -d 3
Use case: Content extraction for LLM/RAG pipelines
Expected: Clean Markdown with frontmatter, anti-detection headers

Example 3: Fast Scan

bash
~/.claude/skills/web-crawler/bin/rcrawler https://blog.example.com -p fast
Use case: Quick blog mapping
Expected: 50 workers, depth 3, ~3-5 seconds for 100 pages

Example 4: Multi-Format Export

bash
~/.claude/skills/web-crawler/bin/rcrawler https://example.com \
  -f json,markdown,csv,links -o ./export
Use case: Export data in multiple formats simultaneously
Expected: Generates results.json, results.md, results.csv, results.txt

Example 5: Debug Mode

bash
~/.claude/skills/web-crawler/bin/rcrawler https://example.com --debug
Output: Detailed trace logs for troubleshooting

Output Format

Directory Structure

text
./output/
├── results.json       # Structured crawl data
├── results.md         # LLM-ready Markdown (with --markdown)
├── results.html       # Interactive report
├── results.csv        # Spreadsheet format (with -f csv)
├── results.txt        # URL list (with -f links)
└── checkpoint.json    # Auto-saved state (every 30s)

JSON Structure (results.json)

json
{
  "stats": {
    "pages_found": 450,
    "pages_crawled": 450,
    "external_links": 23,
    "excluded_links": 89,
    "errors": 0,
    "start_time": "2026-01-10T01:00:00Z",
    "end_time": "2026-01-10T01:00:07Z",
    "duration": 7512
  },
  "results": [
    {
      "url": "https://example.com",
      "title": "Example Domain",
      "status_code": 200,
      "depth": 0,
      "links": ["https://example.com/page1", "..."],
      "crawled_at": "2026-01-10T01:00:01Z",
      "content_type": "text/html"
    }
  ]
}
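
Given the schema above, headline numbers can be pulled out of results.json with jq. These helpers are a sketch, assuming jq is installed; the field names match the sample document:

```shell
# Summarize a crawl from its results.json (field names as shown above)
crawl_summary() {
  jq -r '"\(.stats.pages_crawled) pages in \(.stats.duration) ms, \(.stats.errors) errors"' "$1"
}

# One line per crawled page: HTTP status, then URL
crawl_statuses() {
  jq -r '.results[] | "\(.status_code)\t\(.url)"' "$1"
}

# Example: crawl_summary ./output/results.json
```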

HTML Report Features

  • Interactive dashboard with key statistics
  • Graph visualization using force-graph library
  • Node sizing based on link count (logarithmic scale)
  • Status color coding: Green (success), red (errors)
  • Hover tooltips: In-degree and out-degree information
  • Click to navigate: Opens page URL in new tab
  • Light/dark mode: Auto-detection via CSS
  • Collapsible sections: Reduces scroll for large crawls
  • Mobile responsive: Works on all devices

Implementation Workflow

When a user requests a crawl, follow these steps:

1. Parse Request

Extract from user message:
  • URL (required): Target website
  • Workers (optional): Number of concurrent workers
  • Depth (optional): Maximum crawl depth
  • Rate (optional): Requests per second
  • Profile (optional): fast/deep/gentle

2. Validate Input

  • Check URL format (add https:// if missing)
  • Validate workers range (1-50)
  • Validate depth (1-10 recommended)
  • Validate rate (0.1-20.0 recommended)
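
The first two checks can be sketched directly in shell (function names here are illustrative, not part of rcrawler):

```shell
# Prepend https:// when the user omitted a scheme
normalize_url() {
  case "$1" in
    http://*|https://*) printf '%s\n' "$1" ;;
    *)                  printf 'https://%s\n' "$1" ;;
  esac
}

# Enforce the documented worker range (1-50); exit status signals validity
valid_workers() { [ "$1" -ge 1 ] && [ "$1" -le 50 ]; }

normalize_url example.com        # prints https://example.com
```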

3. Build Command

bash
~/.claude/skills/web-crawler/bin/rcrawler <URL> \
  -w <workers> \
  -d <depth> \
  -r <rate> \
  [--debug] \
  [-o <output>]

4. Execute Crawl

Use Bash tool to run the command:
bash
~/.claude/skills/web-crawler/bin/rcrawler https://example.com -w 20 -d 2

5. Monitor Progress

Watch for progress updates in output:
  • [Progress] Pages: X/Y | Active jobs: Z | Errors: N
  • Updates appear every 5 seconds
  • Shows real-time crawl status

6. Report Results

When crawl completes, inform user:
  • Number of pages crawled
  • Duration in seconds or minutes
  • Path to results:
    ./output/results.json
  • Path to HTML report:
    ./output/index.html
  • Offer to open HTML report:
    open ./output/index.html

Natural Language Parsing

Example User Requests

Request: "Crawl docs.example.com"
Parse: URL = https://docs.example.com, use defaults
Command: rcrawler https://docs.example.com

Request: "Quick scan of blog.example.com"
Parse: URL = blog.example.com, profile = fast
Command: rcrawler https://blog.example.com -p fast

Request: "Deep crawl of api-docs.example.com with 40 workers"
Parse: URL = api-docs.example.com, workers = 40, depth = 5
Command: rcrawler https://api-docs.example.com -w 40 -d 5

Request: "Crawl example.com carefully, don't overload their server"
Parse: URL = example.com, profile = gentle
Command: rcrawler https://example.com -p gentle

Request: "Map the structure of help.example.com"
Parse: URL = help.example.com, depth = 3 (moderate)
Command: rcrawler https://help.example.com -d 3
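
The keyword-to-profile mapping above is performed by the assistant, not the binary, but it can be sketched as a case match (the keyword list is illustrative only, mirroring the examples above):

```shell
# Illustrative request-keyword → profile mapping (hypothetical helper)
pick_profile() {
  req=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$req" in
    *quick*|*fast*)                echo fast ;;
    *deep*|*comprehensive*)        echo deep ;;
    *careful*|*gentle*|*overload*) echo gentle ;;
    *)                             echo default ;;
  esac
}

pick_profile "Quick scan of blog.example.com"   # prints fast
```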

Error Handling

Binary Not Found

bash
# Check if binary exists
ls ~/.claude/skills/web-crawler/bin/rcrawler

# If missing, build it
cd ~/.claude/skills/web-crawler/scripts && cargo build --release

Crawl Failures

Network errors:
  • Verify URL is accessible:
    curl -I <URL>
  • Check if site is down or blocking crawlers
  • Try with lower rate:
    -r 1
robots.txt blocking:
  • Crawler respects robots.txt by default
  • Check rules:
    curl <URL>/robots.txt
  • Inform user of restrictions
Timeout errors:
  • Increase timeout in code (default 30s)
  • Reduce workers:
    -w 10
  • Lower rate limit:
    -r 1
Too many errors:
  • Enable debug mode:
    --debug
  • Check specific failing URLs
  • May need to exclude certain patterns

Performance Benchmarks

Test: adonisjs.com

  • Pages: 450
  • Duration: 7.5 seconds
  • Throughput: 60 pages/sec
  • Workers: 20
  • Depth: 2
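
The throughput figure is consistent with the page count and duration:

```shell
# 450 pages / 7.5 s = 60 pages/sec
awk 'BEGIN { printf "%.0f\n", 450 / 7.5 }'   # prints 60
```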

Test: rust-lang.org

  • Pages: 16
  • Duration: 3.9 seconds
  • Workers: 10
  • Depth: 1

Test: example.com

  • Pages: 2
  • Duration: 2.7 seconds
  • Workers: 5
  • Depth: 1

Technical Architecture

Core Components

  1. CrawlEngine (src/crawler/engine.rs)
    • Worker pool management
    • Job queue coordination
    • Shutdown signaling
    • Statistics tracking
  2. RobotsChecker (src/crawler/robots.rs)
    • Per-domain caching
    • Rule validation
    • Fallback on errors
  3. RateLimiter (src/crawler/rate_limiter.rs)
    • Token bucket algorithm
    • Configurable rate
    • Shared across workers
  4. UrlFilter (src/utils/filters.rs)
    • Regex-based filtering
    • Include/exclude patterns
    • Default exclusions
  5. HtmlParser (src/parser/html.rs)
    • CSS selector queries
    • Title extraction
    • Link discovery
  6. SitemapParser (src/parser/sitemap.rs)
    • XML parsing
    • Index traversal
    • URL extraction

Key Dependencies

  • tokio: Async runtime (multi-threaded)
  • reqwest: HTTP client (connection pooling)
  • scraper: HTML parsing (CSS selectors)
  • quick-xml: Sitemap parsing
  • governor: Rate limiting (token bucket)
  • tracing: Structured logging
  • dashmap: Concurrent HashMap
  • robotstxt: robots.txt compliance
  • clap: CLI argument parsing
  • serde/serde_json: Serialization

Tips & Best Practices

1. Start with Default Settings

First crawl should use defaults to understand site structure.

2. Use Profiles for Common Scenarios

  • fast: Quick overviews
  • deep: Comprehensive analysis
  • gentle: Respectful crawling

3. Monitor Progress

Watch the [Progress] lines to ensure the crawl is progressing.

4. Check HTML Report

Interactive visualization helps understand site structure better than JSON.

5. Respect Rate Limits

Default 2 req/s is safe for most sites. Increase cautiously.

6. Enable Debug for Issues

The --debug flag provides detailed logs for troubleshooting.

7. Review robots.txt

Check <URL>/robots.txt to understand crawling restrictions.

8. Use Custom Output for Multiple Crawls

Use the -o flag to give each crawl its own output directory and avoid overwriting previous results.

Future Enhancements (V2.0)

  • Checkpoint resume: Full integration of checkpoint system
  • Per-domain rate limiting: Different rates for different domains
  • JavaScript rendering: chromiumoxide for dynamic sites
  • Distributed crawling: Redis-based job queue
  • Advanced analytics: SEO analysis, link quality scoring

Support & Resources

  • GitHub Repository: leobrival/rcrawler
  • Binary:
    ~/.claude/skills/web-crawler/bin/rcrawler
  • Skill Documentation: This file (SKILL.md)
  • Quick Start: README.md (this repository)
  • Development Guide: DEVELOPMENT.md

Version: 1.0.0
Status: Production Ready