
Website Crawler

High-performance web crawler with TypeScript/Bun frontend and Go backend for discovering and mapping website structure.

When to Use

Use this skill when users ask to:
  • Crawl a website or "spider a site"
  • Map site structure or "discover all pages"
  • Find all URLs on a website
  • Generate sitemap or site report
  • Analyze link relationships between pages
  • Audit website coverage or completeness
  • Extract page metadata (titles, status codes)
Keywords: crawl, spider, map, discover pages, site structure, sitemap, all URLs, website audit

Quick Start

Run the crawler from the scripts directory:
```bash
cd ~/.claude/scripts/crawler
bun src/index.ts <URL> [options]
```

CLI Options

| Option | Short | Default | Description |
| --- | --- | --- | --- |
| `--depth` | `-D` | 2 | Maximum crawl depth |
| `--workers` | `-w` | 20 | Concurrent workers |
| `--rate` | `-r` | 2 | Rate limit (requests/second) |
| `--profile` | `-p` | - | Use preset profile (fast/deep/gentle) |
| `--output` | `-o` | auto | Output directory |
| `--sitemap` | `-s` | true | Use sitemap.xml for discovery |
| `--domain` | `-d` | auto | Allowed domain (extracted from URL) |
| `--debug` | - | false | Enable debug logging |

Profiles

Three preset profiles for common use cases:
| Profile | Workers | Depth | Rate | Use Case |
| --- | --- | --- | --- | --- |
| `fast` | 50 | 3 | 10 | Quick site mapping |
| `deep` | 20 | 10 | 3 | Thorough crawling |
| `gentle` | 5 | 5 | 1 | Respect server limits |

Usage Examples

Basic crawl

```bash
bun src/index.ts https://example.com
```

Deep crawl with high concurrency

```bash
bun src/index.ts https://example.com --depth 5 --workers 30 --rate 5
```

Using a profile

```bash
bun src/index.ts https://example.com --profile fast
```

Gentle crawl (avoid rate limiting)

```bash
bun src/index.ts https://example.com --profile gentle
```

Output

The crawler generates two files in the output directory:
  1. results.json - Structured crawl data with all discovered pages
  2. index.html - Dark-themed HTML report with statistics

Results JSON Structure

```json
{
  "stats": {
    "pages_found": 150,
    "pages_crawled": 147,
    "external_links": 23,
    "errors": 3,
    "duration": 45.2
  },
  "results": [
    {
      "url": "https://example.com/page",
      "title": "Page Title",
      "status_code": 200,
      "depth": 1,
      "links": ["..."],
      "content_type": "text/html"
    }
  ]
}
```
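Because `results.json` is plain structured data, it can be post-processed directly. A hedged sketch (the `PageResult` interface mirrors the schema above; `failedPages` is a hypothetical helper, not part of the tool):

```typescript
// Shape of one crawled page, per the results.json schema above.
interface PageResult {
  url: string;
  title: string;
  status_code: number;
  depth: number;
  links: string[];
  content_type: string;
}

// Return the URLs of pages that did not come back 200 OK.
function failedPages(results: PageResult[]): string[] {
  return results
    .filter((p) => p.status_code !== 200)
    .map((p) => p.url);
}
```

Feeding it the `results` array from a crawl report lists every URL that returned a non-200 status, which is handy for audits.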

Features

  • Sitemap Discovery: Automatically finds and parses sitemap.xml
  • Checkpoint/Resume: Auto-saves progress every 30 seconds
  • Rate Limiting: Token bucket algorithm prevents server overload
  • Concurrent Crawling: Go worker pool for high performance
  • HTML Reports: Dark-themed, mobile-responsive reports
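The token-bucket limiter mentioned above grants a request only when a "token" is available, refilling tokens at the configured rate (requests/second) up to a burst capacity. A simplified illustration of the algorithm (not the crawler's actual Go implementation):

```typescript
// Simplified token bucket: `rate` tokens are added per second, capped at
// `capacity`. tryAcquire() consumes one token if available.
class TokenBucket {
  private tokens: number;
  private last: number;

  constructor(
    private rate: number,     // tokens per second (requests/second)
    private capacity: number, // maximum burst size
    private now: () => number = () => Date.now(), // injectable clock (ms)
  ) {
    this.tokens = capacity;
    this.last = now();
  }

  tryAcquire(): boolean {
    const t = this.now();
    // Refill proportionally to elapsed time, without exceeding capacity.
    this.tokens = Math.min(
      this.capacity,
      this.tokens + ((t - this.last) / 1000) * this.rate,
    );
    this.last = t;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}
```

The injectable clock (`now`) is only there so the refill logic can be exercised deterministically; a real limiter would simply use wall-clock time and have workers wait until `tryAcquire()` succeeds.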

Troubleshooting

Rate limiting errors

Reduce the rate limit or use the gentle profile:
```bash
bun src/index.ts <url> --rate 1
```

or

```bash
bun src/index.ts <url> --profile gentle
```

Go binary not found

The TypeScript frontend auto-compiles the Go binary. If compilation fails, build it manually:

```bash
cd ~/.claude/scripts/crawler/engine
go build -o crawler main.go
```

Timeout on large sites

Reduce depth or increase workers:
```bash
bun src/index.ts <url> --depth 1 --workers 50
```

Architecture

For detailed architecture, Go engine specifications, and code conventions, see reference.md.

Related Files

  • Command:
    plugins/crawler/commands/crawler.md
  • Reference:
    plugins/crawler/skills/website-crawler/reference.md
  • Scripts:
    plugins/crawler/skills/website-crawler/scripts/
  • Profiles:
    plugins/crawler/skills/website-crawler/scripts/config/profiles/