
Website Crawler

High-performance web crawler with TypeScript/Bun frontend and Go backend for discovering and mapping website structure.

When to Use

Use this skill when users ask to:
  • Crawl a website or "spider a site"
  • Map site structure or "discover all pages"
  • Find all URLs on a website
  • Generate sitemap or site report
  • Analyze link relationships between pages
  • Audit website coverage or completeness
  • Extract page metadata (titles, status codes)
Keywords: crawl, spider, map, discover pages, site structure, sitemap, all URLs, website audit

Quick Start

Run the crawler from the scripts directory:
```bash
cd ~/.claude/scripts/crawler
bun src/index.ts <URL> [options]
```

CLI Options

| Option | Short | Default | Description |
| --- | --- | --- | --- |
| `--depth` | `-D` | 2 | Maximum crawl depth |
| `--workers` | `-w` | 20 | Concurrent workers |
| `--rate` | `-r` | 2 | Rate limit (requests/second) |
| `--profile` | `-p` | - | Use preset profile (fast/deep/gentle) |
| `--output` | `-o` | auto | Output directory |
| `--sitemap` | `-s` | true | Use sitemap.xml for discovery |
| `--domain` | `-d` | auto | Allowed domain (extracted from URL) |
| `--debug` | - | false | Enable debug logging |

Profiles

Three preset profiles for common use cases:
| Profile | Workers | Depth | Rate | Use Case |
| --- | --- | --- | --- | --- |
| `fast` | 50 | 3 | 10 | Quick site mapping |
| `deep` | 20 | 10 | 3 | Thorough crawling |
| `gentle` | 5 | 5 | 1 | Respect server limits |

Usage Examples

Basic crawl

```bash
bun src/index.ts https://example.com
```

Deep crawl with high concurrency

```bash
bun src/index.ts https://example.com --depth 5 --workers 30 --rate 5
```

Using a profile

```bash
bun src/index.ts https://example.com --profile fast
```

Gentle crawl (avoid rate limiting)

```bash
bun src/index.ts https://example.com --profile gentle
```

Output

The crawler generates two files in the output directory:
  1. results.json - Structured crawl data with all discovered pages
  2. index.html - Dark-themed HTML report with statistics

Results JSON Structure

```json
{
  "stats": {
    "pages_found": 150,
    "pages_crawled": 147,
    "external_links": 23,
    "errors": 3,
    "duration": 45.2
  },
  "results": [
    {
      "url": "https://example.com/page",
      "title": "Page Title",
      "status_code": 200,
      "depth": 1,
      "links": ["..."],
      "content_type": "text/html"
    }
  ]
}
```
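Because `results.json` is plain structured data, it can be post-processed directly. A hedged sketch (the `PageResult` interface mirrors the schema above; `failedPages` is a hypothetical helper, not part of the tool):

```typescript
// Shape of one crawled page, per the results.json schema above.
interface PageResult {
  url: string;
  title: string;
  status_code: number;
  depth: number;
  links: string[];
  content_type: string;
}

// Return the URLs of pages that did not come back 200 OK.
function failedPages(results: PageResult[]): string[] {
  return results
    .filter((p) => p.status_code !== 200)
    .map((p) => p.url);
}
```

Feeding it the `results` array from a crawl report lists every URL that returned a non-200 status, which is handy for audits.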

Features

  • Sitemap Discovery: Automatically finds and parses sitemap.xml
  • Checkpoint/Resume: Auto-saves progress every 30 seconds
  • Rate Limiting: Token bucket algorithm prevents server overload
  • Concurrent Crawling: Go worker pool for high performance
  • HTML Reports: Dark-themed, mobile-responsive reports
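The token-bucket limiter mentioned above grants a request only when a "token" is available, refilling tokens at the configured rate (requests/second) up to a burst capacity. A simplified illustration of the algorithm (not the crawler's actual Go implementation):

```typescript
// Simplified token bucket: `rate` tokens are added per second, capped at
// `capacity`. tryAcquire() consumes one token if available.
class TokenBucket {
  private tokens: number;
  private last: number;

  constructor(
    private rate: number,     // tokens per second (requests/second)
    private capacity: number, // maximum burst size
    private now: () => number = () => Date.now(), // injectable clock (ms)
  ) {
    this.tokens = capacity;
    this.last = now();
  }

  tryAcquire(): boolean {
    const t = this.now();
    // Refill proportionally to elapsed time, without exceeding capacity.
    this.tokens = Math.min(
      this.capacity,
      this.tokens + ((t - this.last) / 1000) * this.rate,
    );
    this.last = t;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}
```

The injectable clock (`now`) is only there so the refill logic can be exercised deterministically; a real limiter would simply use wall-clock time and have workers wait until `tryAcquire()` succeeds.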

Troubleshooting

Rate limiting errors

Reduce the rate limit or use the gentle profile:
```bash
bun src/index.ts <url> --rate 1
```

or

```bash
bun src/index.ts <url> --profile gentle
```

Go binary not found

The TypeScript frontend auto-compiles the Go binary. If compilation fails, build it manually:

```bash
cd ~/.claude/scripts/crawler/engine
go build -o crawler main.go
```

Timeout on large sites

Reduce depth or increase workers:
```bash
bun src/index.ts <url> --depth 1 --workers 50
```

Architecture

For detailed architecture, Go engine specifications, and code conventions, see reference.md.

Related Files

  • Command:
    plugins/crawler/commands/crawler.md
  • Reference:
    plugins/crawler/skills/website-crawler/reference.md
  • Scripts:
    plugins/crawler/skills/website-crawler/scripts/
  • Profiles:
    plugins/crawler/skills/website-crawler/scripts/config/profiles/