# Website Crawler
High-performance web crawler with a TypeScript/Bun frontend and a Go backend for discovering and mapping website structure.
## When to Use
Use this skill when users ask to:
- Crawl a website or "spider a site"
- Map site structure or "discover all pages"
- Find all URLs on a website
- Generate sitemap or site report
- Analyze link relationships between pages
- Audit website coverage or completeness
- Extract page metadata (titles, status codes)
Keywords: crawl, spider, map, discover pages, site structure, sitemap, all URLs, website audit
## Quick Start
Run the crawler from the scripts directory:
```bash
cd ~/.claude/scripts/crawler
bun src/index.ts <URL> [options]
```

## CLI Options
| Option | Short | Default | Description |
|---|---|---|---|
| `--depth` |  | 2 | Maximum crawl depth |
| `--workers` |  | 20 | Concurrent workers |
| `--rate` |  | 2 | Rate limit (requests/second) |
| `--profile` |  | - | Use preset profile (fast/deep/gentle) |
|  |  | auto | Output directory |
|  |  | true | Use sitemap.xml for discovery |
|  |  | auto | Allowed domain (extracted from URL) |
|  | - | false | Enable debug logging |
## Profiles
Three preset profiles for common use cases:
| Profile | Workers | Depth | Rate | Use Case |
|---|---|---|---|---|
| `fast` | 50 | 3 | 10 | Quick site mapping |
| `deep` | 20 | 10 | 3 | Thorough crawling |
| `gentle` | 5 | 5 | 1 | Respect server limits |
## Usage Examples
### Basic crawl
```bash
bun src/index.ts https://example.com
```

### Deep crawl with high concurrency
```bash
bun src/index.ts https://example.com --depth 5 --workers 30 --rate 5
```

### Using a profile
```bash
bun src/index.ts https://example.com --profile fast
```

### Gentle crawl (avoid rate limiting)
```bash
bun src/index.ts https://example.com --profile gentle
```

## Output
The crawler generates two files in the output directory:
- `results.json` - Structured crawl data with all discovered pages
- `index.html` - Dark-themed HTML report with statistics
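One of the listed use cases is generating a sitemap, and the `results` array is enough to produce a standard sitemap.xml. A minimal TypeScript sketch, using made-up sample data in the shape of the results schema shown below (not real crawler output):

```typescript
// Sketch: emit a sitemap.xml from crawl results.
// The CrawlResult shape mirrors the results.json schema; the sample
// entries below are illustrative assumptions, not actual output.
interface CrawlResult {
  url: string;
  title: string;
  status_code: number;
  depth: number;
  links: string[];
  content_type: string;
}

const results: CrawlResult[] = [
  { url: "https://example.com/", title: "Home", status_code: 200, depth: 0, links: [], content_type: "text/html" },
  { url: "https://example.com/about", title: "About", status_code: 200, depth: 1, links: [], content_type: "text/html" },
  { url: "https://example.com/missing", title: "", status_code: 404, depth: 1, links: [], content_type: "text/html" },
];

// Include only successfully crawled HTML pages in the sitemap.
const entries = results
  .filter((r) => r.status_code === 200 && r.content_type.startsWith("text/html"))
  .map((r) => `  <url><loc>${r.url}</loc></url>`)
  .join("\n");

const sitemap = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
${entries}
</urlset>`;

console.log(sitemap);
```

In practice the `results` array would be read from the generated `results.json` instead of being declared inline.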
### Results JSON Structure
```json
{
  "stats": {
    "pages_found": 150,
    "pages_crawled": 147,
    "external_links": 23,
    "errors": 3,
    "duration": 45.2
  },
  "results": [
    {
      "url": "https://example.com/page",
      "title": "Page Title",
      "status_code": 200,
      "depth": 1,
      "links": ["..."],
      "content_type": "text/html"
    }
  ]
}
```
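The per-page `links` arrays also support simple link-relationship queries, such as mapping each broken URL back to the pages that link to it. A hedged TypeScript sketch with invented sample data (field names follow the schema above, trimmed to the fields used):

```typescript
// Sketch: find which pages link to URLs that returned an error.
// Sample data is illustrative, not real crawler output.
interface PageResult {
  url: string;
  status_code: number;
  links: string[];
}

const pages: PageResult[] = [
  { url: "https://example.com/", status_code: 200, links: ["https://example.com/a", "https://example.com/gone"] },
  { url: "https://example.com/a", status_code: 200, links: ["https://example.com/gone"] },
  { url: "https://example.com/gone", status_code: 404, links: [] },
];

// URLs whose crawl returned a client or server error.
const broken = new Set(pages.filter((p) => p.status_code >= 400).map((p) => p.url));

// Map each broken URL to the list of pages that reference it.
const referrers = new Map<string, string[]>();
for (const page of pages) {
  for (const link of page.links) {
    if (broken.has(link)) {
      referrers.set(link, [...(referrers.get(link) ?? []), page.url]);
    }
  }
}

console.log(referrers);
```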
## Features
- **Sitemap Discovery**: Automatically finds and parses sitemap.xml
- **Checkpoint/Resume**: Auto-saves progress every 30 seconds
- **Rate Limiting**: Token bucket algorithm prevents server overload
- **Concurrent Crawling**: Go worker pool for high performance
- **HTML Reports**: Dark-themed, mobile-responsive reports
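The rate limiter is described as a token bucket: each request consumes a token, and tokens refill at the configured rate. As a rough illustration of the idea only (not the Go engine's actual code), a minimal TypeScript sketch:

```typescript
// Token bucket sketch: requests spend tokens; tokens refill at a fixed rate.
// Capacity and refill values here are illustrative assumptions.
class TokenBucket {
  private tokens: number;
  private last: number;

  constructor(private capacity: number, private refillPerSec: number, now = 0) {
    this.tokens = capacity;
    this.last = now;
  }

  // Returns true if a request may proceed at time `now` (in seconds).
  tryAcquire(now: number): boolean {
    // Refill proportionally to elapsed time, capped at capacity.
    this.tokens = Math.min(this.capacity, this.tokens + (now - this.last) * this.refillPerSec);
    this.last = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// With capacity 2 and 2 tokens/sec, a burst of three simultaneous
// requests lets the first two through and rejects the third; the
// rejected request succeeds once tokens have refilled.
const bucket = new TokenBucket(2, 2);
const burst = [bucket.tryAcquire(0), bucket.tryAcquire(0), bucket.tryAcquire(0)];
console.log(burst);
```

This smooths traffic to the target server while still allowing short bursts up to the bucket's capacity, which is why it pairs well with the `--rate` option.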
## Troubleshooting
### Rate limiting errors
Reduce the rate limit or use the gentle profile:

```bash
bun src/index.ts <url> --rate 1
```

or

```bash
bun src/index.ts <url> --profile gentle
```

### Go binary not found
The TypeScript frontend auto-compiles the Go binary. If compilation fails:
```bash
cd ~/.claude/scripts/crawler/engine
go build -o crawler main.go
```

### Timeout on large sites
Reduce depth or increase workers:
```bash
bun src/index.ts <url> --depth 1 --workers 50
```

## Architecture
For detailed architecture, Go engine specifications, and code conventions, see `reference.md`.
## Related Files
- Command: `plugins/crawler/commands/crawler.md`
- Reference: `plugins/crawler/skills/website-crawler/reference.md`
- Scripts: `plugins/crawler/skills/website-crawler/scripts/`
- Profiles: `plugins/crawler/skills/website-crawler/scripts/config/profiles/`