# Crawl Skill
Crawl websites to extract content from multiple pages. Ideal for documentation, knowledge bases, and site-wide content extraction.
## Prerequisites
**Tavily API Key Required** - Get your key at https://tavily.com

Add to `~/.claude/settings.json`:

```json
{
  "env": {
    "TAVILY_API_KEY": "tvly-your-api-key-here"
  }
}
```
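To confirm the configuration is well-formed before first use, a quick sanity check can be run against the file. This is a minimal sketch, not part of the skill: it writes and parses a sample copy here; point `json.load` at your real `~/.claude/settings.json` in practice.

```shell
# Sketch: verify the settings file parses as JSON and carries the key.
# A sample file is written locally; use ~/.claude/settings.json in practice.
cat > settings.sample.json <<'EOF'
{
  "env": {
    "TAVILY_API_KEY": "tvly-your-api-key-here"
  }
}
EOF

python3 - <<'EOF'
import json

cfg = json.load(open("settings.sample.json"))
key = cfg.get("env", {}).get("TAVILY_API_KEY", "")
# Tavily keys start with "tvly-"
print("key present" if key.startswith("tvly-") else "key missing")
EOF
```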
## Quick Start
### Using the Script
```bash
./scripts/crawl.sh '<json>' [output_dir]
```

Examples:

```bash
# Basic crawl
./scripts/crawl.sh '{"url": "https://docs.example.com"}'

# Deeper crawl with limits
./scripts/crawl.sh '{"url": "https://docs.example.com", "max_depth": 2, "limit": 50}'

# Save to files
./scripts/crawl.sh '{"url": "https://docs.example.com", "max_depth": 2}' ./docs

# Focused crawl with path filters
./scripts/crawl.sh '{"url": "https://example.com", "max_depth": 2, "select_paths": ["/docs/.*", "/api/.*"], "exclude_paths": ["/blog/.*"]}'

# With semantic instructions (for agentic use)
./scripts/crawl.sh '{"url": "https://docs.example.com", "instructions": "Find API documentation", "chunks_per_source": 3}'
```

When `output_dir` is provided, each crawled page is saved as a separate markdown file.
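The script takes its payload as a single JSON string, so a quoting mistake only surfaces once the request reaches the API. A hypothetical pre-flight check (not part of the skill itself) can validate the payload locally first:

```shell
# Hypothetical pre-flight check: validate the JSON argument before
# handing it to crawl.sh, so quoting mistakes surface locally.
payload='{"url": "https://docs.example.com", "max_depth": 2, "limit": 50}'

if echo "$payload" | python3 -m json.tool > /dev/null 2>&1; then
  echo "payload OK"
  # ./scripts/crawl.sh "$payload"   # uncomment to actually crawl
else
  echo "payload invalid - check your quoting" >&2
  exit 1
fi
```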
## Basic Crawl
```bash
curl --request POST \
  --url https://api.tavily.com/crawl \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://docs.example.com",
    "max_depth": 1,
    "limit": 20
  }'
```

## Focused Crawl with Instructions
```bash
curl --request POST \
  --url https://api.tavily.com/crawl \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://docs.example.com",
    "max_depth": 2,
    "instructions": "Find API documentation and code examples",
    "chunks_per_source": 3,
    "select_paths": ["/docs/.*", "/api/.*"]
  }'
```

## API Reference

### Endpoint

```
POST https://api.tavily.com/crawl
```

### Headers
| Header | Value |
|---|---|
| `Authorization` | `Bearer $TAVILY_API_KEY` |
| `Content-Type` | `application/json` |

### Request Body
| Field | Type | Default | Description |
|---|---|---|---|
| `url` | string | Required | Root URL to begin crawling |
| `max_depth` | integer | 1 | Levels deep to crawl (1-5) |
| `max_breadth` | integer | 20 | Links per page |
| `limit` | integer | 50 | Total pages cap |
| `instructions` | string | null | Natural language guidance for focus |
| `chunks_per_source` | integer | 3 | Chunks per page (1-5, requires instructions) |
| `select_paths` | array | null | Regex patterns to include |
| `exclude_paths` | array | null | Regex patterns to exclude |
| `allow_external` | boolean | true | Include external domain links |
| `timeout` | float | 150 | Max wait (10-150 seconds) |

### Response Format
```json
{
  "base_url": "https://docs.example.com",
  "results": [
    {
      "url": "https://docs.example.com/page",
      "raw_content": "# Page Title\n\nContent..."
    }
  ],
  "response_time": 12.5
}
```
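As a post-processing sketch (a sample response is inlined below; in practice, redirect the `curl` output to `response.json` first), the crawled page URLs can be pulled out of `results`:

```shell
# Sketch: list the crawled page URLs from a saved response.
# A sample response matching the format above is written locally;
# in practice, redirect the curl output to response.json first.
cat > response.json <<'EOF'
{
  "base_url": "https://docs.example.com",
  "results": [
    {"url": "https://docs.example.com/page", "raw_content": "# Page Title\n\nContent..."}
  ],
  "response_time": 12.5
}
EOF

python3 - <<'EOF'
import json

data = json.load(open("response.json"))
for result in data["results"]:
    print(result["url"])
EOF
```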
## Depth vs Performance
| Depth | Typical Pages | Time |
|---|---|---|
| 1 | 10-50 | Seconds |
| 2 | 50-500 | Minutes |
| 3 | 500-5000 | Many minutes |

Start with `max_depth=1` and increase only if needed.

## Crawl for Context vs Data Collection
**For agentic use (feeding results into context):** Always use `instructions` + `chunks_per_source`. This returns only relevant chunks instead of full pages, preventing context window explosion.

**For data collection (saving to files):** Omit `chunks_per_source` to get full page content.

## Examples

### For Context: Agentic Research (Recommended)
Use when feeding crawl results into an LLM context:

```bash
curl --request POST \
  --url https://api.tavily.com/crawl \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://docs.example.com",
    "max_depth": 2,
    "instructions": "Find API documentation and authentication guides",
    "chunks_per_source": 3
  }'
```

Returns only the most relevant chunks (max 500 chars each) per page - fits in context without overwhelming it.

### For Context: Targeted Technical Docs
```bash
curl --request POST \
  --url https://api.tavily.com/crawl \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://example.com",
    "max_depth": 2,
    "instructions": "Find all documentation about authentication and security",
    "chunks_per_source": 3,
    "select_paths": ["/docs/.*", "/api/.*"]
  }'
```

### For Data Collection: Full Page Archive
Use when saving content to files for later processing:

```bash
curl --request POST \
  --url https://api.tavily.com/crawl \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://example.com/blog",
    "max_depth": 2,
    "max_breadth": 50,
    "select_paths": ["/blog/.*"],
    "exclude_paths": ["/blog/tag/.*", "/blog/category/.*"]
  }'
```

Returns full page content - use the script with `output_dir` to save as markdown files.

## Map API (URL Discovery)
Use `map` instead of `crawl` when you only need URLs, not content:

```bash
curl --request POST \
  --url https://api.tavily.com/map \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://docs.example.com",
    "max_depth": 2,
    "instructions": "Find all API docs and guides"
  }'
```

Returns URLs only (faster than crawl):

```json
{
  "base_url": "https://docs.example.com",
  "results": [
    "https://docs.example.com/api/auth",
    "https://docs.example.com/guides/quickstart"
  ]
}
```
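One way to combine the two endpoints is to map first and act on each discovered URL afterwards. This is a sketch only: a sample Map response is inlined, and the per-URL action is a placeholder `echo` to swap for a real fetch.

```shell
# Sketch: iterate over the URLs from a saved Map response (map.json).
# The sample mirrors the Map response shape; in practice, redirect the
# curl output from /map to map.json first.
cat > map.json <<'EOF'
{
  "base_url": "https://docs.example.com",
  "results": [
    "https://docs.example.com/api/auth",
    "https://docs.example.com/guides/quickstart"
  ]
}
EOF

python3 - <<'EOF' > urls.txt
import json
for url in json.load(open("map.json"))["results"]:
    print(url)
EOF

# Placeholder action: replace the echo with a real fetch (e.g. curl, or
# a shallow per-URL crawl) once the URL list looks right.
while read -r url; do
  echo "would fetch: $url"
done < urls.txt
```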
## Tips
- Always use `chunks_per_source` for agentic workflows - prevents context explosion when feeding results to LLMs
- Omit `chunks_per_source` only for data collection - when saving full pages to files
- Start conservative (`max_depth=1`, `limit=20`) and scale up
- Use path patterns to focus on relevant sections
- Use Map first to understand site structure before full crawl
- **Always set a `limit`** to prevent runaway crawls