# Crawl Skill

Crawl websites to extract content from multiple pages. Ideal for documentation, knowledge bases, and site-wide content extraction.

## Prerequisites

**Tavily API Key required** - get your key at https://tavily.com and add it to `~/.claude/settings.json`:

```json
{
  "env": {
    "TAVILY_API_KEY": "tvly-your-api-key-here"
  }
}
```

## Quick Start

### Using the Script

```bash
./scripts/crawl.sh '<json>' [output_dir]
```

Examples:

**Basic crawl:**

```bash
./scripts/crawl.sh '{"url": "https://docs.example.com"}'
```

**Deeper crawl with limits:**

```bash
./scripts/crawl.sh '{"url": "https://docs.example.com", "max_depth": 2, "limit": 50}'
```

**Save to files:**

```bash
./scripts/crawl.sh '{"url": "https://docs.example.com", "max_depth": 2}' ./docs
```

**Focused crawl with path filters:**

```bash
./scripts/crawl.sh '{"url": "https://example.com", "max_depth": 2, "select_paths": ["/docs/.*", "/api/.*"], "exclude_paths": ["/blog/.*"]}'
```

**With semantic instructions (for agentic use):**

```bash
./scripts/crawl.sh '{"url": "https://docs.example.com", "instructions": "Find API documentation", "chunks_per_source": 3}'
```

When `output_dir` is provided, each crawled page is saved as a separate markdown file.
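For reference, the save-to-files step can be sketched in a few lines of Python. This is an illustrative sketch of what the script does with `output_dir`, not its actual implementation; the response shape follows the Response Format section, and the filename scheme (a slug derived from the URL path) is an assumption of this sketch.

```python
import re
from pathlib import Path


def save_crawl_results(response: dict, output_dir: str) -> list:
    """Write each crawled page to its own markdown file.

    `response` follows the crawl API's response shape:
    {"base_url": ..., "results": [{"url": ..., "raw_content": ...}]}.
    The filename scheme (URL-path slug) is an illustrative choice,
    not what the bundled script necessarily does.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for page in response.get("results", []):
        # Derive a filename from the URL path, e.g. /api/auth -> api-auth.md
        parts = page["url"].split("://", 1)[-1].split("/", 1)
        raw = parts[1] if len(parts) > 1 else "index"
        slug = re.sub(r"[^a-zA-Z0-9]+", "-", raw).strip("-") or "index"
        target = out / f"{slug}.md"
        target.write_text(page["raw_content"], encoding="utf-8")
        written.append(str(target))
    return written
```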

## Basic Crawl

```bash
curl --request POST \
  --url https://api.tavily.com/crawl \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://docs.example.com",
    "max_depth": 1,
    "limit": 20
  }'
```

## Focused Crawl with Instructions

```bash
curl --request POST \
  --url https://api.tavily.com/crawl \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://docs.example.com",
    "max_depth": 2,
    "instructions": "Find API documentation and code examples",
    "chunks_per_source": 3,
    "select_paths": ["/docs/.*", "/api/.*"]
  }'
```

## API Reference

### Endpoint

```
POST https://api.tavily.com/crawl
```

### Headers

| Header | Value |
|--------|-------|
| `Authorization` | `Bearer <TAVILY_API_KEY>` |
| `Content-Type` | `application/json` |

### Request Body

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `url` | string | Required | Root URL to begin crawling |
| `max_depth` | integer | 1 | Levels deep to crawl (1-5) |
| `max_breadth` | integer | 20 | Links followed per page |
| `limit` | integer | 50 | Total pages cap |
| `instructions` | string | null | Natural-language guidance for focus |
| `chunks_per_source` | integer | 3 | Chunks per page (1-5, requires `instructions`) |
| `extract_depth` | string | `"basic"` | `basic` or `advanced` |
| `format` | string | `"markdown"` | `markdown` or `text` |
| `select_paths` | array | null | Regex patterns to include |
| `exclude_paths` | array | null | Regex patterns to exclude |
| `allow_external` | boolean | true | Include external domain links |
| `timeout` | float | 150 | Max wait (10-150 seconds) |
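The path filters are ordinary regular expressions matched against the URL path. To make their semantics concrete, here is an illustrative client-side version in Python; the matching rule assumed here (keep a path that matches at least one `select_paths` pattern and no `exclude_paths` pattern) mirrors the examples in this document, but it is a reading of the behavior, not the API's documented algorithm.

```python
import re


def keep_path(path: str, select_paths=None, exclude_paths=None) -> bool:
    """Illustrative client-side version of the path filters.

    Assumed rule: a path is kept if it matches at least one select
    pattern (or select_paths is unset) and matches no exclude pattern.
    """
    if select_paths and not any(re.search(p, path) for p in select_paths):
        return False
    if exclude_paths and any(re.search(p, path) for p in exclude_paths):
        return False
    return True


# /blog/tag/python is dropped even though it matches /blog/.*,
# because the exclude pattern wins.
keep_path("/blog/tag/python", ["/blog/.*"], ["/blog/tag/.*"])  # False
keep_path("/docs/api", ["/docs/.*", "/api/.*"])                # True
```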

## Response Format

```json
{
  "base_url": "https://docs.example.com",
  "results": [
    {
      "url": "https://docs.example.com/page",
      "raw_content": "# Page Title\n\nContent..."
    }
  ],
  "response_time": 12.5
}
```

## Depth vs Performance

| Depth | Typical Pages | Time |
|-------|---------------|------|
| 1 | 10-50 | Seconds |
| 2 | 50-500 | Minutes |
| 3 | 500-5000 | Many minutes |

Start with `max_depth=1` and increase only if needed.
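The page counts above follow from breadth-first growth: each level can multiply the page count by up to `max_breadth`, until `limit` caps it. A back-of-the-envelope upper bound can be sketched as follows; real sites link far less uniformly, so actual counts are usually much lower.

```python
def max_pages(max_depth: int, max_breadth: int = 20, limit: int = 50) -> int:
    """Upper bound on pages a crawl can visit.

    Sums max_breadth**d over depths 1..max_depth (plus the root page),
    then applies the hard `limit` cap. A rough estimate, not a guarantee
    of what the API will return.
    """
    total = 1 + sum(max_breadth ** d for d in range(1, max_depth + 1))
    return min(total, limit)


max_pages(1)                 # 21: defaults stay under the limit
max_pages(2)                 # 50: the limit cap kicks in
max_pages(3, limit=10_000)   # 8421: why depth 3 takes many minutes
```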

## Crawl for Context vs Data Collection

**For agentic use (feeding results into context):** always use `instructions` + `chunks_per_source`. This returns only relevant chunks instead of full pages, preventing context-window explosion.

**For data collection (saving to files):** omit `chunks_per_source` to get full page content.

## Examples

### For Context: Agentic Research (Recommended)

Use when feeding crawl results into an LLM context:

```bash
curl --request POST \
  --url https://api.tavily.com/crawl \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://docs.example.com",
    "max_depth": 2,
    "instructions": "Find API documentation and authentication guides",
    "chunks_per_source": 3
  }'
```

Returns only the most relevant chunks (max 500 chars each) per page, so the results fit in context without overwhelming it.

### For Context: Targeted Technical Docs

```bash
curl --request POST \
  --url https://api.tavily.com/crawl \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://example.com",
    "max_depth": 2,
    "instructions": "Find all documentation about authentication and security",
    "chunks_per_source": 3,
    "select_paths": ["/docs/.*", "/api/.*"]
  }'
```

### For Data Collection: Full Page Archive

Use when saving content to files for later processing:

```bash
curl --request POST \
  --url https://api.tavily.com/crawl \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://example.com/blog",
    "max_depth": 2,
    "max_breadth": 50,
    "select_paths": ["/blog/.*"],
    "exclude_paths": ["/blog/tag/.*", "/blog/category/.*"]
  }'
```

Returns full page content - use the script with `output_dir` to save as markdown files.

## Map API (URL Discovery)

Use `map` instead of `crawl` when you only need URLs, not content:

```bash
curl --request POST \
  --url https://api.tavily.com/map \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://docs.example.com",
    "max_depth": 2,
    "instructions": "Find all API docs and guides"
  }'
```

Returns URLs only (faster than crawl):

```json
{
  "base_url": "https://docs.example.com",
  "results": [
    "https://docs.example.com/api/auth",
    "https://docs.example.com/guides/quickstart"
  ]
}
```
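One way to act on Map output is to group the discovered URLs by top-level path segment and turn frequent groups into `select_paths` patterns for the follow-up crawl. The grouping heuristic below is an illustration, not an API feature:

```python
from collections import Counter
from urllib.parse import urlparse


def suggest_select_paths(map_results: list, min_count: int = 2) -> list:
    """Turn Map output (a list of URLs) into candidate select_paths.

    Counts URLs per top-level path segment and emits a regex for each
    segment seen at least `min_count` times. A heuristic sketch only.
    """
    counts = Counter()
    for url in map_results:
        segments = urlparse(url).path.strip("/").split("/")
        if segments and segments[0]:
            counts[segments[0]] += 1
    return [f"/{seg}/.*" for seg, n in counts.most_common() if n >= min_count]


urls = [
    "https://docs.example.com/api/auth",
    "https://docs.example.com/api/crawl",
    "https://docs.example.com/guides/quickstart",
]
suggest_select_paths(urls)  # ['/api/.*']
```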

## Tips

- **Always use `chunks_per_source` for agentic workflows** - prevents context explosion when feeding results to LLMs
- **Omit `chunks_per_source` only for data collection** - when saving full pages to files
- Start conservative (`max_depth=1`, `limit=20`) and scale up
- Use path patterns to focus on relevant sections
- Use Map first to understand site structure before a full crawl
- **Always set a `limit`** to prevent runaway crawls