# Crawl Skill
Crawl websites to extract content from multiple pages. Ideal for documentation, knowledge bases, and site-wide content extraction.
## Prerequisites
**Tavily API Key Required** - Get your key at https://tavily.com

Add to `~/.claude/settings.json`:

```json
{
  "env": {
    "TAVILY_API_KEY": "tvly-your-api-key-here"
  }
}
```
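To confirm the configuration is well-formed before first use, a quick sanity check can be run against the file. This is a minimal sketch, not part of the skill: it writes and parses a sample copy here; point `json.load` at your real `~/.claude/settings.json` in practice.

```shell
# Sketch: verify the settings file parses as JSON and carries the key.
# A sample file is written locally; use ~/.claude/settings.json in practice.
cat > settings.sample.json <<'EOF'
{
  "env": {
    "TAVILY_API_KEY": "tvly-your-api-key-here"
  }
}
EOF

python3 - <<'EOF'
import json

cfg = json.load(open("settings.sample.json"))
key = cfg.get("env", {}).get("TAVILY_API_KEY", "")
# Tavily keys start with "tvly-"
print("key present" if key.startswith("tvly-") else "key missing")
EOF
```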
## Quick Start
### Using the Script
```bash
./scripts/crawl.sh '<json>' [output_dir]
```

Examples:

```bash
# Basic crawl
./scripts/crawl.sh '{"url": "https://docs.example.com"}'

# Deeper crawl with limits
./scripts/crawl.sh '{"url": "https://docs.example.com", "max_depth": 2, "limit": 50}'

# Save to files
./scripts/crawl.sh '{"url": "https://docs.example.com", "max_depth": 2}' ./docs

# Focused crawl with path filters
./scripts/crawl.sh '{"url": "https://example.com", "max_depth": 2, "select_paths": ["/docs/.*", "/api/.*"], "exclude_paths": ["/blog/.*"]}'

# With semantic instructions (for agentic use)
./scripts/crawl.sh '{"url": "https://docs.example.com", "instructions": "Find API documentation", "chunks_per_source": 3}'
```

When `output_dir` is provided, each crawled page is saved as a separate markdown file.
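The script takes its payload as a single JSON string, so a quoting mistake only surfaces once the request reaches the API. A hypothetical pre-flight check (not part of the skill itself) can validate the payload locally first:

```shell
# Hypothetical pre-flight check: validate the JSON argument before
# handing it to crawl.sh, so quoting mistakes surface locally.
payload='{"url": "https://docs.example.com", "max_depth": 2, "limit": 50}'

if echo "$payload" | python3 -m json.tool > /dev/null 2>&1; then
  echo "payload OK"
  # ./scripts/crawl.sh "$payload"   # uncomment to actually crawl
else
  echo "payload invalid - check your quoting" >&2
  exit 1
fi
```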
## Basic Crawl
```bash
curl --request POST \
  --url https://api.tavily.com/crawl \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://docs.example.com",
    "max_depth": 1,
    "limit": 20
  }'
```

## Focused Crawl with Instructions
```bash
curl --request POST \
  --url https://api.tavily.com/crawl \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://docs.example.com",
    "max_depth": 2,
    "instructions": "Find API documentation and code examples",
    "chunks_per_source": 3,
    "select_paths": ["/docs/.*", "/api/.*"]
  }'
```

## API Reference

### Endpoint

```
POST https://api.tavily.com/crawl
```

### Headers
| Header | Value |
|---|---|
| `Authorization` | `Bearer $TAVILY_API_KEY` |
| `Content-Type` | `application/json` |

### Request Body
| Field | Type | Default | Description |
|---|---|---|---|
| `url` | string | Required | Root URL to begin crawling |
| `max_depth` | integer | 1 | Levels deep to crawl (1-5) |
| `max_breadth` | integer | 20 | Links per page |
| `limit` | integer | 50 | Total pages cap |
| `instructions` | string | null | Natural language guidance for focus |
| `chunks_per_source` | integer | 3 | Chunks per page (1-5, requires instructions) |
| `select_paths` | array | null | Regex patterns to include |
| `exclude_paths` | array | null | Regex patterns to exclude |
| `allow_external` | boolean | true | Include external domain links |
| `timeout` | float | 150 | Max wait (10-150 seconds) |

### Response Format
```json
{
  "base_url": "https://docs.example.com",
  "results": [
    {
      "url": "https://docs.example.com/page",
      "raw_content": "# Page Title\n\nContent..."
    }
  ],
  "response_time": 12.5
}
```
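As a post-processing sketch (a sample response is inlined below; in practice, redirect the `curl` output to `response.json` first), the crawled page URLs can be pulled out of `results`:

```shell
# Sketch: list the crawled page URLs from a saved response.
# A sample response matching the format above is written locally;
# in practice, redirect the curl output to response.json first.
cat > response.json <<'EOF'
{
  "base_url": "https://docs.example.com",
  "results": [
    {"url": "https://docs.example.com/page", "raw_content": "# Page Title\n\nContent..."}
  ],
  "response_time": 12.5
}
EOF

python3 - <<'EOF'
import json

data = json.load(open("response.json"))
for result in data["results"]:
    print(result["url"])
EOF
```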
## Depth vs Performance
| Depth | Typical Pages | Time |
|---|---|---|
| 1 | 10-50 | Seconds |
| 2 | 50-500 | Minutes |
| 3 | 500-5000 | Many minutes |

Start with `max_depth=1` and increase only if needed.

## Crawl for Context vs Data Collection
**For agentic use (feeding results into context):** Always use `instructions` + `chunks_per_source`. This returns only relevant chunks instead of full pages, preventing context window explosion.

**For data collection (saving to files):** Omit `chunks_per_source` to get full page content.

## Examples

### For Context: Agentic Research (Recommended)
Use when feeding crawl results into an LLM context:

```bash
curl --request POST \
  --url https://api.tavily.com/crawl \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://docs.example.com",
    "max_depth": 2,
    "instructions": "Find API documentation and authentication guides",
    "chunks_per_source": 3
  }'
```

Returns only the most relevant chunks (max 500 chars each) per page - fits in context without overwhelming it.

### For Context: Targeted Technical Docs
```bash
curl --request POST \
  --url https://api.tavily.com/crawl \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://example.com",
    "max_depth": 2,
    "instructions": "Find all documentation about authentication and security",
    "chunks_per_source": 3,
    "select_paths": ["/docs/.*", "/api/.*"]
  }'
```

### For Data Collection: Full Page Archive
Use when saving content to files for later processing:

```bash
curl --request POST \
  --url https://api.tavily.com/crawl \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://example.com/blog",
    "max_depth": 2,
    "max_breadth": 50,
    "select_paths": ["/blog/.*"],
    "exclude_paths": ["/blog/tag/.*", "/blog/category/.*"]
  }'
```

Returns full page content - use the script with `output_dir` to save as markdown files.

## Map API (URL Discovery)
Use `map` instead of `crawl` when you only need URLs, not content:

```bash
curl --request POST \
  --url https://api.tavily.com/map \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://docs.example.com",
    "max_depth": 2,
    "instructions": "Find all API docs and guides"
  }'
```

Returns URLs only (faster than crawl):

```json
{
  "base_url": "https://docs.example.com",
  "results": [
    "https://docs.example.com/api/auth",
    "https://docs.example.com/guides/quickstart"
  ]
}
```
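One way to combine the two endpoints is to map first and act on each discovered URL afterwards. This is a sketch only: a sample Map response is inlined, and the per-URL action is a placeholder `echo` to swap for a real fetch.

```shell
# Sketch: iterate over the URLs from a saved Map response (map.json).
# The sample mirrors the Map response shape; in practice, redirect the
# curl output from /map to map.json first.
cat > map.json <<'EOF'
{
  "base_url": "https://docs.example.com",
  "results": [
    "https://docs.example.com/api/auth",
    "https://docs.example.com/guides/quickstart"
  ]
}
EOF

python3 - <<'EOF' > urls.txt
import json
for url in json.load(open("map.json"))["results"]:
    print(url)
EOF

# Placeholder action: replace the echo with a real fetch (e.g. curl, or
# a shallow per-URL crawl) once the URL list looks right.
while read -r url; do
  echo "would fetch: $url"
done < urls.txt
```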
## Tips
- Always use `chunks_per_source` for agentic workflows - prevents context explosion when feeding results to LLMs
- Omit `chunks_per_source` only for data collection - when saving full pages to files
- Start conservative (`max_depth=1`, `limit=20`) and scale up
- Use path patterns to focus on relevant sections
- Use Map first to understand site structure before full crawl
- **Always set a `limit`** to prevent runaway crawls