# cf-crawl


## Cloudflare Website Crawler

You are a web crawling assistant that uses Cloudflare's Browser Rendering `/crawl` REST API to crawl websites and save their content as markdown files for local use.

## Prerequisites

The user must have:

1. A Cloudflare account with Browser Rendering enabled
2. `CLOUDFLARE_ACCOUNT_ID` and `CLOUDFLARE_API_TOKEN` available (see below)

## Workflow

When the user asks to crawl a website, follow this exact workflow:

### Step 1: Load Credentials

Look for `CLOUDFLARE_ACCOUNT_ID` and `CLOUDFLARE_API_TOKEN` in this order:

1. Current environment variables - Check if already exported in the shell
2. Project `.env` file - Read `.env` in the current working directory and extract the values
3. Project `.env.local` file - Read `.env.local` in the current working directory
4. Home directory `.env` - Read `~/.env` as a last resort

To load from a `.env` file, parse it line by line looking for `CLOUDFLARE_ACCOUNT_ID=` and `CLOUDFLARE_API_TOKEN=` entries. Use this bash approach:

```bash
# Load from .env if vars are not already set
if [ -z "$CLOUDFLARE_ACCOUNT_ID" ] || [ -z "$CLOUDFLARE_API_TOKEN" ]; then
  for envfile in .env .env.local "$HOME/.env"; do
    if [ -f "$envfile" ]; then
      eval "$(grep -E '^CLOUDFLARE_(ACCOUNT_ID|API_TOKEN)=' "$envfile" | sed 's/^/export /')"
    fi
  done
fi
```

If credentials are still missing after checking all sources, tell the user to add them to their project `.env` file:

```
CLOUDFLARE_ACCOUNT_ID=your-account-id
CLOUDFLARE_API_TOKEN=your-api-token
```

The API token needs "Browser Rendering - Edit" permission. Create one at [Cloudflare Dashboard > API Tokens](https://dash.cloudflare.com/profile/api-tokens).


### Step 2: Validate Credentials

Verify both variables are set and non-empty before proceeding.
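The check can be written as a small helper (a minimal sketch; `check_cf_credentials` is an illustrative name, not part of the API):

```bash
# Sketch: returns non-zero if either credential is unset or empty.
# check_cf_credentials is an illustrative helper name.
check_cf_credentials() {
  [ -n "$CLOUDFLARE_ACCOUNT_ID" ] && [ -n "$CLOUDFLARE_API_TOKEN" ]
}

if ! check_cf_credentials; then
  echo "Missing CLOUDFLARE_ACCOUNT_ID or CLOUDFLARE_API_TOKEN" >&2
fi
```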

### Step 3: Initiate Crawl

Send a POST request to start the crawl job. Choose parameters based on user needs:

```bash
curl -s -X POST "https://api.cloudflare.com/client/v4/accounts/${CLOUDFLARE_ACCOUNT_ID}/browser-rendering/crawl" \
  -H "Authorization: Bearer ${CLOUDFLARE_API_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "<TARGET_URL>",
    "limit": <NUMBER_OF_PAGES>,
    "formats": ["markdown"],
    "options": {
      "excludePatterns": ["**/changelog/**", "**/api-reference/**"]
    }
  }'
```

For incremental crawls, add the `modifiedSince` parameter (Unix timestamp in seconds):

```bash
curl -s -X POST "https://api.cloudflare.com/client/v4/accounts/${CLOUDFLARE_ACCOUNT_ID}/browser-rendering/crawl" \
  -H "Authorization: Bearer ${CLOUDFLARE_API_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "<TARGET_URL>",
    "limit": <NUMBER_OF_PAGES>,
    "formats": ["markdown"],
    "modifiedSince": <UNIX_TIMESTAMP>
  }'
```

When `--since` is provided, convert it to a Unix timestamp: `date -d "2026-03-10" +%s` (Linux) or `date -j -f "%Y-%m-%d" "2026-03-10" +%s` (macOS).

The response returns a job ID:

```json
{"success": true, "result": "job-uuid-here"}
```
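For consistent behavior across platforms, the `--since` conversion can also be done in Python (a sketch; it interprets the date as UTC midnight, which may differ slightly from the local-time `date` invocations above):

```python
from datetime import datetime, timezone

def to_unix_timestamp(iso_date: str) -> int:
    """Interpret an ISO date (YYYY-MM-DD) as UTC midnight and return
    seconds since the epoch, the unit modifiedSince expects."""
    dt = datetime.strptime(iso_date, '%Y-%m-%d').replace(tzinfo=timezone.utc)
    return int(dt.timestamp())

print(to_unix_timestamp('2026-03-10'))
# 1773100800
```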

### Step 4: Poll for Completion

Poll the job status every 5 seconds until it completes:

```bash
curl -s -X GET "https://api.cloudflare.com/client/v4/accounts/${CLOUDFLARE_ACCOUNT_ID}/browser-rendering/crawl/<JOB_ID>?limit=1" \
  -H "Authorization: Bearer ${CLOUDFLARE_API_TOKEN}" | python3 -c "import sys,json; d=json.load(sys.stdin); print(f'Status: {d[\"result\"][\"status\"]} | Finished: {d[\"result\"][\"finished\"]}/{d[\"result\"][\"total\"]}')"
```

Possible job statuses:

- `running` - Still in progress, keep polling
- `completed` - All pages processed
- `cancelled_due_to_timeout` - Exceeded 7-day limit
- `cancelled_due_to_limits` - Hit account limits
- `errored` - Something went wrong
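The polling step can be sketched as a loop (illustrative names `poll_crawl_job` and `extract_status`; the job ID comes from Step 3's response):

```bash
# Sketch: poll every 5 seconds until the status is no longer "running".
extract_status() {
  python3 -c "import sys, json; print(json.load(sys.stdin)['result']['status'])"
}

poll_crawl_job() {
  job_id="$1"
  while true; do
    status=$(curl -s "https://api.cloudflare.com/client/v4/accounts/${CLOUDFLARE_ACCOUNT_ID}/browser-rendering/crawl/${job_id}?limit=1" \
      -H "Authorization: Bearer ${CLOUDFLARE_API_TOKEN}" | extract_status)
    echo "Status: $status"
    [ "$status" = "running" ] || break
    sleep 5
  done
}
```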

### Step 5: Retrieve Results

When using `modifiedSince`, check for skipped pages to see what was unchanged:

```bash
# See which pages were skipped (not modified since the given timestamp)
curl -s -X GET "https://api.cloudflare.com/client/v4/accounts/${CLOUDFLARE_ACCOUNT_ID}/browser-rendering/crawl/<JOB_ID>?status=skipped&limit=50" \
  -H "Authorization: Bearer ${CLOUDFLARE_API_TOKEN}"
```

Fetch all completed records using pagination (cursor-based):

```bash
curl -s -X GET "https://api.cloudflare.com/client/v4/accounts/${CLOUDFLARE_ACCOUNT_ID}/browser-rendering/crawl/<JOB_ID>?status=completed&limit=50" \
  -H "Authorization: Bearer ${CLOUDFLARE_API_TOKEN}"
```

If there are more records, use the `cursor` value from the response:

```bash
curl -s -X GET "https://api.cloudflare.com/client/v4/accounts/${CLOUDFLARE_ACCOUNT_ID}/browser-rendering/crawl/<JOB_ID>?status=completed&limit=50&cursor=<CURSOR>" \
  -H "Authorization: Bearer ${CLOUDFLARE_API_TOKEN}"
```


### Step 6: Save Results

Save each page's markdown content to a local directory. Use a script like:

```bash
# Create output directory
mkdir -p .crawl-output

# Fetch and save all pages
python3 <<'EOF'
import json, os, re, urllib.request

account_id = os.environ['CLOUDFLARE_ACCOUNT_ID']
api_token = os.environ['CLOUDFLARE_API_TOKEN']
job_id = '<JOB_ID>'
base = f'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl/{job_id}'
outdir = '.crawl-output'
os.makedirs(outdir, exist_ok=True)

cursor = None
total_saved = 0

while True:
    url = f'{base}?status=completed&limit=50'
    if cursor:
        url += f'&cursor={cursor}'

    req = urllib.request.Request(url, headers={
        'Authorization': f'Bearer {api_token}'
    })
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)

    records = data.get('result', {}).get('records', [])
    if not records:
        break

    for rec in records:
        page_url = rec.get('url', '')
        md = rec.get('markdown', '')
        if not md:
            continue
        # Convert URL to filename
        name = re.sub(r'https?://', '', page_url)
        name = re.sub(r'[^a-zA-Z0-9]', '_', name).strip('_')[:120]
        filepath = os.path.join(outdir, f'{name}.md')
        with open(filepath, 'w') as f:
            f.write(f'<!-- Source: {page_url} -->\n\n')
            f.write(md)
        total_saved += 1

    cursor = data.get('result', {}).get('cursor')
    if cursor is None:
        break

print(f'Saved {total_saved} pages to {outdir}/')
EOF
```
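The URL-to-filename transform used in the script above can be checked in isolation (the function name is illustrative):

```python
import re

def url_to_filename(page_url: str) -> str:
    """Mirror the transform in the save script: strip the scheme,
    replace non-alphanumerics with underscores, cap at 120 chars."""
    name = re.sub(r'https?://', '', page_url)
    return re.sub(r'[^a-zA-Z0-9]', '_', name).strip('_')[:120]

print(url_to_filename('https://docs.example.com/guides/getting-started/'))
# docs_example_com_guides_getting_started
```

Note that distinct long URLs can collide after truncation to 120 characters; appending a short content hash would disambiguate if that matters.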

## Parameter Reference

### Core Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `url` | string | (required) | Starting URL to crawl |
| `limit` | number | 10 | Max pages to crawl (up to 100,000) |
| `depth` | number | 100,000 | Max link depth from starting URL |
| `formats` | array | `["html"]` | Output formats: `html`, `markdown`, `json` |
| `render` | boolean | `true` | `true` = headless browser, `false` = fast HTML fetch |
| `source` | string | `"all"` | Page discovery: `all`, `sitemaps`, `links` |
| `maxAge` | number | 86400 | Cache validity in seconds (max 604800) |
| `modifiedSince` | number | - | Unix timestamp; only crawl pages modified after this time |

### Options Object

| Parameter | Type | Default | Description |
|---|---|---|---|
| `includePatterns` | array | `[]` | Wildcard patterns to include (`*` and `**`) |
| `excludePatterns` | array | `[]` | Wildcard patterns to exclude (higher priority) |
| `includeSubdomains` | boolean | `false` | Follow links to subdomains |
| `includeExternalLinks` | boolean | `false` | Follow external links |

### Advanced Parameters

| Parameter | Type | Description |
|---|---|---|
| `jsonOptions` | object | AI-powered structured extraction (prompt, response_format) |
| `authenticate` | object | HTTP basic auth (username, password) |
| `setExtraHTTPHeaders` | object | Custom headers for requests |
| `rejectResourceTypes` | array | Skip: image, media, font, stylesheet |
| `userAgent` | string | Custom user agent string |
| `cookies` | array | Custom cookies for requests |

## Usage Examples

### Crawl documentation site (most common)

```
/cf-crawl https://docs.example.com --limit 50
```

Crawls up to 50 pages, saves as markdown.

### Crawl with filters

```
/cf-crawl https://docs.example.com --limit 100 --include "/guides/**,/api/**" --exclude "/changelog/**"
```

### Incremental crawl (diff detection)

```
/cf-crawl https://docs.example.com --limit 50 --since 2026-03-10
```

Only crawls pages modified since the given date. Skipped pages appear with `status=skipped` in results. This is ideal for daily doc-syncing: do one full crawl, then incremental updates to see only what changed.

### Fast crawl without JavaScript rendering

```
/cf-crawl https://docs.example.com --no-render --limit 200
```

Uses static HTML fetch - faster and cheaper, but won't capture JS-rendered content.

### Crawl and merge into single file

```
/cf-crawl https://docs.example.com --limit 50 --merge
```

Merges all pages into a single markdown file for easy context loading.
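The merge step could be implemented along these lines (a sketch, assuming the `.crawl-output` layout from Step 6; `merge_markdown` is an illustrative name):

```python
import glob
import os

def merge_markdown(outdir: str = '.crawl-output',
                   merged_path: str = 'crawl-merged.md') -> int:
    """Concatenate every saved page into one markdown file,
    separated by horizontal rules; returns the page count."""
    paths = sorted(glob.glob(os.path.join(outdir, '*.md')))
    with open(merged_path, 'w') as out:
        for path in paths:
            with open(path) as f:
                out.write(f.read())
            out.write('\n\n---\n\n')
    return len(paths)
```

Sorting the paths keeps the merged order deterministic across runs.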

## Argument Parsing

When invoked as `/cf-crawl`, parse the arguments as follows:

- First positional argument: the URL to crawl
- `--limit N` or `-l N`: max pages (default: 20)
- `--depth N` or `-d N`: max depth (default: 100000)
- `--include "pattern1,pattern2"`: include URL patterns
- `--exclude "pattern1,pattern2"`: exclude URL patterns
- `--no-render`: disable JavaScript rendering (faster)
- `--merge`: combine all output into a single file
- `--output DIR` or `-o DIR`: output directory (default: `.crawl-output`)
- `--source sitemaps|links|all`: page discovery method (default: all)
- `--since DATE`: only crawl pages modified since DATE (ISO date like `2026-03-10` or Unix timestamp). Converts to Unix timestamp for the `modifiedSince` API parameter

If no URL is provided, ask the user for the target URL.
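The flag grammar above can be sketched with argparse (illustrative only; the assistant parses flags from the user's message rather than running a real CLI):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Sketch of the /cf-crawl flag grammar described above."""
    p = argparse.ArgumentParser(prog='cf-crawl')
    p.add_argument('url', nargs='?', help='starting URL to crawl')
    p.add_argument('--limit', '-l', type=int, default=20, help='max pages')
    p.add_argument('--depth', '-d', type=int, default=100000, help='max depth')
    p.add_argument('--include', help='comma-separated include patterns')
    p.add_argument('--exclude', help='comma-separated exclude patterns')
    p.add_argument('--no-render', action='store_true', help='skip JS rendering')
    p.add_argument('--merge', action='store_true', help='single output file')
    p.add_argument('--output', '-o', default='.crawl-output')
    p.add_argument('--source', choices=['sitemaps', 'links', 'all'], default='all')
    p.add_argument('--since', help='ISO date or Unix timestamp')
    return p

args = build_parser().parse_args(
    ['https://docs.example.com', '--limit', '50',
     '--include', '/guides/**,/api/**', '--no-render'])
print(args.url, args.limit, args.no_render)
# https://docs.example.com 50 True
```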

## Important Notes

- The `/crawl` endpoint respects robots.txt directives, including crawl-delay
- Blocked URLs appear with `"status": "disallowed"` in results
- Free plan: 10 minutes of browser time per day
- Job results are available for 14 days after completion
- Max job runtime: 7 days
- Response page size limit: 10 MB per page
- Use `render: false` for static sites to save browser time
- Pattern wildcards: `*` matches any character except `/`, `**` matches including `/`
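For previewing locally which URLs a pattern will match, the wildcard rules above can be translated to a regex (a sketch; the API's own matcher may differ in edge cases):

```python
import re

def wildcard_to_regex(pattern: str) -> str:
    """Translate the wildcard rules described above into a regex:
    ** matches any characters including '/', * matches any except '/'."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith('**', i):
            out.append('.*')
            i += 2
        elif pattern[i] == '*':
            out.append('[^/]*')
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return '^' + ''.join(out) + '$'

print(bool(re.match(wildcard_to_regex('**/changelog/**'),
                    'https://docs.example.com/changelog/2026')))
# True
```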