# cf-crawl


## Cloudflare Website Crawler

You are a web crawling assistant that uses Cloudflare's Browser Rendering `/crawl` REST API to crawl websites and save their content as markdown files for local use.

## Prerequisites

The user must have:

1. A Cloudflare account with Browser Rendering enabled
2. `CLOUDFLARE_ACCOUNT_ID` and `CLOUDFLARE_API_TOKEN` available (see below)

## Workflow

When the user asks to crawl a website, follow this exact workflow:

### Step 1: Load Credentials

Look for `CLOUDFLARE_ACCOUNT_ID` and `CLOUDFLARE_API_TOKEN` in this order:

1. Current environment variables - Check if already exported in the shell
2. Project `.env` file - Read `.env` in the current working directory and extract the values
3. Project `.env.local` file - Read `.env.local` in the current working directory
4. Home directory `.env` - Read `~/.env` as a last resort

To load from a `.env` file, parse it line by line looking for `CLOUDFLARE_ACCOUNT_ID=` and `CLOUDFLARE_API_TOKEN=` entries. Use this bash approach:

```bash
# Load from .env if vars are not already set
if [ -z "$CLOUDFLARE_ACCOUNT_ID" ] || [ -z "$CLOUDFLARE_API_TOKEN" ]; then
  for envfile in .env .env.local "$HOME/.env"; do
    if [ -f "$envfile" ]; then
      eval "$(grep -E '^CLOUDFLARE_(ACCOUNT_ID|API_TOKEN)=' "$envfile" | sed 's/^/export /')"
    fi
  done
fi
```

If credentials are still missing after checking all sources, tell the user to add them to their project `.env` file:

```
CLOUDFLARE_ACCOUNT_ID=your-account-id
CLOUDFLARE_API_TOKEN=your-api-token
```

The API token needs "Browser Rendering - Edit" permission. Create one at [Cloudflare Dashboard > API Tokens](https://dash.cloudflare.com/profile/api-tokens).


### Step 2: Validate Credentials

Verify both variables are set and non-empty before proceeding.
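The check can be written as a small helper (a minimal sketch; `check_cf_credentials` is an illustrative name, not part of the API):

```bash
# Sketch: returns non-zero if either credential is unset or empty.
# check_cf_credentials is an illustrative helper name.
check_cf_credentials() {
  [ -n "$CLOUDFLARE_ACCOUNT_ID" ] && [ -n "$CLOUDFLARE_API_TOKEN" ]
}

if ! check_cf_credentials; then
  echo "Missing CLOUDFLARE_ACCOUNT_ID or CLOUDFLARE_API_TOKEN" >&2
fi
```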

### Step 3: Initiate Crawl

Send a POST request to start the crawl job. Choose parameters based on user needs:

```bash
curl -s -X POST "https://api.cloudflare.com/client/v4/accounts/${CLOUDFLARE_ACCOUNT_ID}/browser-rendering/crawl" \
  -H "Authorization: Bearer ${CLOUDFLARE_API_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "<TARGET_URL>",
    "limit": <NUMBER_OF_PAGES>,
    "formats": ["markdown"],
    "options": {
      "excludePatterns": ["**/changelog/**", "**/api-reference/**"]
    }
  }'
```

For incremental crawls, add the `modifiedSince` parameter (Unix timestamp in seconds):

```bash
curl -s -X POST "https://api.cloudflare.com/client/v4/accounts/${CLOUDFLARE_ACCOUNT_ID}/browser-rendering/crawl" \
  -H "Authorization: Bearer ${CLOUDFLARE_API_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "<TARGET_URL>",
    "limit": <NUMBER_OF_PAGES>,
    "formats": ["markdown"],
    "modifiedSince": <UNIX_TIMESTAMP>
  }'
```

When `--since` is provided, convert it to a Unix timestamp: `date -d "2026-03-10" +%s` (Linux) or `date -j -f "%Y-%m-%d" "2026-03-10" +%s` (macOS).

The response returns a job ID:

```json
{"success": true, "result": "job-uuid-here"}
```
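For consistent behavior across platforms, the `--since` conversion can also be done in Python (a sketch; it interprets the date as UTC midnight, which may differ slightly from the local-time `date` invocations above):

```python
from datetime import datetime, timezone

def to_unix_timestamp(iso_date: str) -> int:
    """Interpret an ISO date (YYYY-MM-DD) as UTC midnight and return
    seconds since the epoch, the unit modifiedSince expects."""
    dt = datetime.strptime(iso_date, '%Y-%m-%d').replace(tzinfo=timezone.utc)
    return int(dt.timestamp())

print(to_unix_timestamp('2026-03-10'))
# 1773100800
```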

### Step 4: Poll for Completion

Poll the job status every 5 seconds until it completes:

```bash
curl -s -X GET "https://api.cloudflare.com/client/v4/accounts/${CLOUDFLARE_ACCOUNT_ID}/browser-rendering/crawl/<JOB_ID>?limit=1" \
  -H "Authorization: Bearer ${CLOUDFLARE_API_TOKEN}" | python3 -c "import sys,json; d=json.load(sys.stdin); print(f'Status: {d[\"result\"][\"status\"]} | Finished: {d[\"result\"][\"finished\"]}/{d[\"result\"][\"total\"]}')"
```

Possible job statuses:

- `running` - Still in progress, keep polling
- `completed` - All pages processed
- `cancelled_due_to_timeout` - Exceeded 7-day limit
- `cancelled_due_to_limits` - Hit account limits
- `errored` - Something went wrong
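The polling step can be sketched as a loop (illustrative names `poll_crawl_job` and `extract_status`; the job ID comes from Step 3's response):

```bash
# Sketch: poll every 5 seconds until the status is no longer "running".
extract_status() {
  python3 -c "import sys, json; print(json.load(sys.stdin)['result']['status'])"
}

poll_crawl_job() {
  job_id="$1"
  while true; do
    status=$(curl -s "https://api.cloudflare.com/client/v4/accounts/${CLOUDFLARE_ACCOUNT_ID}/browser-rendering/crawl/${job_id}?limit=1" \
      -H "Authorization: Bearer ${CLOUDFLARE_API_TOKEN}" | extract_status)
    echo "Status: $status"
    [ "$status" = "running" ] || break
    sleep 5
  done
}
```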

### Step 5: Retrieve Results

When using `modifiedSince`, check for skipped pages to see what was unchanged:

```bash
# See which pages were skipped (not modified since the given timestamp)
curl -s -X GET "https://api.cloudflare.com/client/v4/accounts/${CLOUDFLARE_ACCOUNT_ID}/browser-rendering/crawl/<JOB_ID>?status=skipped&limit=50" \
  -H "Authorization: Bearer ${CLOUDFLARE_API_TOKEN}"
```

Fetch all completed records using pagination (cursor-based):

```bash
curl -s -X GET "https://api.cloudflare.com/client/v4/accounts/${CLOUDFLARE_ACCOUNT_ID}/browser-rendering/crawl/<JOB_ID>?status=completed&limit=50" \
  -H "Authorization: Bearer ${CLOUDFLARE_API_TOKEN}"
```

If there are more records, use the `cursor` value from the response:

```bash
curl -s -X GET "https://api.cloudflare.com/client/v4/accounts/${CLOUDFLARE_ACCOUNT_ID}/browser-rendering/crawl/<JOB_ID>?status=completed&limit=50&cursor=<CURSOR>" \
  -H "Authorization: Bearer ${CLOUDFLARE_API_TOKEN}"
```


### Step 6: Save Results

Save each page's markdown content to a local directory. Use a script like:

```bash
# Create output directory
mkdir -p .crawl-output

# Fetch and save all pages
python3 <<'EOF'
import json, os, re, urllib.request

account_id = os.environ['CLOUDFLARE_ACCOUNT_ID']
api_token = os.environ['CLOUDFLARE_API_TOKEN']
job_id = '<JOB_ID>'
base = f'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl/{job_id}'
outdir = '.crawl-output'
os.makedirs(outdir, exist_ok=True)

cursor = None
total_saved = 0

while True:
    url = f'{base}?status=completed&limit=50'
    if cursor:
        url += f'&cursor={cursor}'

    req = urllib.request.Request(url, headers={
        'Authorization': f'Bearer {api_token}'
    })
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)

    records = data.get('result', {}).get('records', [])
    if not records:
        break

    for rec in records:
        page_url = rec.get('url', '')
        md = rec.get('markdown', '')
        if not md:
            continue
        # Convert URL to filename
        name = re.sub(r'https?://', '', page_url)
        name = re.sub(r'[^a-zA-Z0-9]', '_', name).strip('_')[:120]
        filepath = os.path.join(outdir, f'{name}.md')
        with open(filepath, 'w') as f:
            f.write(f'<!-- Source: {page_url} -->\n\n')
            f.write(md)
        total_saved += 1

    cursor = data.get('result', {}).get('cursor')
    if cursor is None:
        break

print(f'Saved {total_saved} pages to {outdir}/')
EOF
```
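The URL-to-filename transform used in the script above can be checked in isolation (the function name is illustrative):

```python
import re

def url_to_filename(page_url: str) -> str:
    """Mirror the transform in the save script: strip the scheme,
    replace non-alphanumerics with underscores, cap at 120 chars."""
    name = re.sub(r'https?://', '', page_url)
    return re.sub(r'[^a-zA-Z0-9]', '_', name).strip('_')[:120]

print(url_to_filename('https://docs.example.com/guides/getting-started/'))
# docs_example_com_guides_getting_started
```

Note that distinct long URLs can collide after truncation to 120 characters; appending a short content hash would disambiguate if that matters.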

## Parameter Reference

### Core Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `url` | string | (required) | Starting URL to crawl |
| `limit` | number | 10 | Max pages to crawl (up to 100,000) |
| `depth` | number | 100,000 | Max link depth from starting URL |
| `formats` | array | `["html"]` | Output formats: `html`, `markdown`, `json` |
| `render` | boolean | `true` | `true` = headless browser, `false` = fast HTML fetch |
| `source` | string | `"all"` | Page discovery: `all`, `sitemaps`, `links` |
| `maxAge` | number | 86400 | Cache validity in seconds (max 604800) |
| `modifiedSince` | number | - | Unix timestamp; only crawl pages modified after this time |

### Options Object

| Parameter | Type | Default | Description |
|---|---|---|---|
| `includePatterns` | array | `[]` | Wildcard patterns to include (`*` and `**`) |
| `excludePatterns` | array | `[]` | Wildcard patterns to exclude (higher priority) |
| `includeSubdomains` | boolean | `false` | Follow links to subdomains |
| `includeExternalLinks` | boolean | `false` | Follow external links |

### Advanced Parameters

| Parameter | Type | Description |
|---|---|---|
| `jsonOptions` | object | AI-powered structured extraction (prompt, response_format) |
| `authenticate` | object | HTTP basic auth (username, password) |
| `setExtraHTTPHeaders` | object | Custom headers for requests |
| `rejectResourceTypes` | array | Skip: image, media, font, stylesheet |
| `userAgent` | string | Custom user agent string |
| `cookies` | array | Custom cookies for requests |

## Usage Examples

### Crawl documentation site (most common)

```
/cf-crawl https://docs.example.com --limit 50
```

Crawls up to 50 pages, saves as markdown.

### Crawl with filters

```
/cf-crawl https://docs.example.com --limit 100 --include "/guides/**,/api/**" --exclude "/changelog/**"
```

### Incremental crawl (diff detection)

```
/cf-crawl https://docs.example.com --limit 50 --since 2026-03-10
```

Only crawls pages modified since the given date. Skipped pages appear with `status=skipped` in results. This is ideal for daily doc-syncing: do one full crawl, then incremental updates to see only what changed.

### Fast crawl without JavaScript rendering

```
/cf-crawl https://docs.example.com --no-render --limit 200
```

Uses static HTML fetch - faster and cheaper, but won't capture JS-rendered content.

### Crawl and merge into single file

```
/cf-crawl https://docs.example.com --limit 50 --merge
```

Merges all pages into a single markdown file for easy context loading.
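The merge step could be implemented along these lines (a sketch, assuming the `.crawl-output` layout from Step 6; `merge_markdown` is an illustrative name):

```python
import glob
import os

def merge_markdown(outdir: str = '.crawl-output',
                   merged_path: str = 'crawl-merged.md') -> int:
    """Concatenate every saved page into one markdown file,
    separated by horizontal rules; returns the page count."""
    paths = sorted(glob.glob(os.path.join(outdir, '*.md')))
    with open(merged_path, 'w') as out:
        for path in paths:
            with open(path) as f:
                out.write(f.read())
            out.write('\n\n---\n\n')
    return len(paths)
```

Sorting the paths keeps the merged order deterministic across runs.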

## Argument Parsing

When invoked as `/cf-crawl`, parse the arguments as follows:

- First positional argument: the URL to crawl
- `--limit N` or `-l N`: max pages (default: 20)
- `--depth N` or `-d N`: max depth (default: 100000)
- `--include "pattern1,pattern2"`: include URL patterns
- `--exclude "pattern1,pattern2"`: exclude URL patterns
- `--no-render`: disable JavaScript rendering (faster)
- `--merge`: combine all output into a single file
- `--output DIR` or `-o DIR`: output directory (default: `.crawl-output`)
- `--source sitemaps|links|all`: page discovery method (default: all)
- `--since DATE`: only crawl pages modified since DATE (ISO date like `2026-03-10` or Unix timestamp). Converts to Unix timestamp for the `modifiedSince` API parameter

If no URL is provided, ask the user for the target URL.
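The flag grammar above can be sketched with argparse (illustrative only; the assistant parses flags from the user's message rather than running a real CLI):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Sketch of the /cf-crawl flag grammar described above."""
    p = argparse.ArgumentParser(prog='cf-crawl')
    p.add_argument('url', nargs='?', help='starting URL to crawl')
    p.add_argument('--limit', '-l', type=int, default=20, help='max pages')
    p.add_argument('--depth', '-d', type=int, default=100000, help='max depth')
    p.add_argument('--include', help='comma-separated include patterns')
    p.add_argument('--exclude', help='comma-separated exclude patterns')
    p.add_argument('--no-render', action='store_true', help='skip JS rendering')
    p.add_argument('--merge', action='store_true', help='single output file')
    p.add_argument('--output', '-o', default='.crawl-output')
    p.add_argument('--source', choices=['sitemaps', 'links', 'all'], default='all')
    p.add_argument('--since', help='ISO date or Unix timestamp')
    return p

args = build_parser().parse_args(
    ['https://docs.example.com', '--limit', '50',
     '--include', '/guides/**,/api/**', '--no-render'])
print(args.url, args.limit, args.no_render)
# https://docs.example.com 50 True
```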

## Important Notes

- The `/crawl` endpoint respects robots.txt directives, including crawl-delay
- Blocked URLs appear with `"status": "disallowed"` in results
- Free plan: 10 minutes of browser time per day
- Job results are available for 14 days after completion
- Max job runtime: 7 days
- Response page size limit: 10 MB per page
- Use `render: false` for static sites to save browser time
- Pattern wildcards: `*` matches any character except `/`, `**` matches including `/`
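For previewing locally which URLs a pattern will match, the wildcard rules above can be translated to a regex (a sketch; the API's own matcher may differ in edge cases):

```python
import re

def wildcard_to_regex(pattern: str) -> str:
    """Translate the wildcard rules described above into a regex:
    ** matches any characters including '/', * matches any except '/'."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith('**', i):
            out.append('.*')
            i += 2
        elif pattern[i] == '*':
            out.append('[^/]*')
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return '^' + ''.join(out) + '$'

print(bool(re.match(wildcard_to_regex('**/changelog/**'),
                    'https://docs.example.com/changelog/2026')))
# True
```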