firecrawl-scraper


Firecrawl Scraper Skill

Trigger Conditions & Endpoint Selection

Choose Firecrawl endpoint based on user intent:
  • scrape: Need to extract content from a single web page (markdown, html, json, screenshot, pdf)
  • crawl: Need to crawl entire website with depth control and path filtering
  • map: Need to quickly get a list of all URLs on a website
  • batch-scrape: Need to scrape multiple URLs in parallel
  • crawl-status: Given a crawl job ID, check crawl progress/results (optional --wait)

Recommended Architecture (Main Skill + Sub-skill)

This skill uses a two-phase architecture:
  1. Main skill (current context): Understand user question → Choose endpoint → Assemble JSON payload
  2. Sub-skill (fork context): Only responsible for HTTP call execution, avoiding conversation history token waste
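
To make the split concrete, here is a minimal sketch of what a wrapper like firecrawl-api.js could do internally: map the command name to an endpoint and assemble the HTTP request from the stdin payload. The base URL, endpoint paths, and the buildRequest helper are illustrative assumptions, not taken from the actual script.

```javascript
// Hypothetical sketch of a firecrawl-api.js-style wrapper.
// Endpoint paths and the base URL are assumptions for illustration.
const BASE = "https://api.firecrawl.dev/v2";

const ENDPOINTS = {
  "scrape": "/scrape",
  "crawl": "/crawl",
  "map": "/map",
  "batch-scrape": "/batch/scrape",
  // crawl-status is a GET on /crawl/{id} and would be handled separately
};

function buildRequest(command, payload, apiKey) {
  const path = ENDPOINTS[command];
  if (!path) throw new Error(`Unknown command: ${command}`);
  return {
    url: BASE + path,
    method: "POST",
    headers: {
      "Authorization": `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(payload),
  };
}

// The real script would read the payload from stdin and pass the built
// request to fetch(); here we only construct it.
const req = buildRequest(
  "scrape",
  { url: "https://example.com", formats: ["markdown"] },
  "fc-test-key"
);
console.log(req.url);
```

Keeping request assembly in the main skill and only the HTTP call in the fork means the (potentially large) response body never enters the main conversation context.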

Execution Method

Use the Task tool to invoke the firecrawl-fetcher sub-skill, passing the command and JSON via stdin:
Task parameters:
- subagent_type: Bash
- description: "Call Firecrawl API"
- prompt: cat <<'JSON' | node .claude/skills/firecrawl-scraper/firecrawl-api.js <scrape|crawl|map|batch-scrape|crawl-status> [--wait]
  { ...payload... }
  JSON

Payload Examples

1) Scrape Single Page

bash
cat <<'JSON' | node .claude/skills/firecrawl-scraper/firecrawl-api.js scrape
{
  "url": "https://example.com",
  "formats": ["markdown", "links"],
  "onlyMainContent": true,
  "includeTags": [],
  "excludeTags": ["nav", "footer"],
  "waitFor": 0,
  "timeout": 30000
}
JSON
Available formats:
  • "markdown", "html", "rawHtml", "links", "images", "summary"
  • {"type": "json", "prompt": "Extract product info", "schema": {...}}
  • {"type": "screenshot", "fullPage": true, "quality": 85}

2) Scrape with Actions (Page Interaction)

bash
cat <<'JSON' | node .claude/skills/firecrawl-scraper/firecrawl-api.js scrape
{
  "url": "https://example.com",
  "formats": ["markdown"],
  "actions": [
    {"type": "wait", "milliseconds": 2000},
    {"type": "click", "selector": "#load-more"},
    {"type": "wait", "milliseconds": 1000},
    {"type": "scroll", "direction": "down", "amount": 500}
  ]
}
JSON
Available actions:
  • wait, click, write, press, scroll, screenshot, scrape, executeJavascript

3) Parse PDF

bash
cat <<'JSON' | node .claude/skills/firecrawl-scraper/firecrawl-api.js scrape
{
  "url": "https://example.com/document.pdf",
  "formats": ["markdown"],
  "parsers": ["pdf"]
}
JSON

4) Extract Structured JSON

bash
cat <<'JSON' | node .claude/skills/firecrawl-scraper/firecrawl-api.js scrape
{
  "url": "https://example.com/product",
  "formats": [
    {
      "type": "json",
      "prompt": "Extract product information",
      "schema": {
        "type": "object",
        "properties": {
          "name": {"type": "string"},
          "price": {"type": "number"},
          "description": {"type": "string"}
        },
        "required": ["name", "price"]
      }
    }
  ]
}
JSON

5) Crawl Entire Website

bash
cat <<'JSON' | node .claude/skills/firecrawl-scraper/firecrawl-api.js crawl
{
  "url": "https://docs.example.com",
  "formats": ["markdown"],
  "includePaths": ["^/docs/.*"],
  "excludePaths": ["^/blog/.*"],
  "maxDiscoveryDepth": 3,
  "limit": 100,
  "allowExternalLinks": false,
  "allowSubdomains": false
}
JSON

5.1) Crawl + Wait for Completion

bash
cat <<'JSON' | node .claude/skills/firecrawl-scraper/firecrawl-api.js crawl --wait
{
  "url": "https://docs.example.com",
  "formats": ["markdown"],
  "limit": 100
}
JSON

6) Map Website URLs

bash
cat <<'JSON' | node .claude/skills/firecrawl-scraper/firecrawl-api.js map
{
  "url": "https://example.com",
  "search": "documentation",
  "limit": 5000
}
JSON

7) Batch Scrape Multiple URLs

bash
cat <<'JSON' | node .claude/skills/firecrawl-scraper/firecrawl-api.js batch-scrape
{
  "urls": [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3"
  ],
  "formats": ["markdown"]
}
JSON

8) Check Crawl Status

bash
node .claude/skills/firecrawl-scraper/firecrawl-api.js crawl-status <crawl-id>
Wait for completion:
bash
node .claude/skills/firecrawl-scraper/firecrawl-api.js crawl-status <crawl-id> --wait

Key Features

Formats

  • markdown: Clean markdown content
  • html: Parsed HTML
  • rawHtml: Original HTML
  • links: All links on page
  • images: All images on page
  • summary: AI-generated summary
  • json: Structured data extraction with schema
  • screenshot: Page screenshot (PNG)

Content Control

  • onlyMainContent: Extract only main content (default: true)
  • includeTags: CSS selectors to include
  • excludeTags: CSS selectors to exclude
  • waitFor: Wait time before scraping (ms)
  • maxAge: Cache duration (default: 48 hours)

Actions (Browser Automation)

  • wait: Wait for specified time
  • click: Click element by selector
  • write: Input text into field
  • press: Press keyboard key
  • scroll: Scroll page
  • executeJavascript: Run custom JS

Crawl Options

  • includePaths: Regex patterns to include
  • excludePaths: Regex patterns to exclude
  • maxDiscoveryDepth: Maximum crawl depth
  • limit: Maximum pages to crawl
  • allowExternalLinks: Follow external links
  • allowSubdomains: Follow subdomains

Environment Variables & API Key

Two ways to configure the API key (priority: environment variable > .env file):
  1. Environment variable: FIRECRAWL_API_KEY
  2. .env file: place it at .claude/skills/firecrawl-scraper/.env; you can copy .env.example as a template

Response Format

All endpoints return JSON with:
  • success: Boolean indicating success
  • data: Extracted content (format depends on endpoint)
  • For crawl: returns a job ID; use crawl-status (or GET /v2/crawl/{id}) to check status
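
As an illustration only (the exact fields inside data are not specified here beyond the format names above), a successful scrape response might look like:

```json
{
  "success": true,
  "data": {
    "markdown": "# Example Domain\n...",
    "links": ["https://www.iana.org/domains/example"]
  }
}
```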