firecrawl-scraper
Firecrawl Scraper Skill
Trigger Conditions & Endpoint Selection
Choose Firecrawl endpoint based on user intent:
- scrape: Need to extract content from a single web page (markdown, html, json, screenshot, pdf)
- crawl: Need to crawl entire website with depth control and path filtering
- map: Need to quickly get a list of all URLs on a website
- batch-scrape: Need to scrape multiple URLs in parallel
- crawl-status: Given a crawl job ID, check crawl progress/results (optional --wait flag)
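The endpoint choice above can be sketched as a dispatch table. This is illustrative only: the intent labels are hypothetical, and in practice the main skill makes this decision from the user's request.

```javascript
// Hypothetical mapping from a coarse intent label to a Firecrawl endpoint.
// Real endpoint selection is done by the main skill, not by code like this.
function chooseEndpoint(intent) {
  const table = {
    "single-page": "scrape",       // extract one page
    "whole-site": "crawl",         // crawl with depth/path control
    "url-list": "map",             // enumerate site URLs
    "many-pages": "batch-scrape",  // parallel scrape of known URLs
    "job-progress": "crawl-status" // poll an existing crawl job
  };
  return table[intent] ?? "scrape"; // default to the single-page case
}
```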
Recommended Architecture (Main Skill + Sub-skill)
This skill uses a two-phase architecture:
- Main skill (current context): Understand user question → Choose endpoint → Assemble JSON payload
- Sub-skill (forked context): responsible only for executing the HTTP call, avoiding token waste from carrying the conversation history
Execution Method
Use the Task tool to invoke the sub-skill firecrawl-fetcher, passing the command and the JSON payload via stdin.

Task parameters:
- subagent_type: Bash
- description: "Call Firecrawl API"
- prompt:

```bash
cat <<'JSON' | node .claude/skills/firecrawl-scraper/firecrawl-api.js <scrape|crawl|map|batch-scrape|crawl-status> [--wait]
{ ...payload... }
JSON
```

Payload Examples
1) Scrape Single Page
```bash
cat <<'JSON' | node .claude/skills/firecrawl-scraper/firecrawl-api.js scrape
{
  "url": "https://example.com",
  "formats": ["markdown", "links"],
  "onlyMainContent": true,
  "includeTags": [],
  "excludeTags": ["nav", "footer"],
  "waitFor": 0,
  "timeout": 30000
}
JSON
```

Available formats:
- "markdown", "html", "rawHtml", "links", "images", "summary"
- {"type": "json", "prompt": "Extract product info", "schema": {...}}
- {"type": "screenshot", "fullPage": true, "quality": 85}
2) Scrape with Actions (Page Interaction)
```bash
cat <<'JSON' | node .claude/skills/firecrawl-scraper/firecrawl-api.js scrape
{
  "url": "https://example.com",
  "formats": ["markdown"],
  "actions": [
    {"type": "wait", "milliseconds": 2000},
    {"type": "click", "selector": "#load-more"},
    {"type": "wait", "milliseconds": 1000},
    {"type": "scroll", "direction": "down", "amount": 500}
  ]
}
JSON
```

Available actions:
- wait, click, write, press, scroll, screenshot, scrape, executeJavascript
3) Parse PDF
```bash
cat <<'JSON' | node .claude/skills/firecrawl-scraper/firecrawl-api.js scrape
{
  "url": "https://example.com/document.pdf",
  "formats": ["markdown"],
  "parsers": ["pdf"]
}
JSON
```

4) Extract Structured JSON
```bash
cat <<'JSON' | node .claude/skills/firecrawl-scraper/firecrawl-api.js scrape
{
  "url": "https://example.com/product",
  "formats": [
    {
      "type": "json",
      "prompt": "Extract product information",
      "schema": {
        "type": "object",
        "properties": {
          "name": {"type": "string"},
          "price": {"type": "number"},
          "description": {"type": "string"}
        },
        "required": ["name", "price"]
      }
    }
  ]
}
JSON
```

5) Crawl Entire Website
```bash
cat <<'JSON' | node .claude/skills/firecrawl-scraper/firecrawl-api.js crawl
{
  "url": "https://docs.example.com",
  "formats": ["markdown"],
  "includePaths": ["^/docs/.*"],
  "excludePaths": ["^/blog/.*"],
  "maxDiscoveryDepth": 3,
  "limit": 100,
  "allowExternalLinks": false,
  "allowSubdomains": false
}
JSON
```

5.1) Crawl + Wait for Completion
```bash
cat <<'JSON' | node .claude/skills/firecrawl-scraper/firecrawl-api.js crawl --wait
{
  "url": "https://docs.example.com",
  "formats": ["markdown"],
  "limit": 100
}
JSON
```

6) Map Website URLs
```bash
cat <<'JSON' | node .claude/skills/firecrawl-scraper/firecrawl-api.js map
{
  "url": "https://example.com",
  "search": "documentation",
  "limit": 5000
}
JSON
```

7) Batch Scrape Multiple URLs
```bash
cat <<'JSON' | node .claude/skills/firecrawl-scraper/firecrawl-api.js batch-scrape
{
  "urls": [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3"
  ],
  "formats": ["markdown"]
}
JSON
```

8) Check Crawl Status
```bash
node .claude/skills/firecrawl-scraper/firecrawl-api.js crawl-status <crawl-id>
```

Wait for completion:

```bash
node .claude/skills/firecrawl-scraper/firecrawl-api.js crawl-status <crawl-id> --wait
```

Key Features
Formats
- markdown: Clean markdown content
- html: Parsed HTML
- rawHtml: Original HTML
- links: All links on page
- images: All images on page
- summary: AI-generated summary
- json: Structured data extraction with schema
- screenshot: Page screenshot (PNG)
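Formats can mix plain strings with the object forms (json, screenshot) in a single request. A small illustrative helper, not part of the skill:

```javascript
// Append an object-form screenshot format to a list of simple string formats.
// fullPage / quality defaults mirror the screenshot example earlier in this doc.
function withScreenshot(formats, fullPage = true, quality = 85) {
  return [...formats, { type: "screenshot", fullPage, quality }];
}
```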
Content Control
- onlyMainContent: Extract only the main content (default: true)
- includeTags: CSS selectors to include
- excludeTags: CSS selectors to exclude
- waitFor: Wait time before scraping (ms)
- maxAge: Cache duration (default: 48 hours)
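A minimal sketch of assembling a scrape payload around these knobs. The field names come from the list above; onlyMainContent: true is the documented default, while the other starting values are assumptions for illustration:

```javascript
// Build a scrape payload with content-control options; overrides win.
function buildScrapePayload(url, overrides = {}) {
  return {
    url,
    formats: ["markdown"],
    onlyMainContent: true, // documented default
    includeTags: [],
    excludeTags: [],
    waitFor: 0,            // ms to wait before scraping
    ...overrides,
  };
}

const payload = buildScrapePayload("https://example.com", {
  excludeTags: ["nav", "footer"],
  waitFor: 1000,
});
```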
Actions (Browser Automation)
- wait: Wait for a specified time
- click: Click an element by selector
- write: Input text into a field
- press: Press a keyboard key
- scroll: Scroll the page
- executeJavascript: Run custom JS
Crawl Options
- includePaths: Regex patterns to include
- excludePaths: Regex patterns to exclude
- maxDiscoveryDepth: Maximum crawl depth
- limit: Maximum number of pages to crawl
- allowExternalLinks: Follow external links
- allowSubdomains: Follow subdomains
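Because includePaths/excludePaths are regular expressions, it can help to preview locally which paths a crawl configuration would keep. The actual filtering happens on Firecrawl's side; this is only a local approximation:

```javascript
// Preview which paths survive a crawl's include/exclude regex filters.
// An empty includePaths list is treated as "include everything".
function filterPaths(paths, includePaths = [], excludePaths = []) {
  const inc = includePaths.map((p) => new RegExp(p));
  const exc = excludePaths.map((p) => new RegExp(p));
  return paths.filter(
    (p) => (inc.length === 0 || inc.some((r) => r.test(p))) &&
           !exc.some((r) => r.test(p))
  );
}
```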
Environment Variables & API Key
Two ways to configure the API key (priority: environment variable > .env file):
- Environment variable: FIRECRAWL_API_KEY
- .env file: place it at .claude/skills/firecrawl-scraper/.env (a template can be copied from .env.example)
Response Format
All endpoints return JSON with:
- success: Boolean indicating success
- data: Extracted content (format depends on endpoint)
- For crawl: returns a job ID; use crawl-status (or GET /v2/crawl/{id}) to check status
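A hedged sketch of consuming this envelope. Only success and data are documented above; the id field for asynchronous crawl jobs and the error field are assumptions based on the /v2/crawl flow:

```javascript
// Interpret a Firecrawl-style response envelope.
// res.id (crawl job ID) and res.error are assumed fields, not documented above.
function summarizeResponse(res) {
  if (!res.success) return { ok: false, error: res.error ?? "unknown error" };
  if (res.id) return { ok: true, jobId: res.id }; // async crawl: poll crawl-status
  return { ok: true, content: res.data };         // synchronous endpoints
}
```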