firecrawl-scraper
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
ChineseFirecrawl Web Scraper Skill
Firecrawl网页抓取技能
Status: Production Ready
Last Updated: 2026-01-20
Official Docs: https://docs.firecrawl.dev
API Version: v2
SDK Versions: firecrawl-py 4.13.0+, @mendable/firecrawl-js 4.11.1+
状态:已就绪可用于生产环境
最后更新:2026年1月20日
官方文档:https://docs.firecrawl.dev
API版本:v2
SDK版本:firecrawl-py 4.13.0及以上,@mendable/firecrawl-js 4.11.1及以上
What is Firecrawl?
什么是Firecrawl?
Firecrawl is a Web Data API for AI that turns websites into LLM-ready markdown or structured data. It handles:
- JavaScript rendering - Executes client-side JavaScript to capture dynamic content
- Anti-bot bypass - Gets past CAPTCHA and bot detection systems
- Format conversion - Outputs as markdown, HTML, JSON, screenshots, summaries
- Document parsing - Processes PDFs, DOCX files, and images
- Autonomous agents - AI-powered web data gathering without URLs
- Change tracking - Monitor content changes over time
- Branding extraction - Extract color schemes, typography, logos
Firecrawl是一款面向AI的网页数据API,可将网站内容转换为LLM适用的Markdown格式或结构化数据。它支持:
- JavaScript渲染 - 执行客户端JavaScript以捕获动态内容
- 反机器人验证绕过 - 突破CAPTCHA和机器人检测系统
- 格式转换 - 输出为Markdown、HTML、JSON、截图、摘要
- 文档解析 - 处理PDF、DOCX文件及图片
- 自主Agent - 无需URL,由AI驱动的网页数据收集
- 变更追踪 - 监控内容随时间的变化
- 品牌信息提取 - 提取配色方案、排版、Logo
API Endpoints Overview
API接口概览
| Endpoint | Purpose | Use Case |
|---|---|---|
| Single page | Extract article, product page |
| Full site | Index docs, archive sites |
| URL discovery | Find all pages, plan strategy |
| Web search + scrape | Research with live data |
| Structured data | Product prices, contacts |
| Autonomous gathering | No URLs needed, AI navigates |
| Multiple URLs | Bulk processing |
| 接口 | 用途 | 适用场景 |
|---|---|---|
| 单页面抓取 | 提取文章、产品页面内容 |
| 全站爬取 | 文档索引、网站归档 |
| URL发现 | 查找所有页面、规划爬取策略 |
| 网页搜索+抓取 | 基于实时数据的调研 |
| 结构化数据提取 | 提取产品价格、联系信息 |
| 自主数据收集 | 无需提供URL,由AI导航完成 |
| 多URL批量处理 | 批量数据抓取 |
1. Scrape Endpoint (/v2/scrape
)
/v2/scrape1. 抓取接口(/v2/scrape
)
/v2/scrapeScrapes a single webpage and returns clean, structured content.
抓取单个网页并返回整洁的结构化内容。
Basic Usage
基础用法
python
from firecrawl import Firecrawl
import os
app = Firecrawl(api_key=os.environ.get("FIRECRAWL_API_KEY"))python
from firecrawl import Firecrawl
import os
app = Firecrawl(api_key=os.environ.get("FIRECRAWL_API_KEY"))Basic scrape
基础抓取
doc = app.scrape(
url="https://example.com/article",
formats=["markdown", "html"],
only_main_content=True
)
print(doc.markdown)
print(doc.metadata)
```typescript
import FirecrawlApp from '@mendable/firecrawl-js';
const app = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY });
const result = await app.scrapeUrl('https://example.com/article', {
formats: ['markdown', 'html'],
onlyMainContent: true
});
console.log(result.markdown);doc = app.scrape(
url="https://example.com/article",
formats=["markdown", "html"],
only_main_content=True
)
print(doc.markdown)
print(doc.metadata)
```typescript
import FirecrawlApp from '@mendable/firecrawl-js';
const app = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY });
const result = await app.scrapeUrl('https://example.com/article', {
formats: ['markdown', 'html'],
onlyMainContent: true
});
console.log(result.markdown);Output Formats
输出格式
| Format | Description |
|---|---|
| LLM-optimized content |
| Full HTML |
| Unprocessed HTML |
| Page capture (with viewport options) |
| All URLs on page |
| Structured data extraction |
| AI-generated summary |
| Design system data |
| Content change detection |
| 格式 | 说明 |
|---|---|
| 针对LLM优化的内容格式 |
| 完整HTML内容 |
| 未处理的原始HTML |
| 页面截图(支持视口配置) |
| 页面中所有URL |
| 结构化数据提取结果 |
| AI生成的内容摘要 |
| 设计系统数据 |
| 内容变更检测结果 |
Advanced Options
高级配置
python
doc = app.scrape(
url="https://example.com",
formats=["markdown", "screenshot"],
only_main_content=True,
remove_base64_images=True,
wait_for=5000, # Wait 5s for JS
timeout=30000,
# Location & language
location={"country": "AU", "languages": ["en-AU"]},
# Cache control
max_age=0, # Fresh content (no cache)
store_in_cache=True,
# Stealth mode for complex sites
stealth=True,
# Custom headers
headers={"User-Agent": "Custom Bot 1.0"}
)python
doc = app.scrape(
url="https://example.com",
formats=["markdown", "screenshot"],
only_main_content=True,
remove_base64_images=True,
wait_for=5000, # 等待5秒以加载JavaScript
timeout=30000,
# 地区与语言配置
location={"country": "AU", "languages": ["en-AU"]},
# 缓存控制
max_age=0, # 获取最新内容(不使用缓存)
store_in_cache=True,
# 针对复杂网站启用隐身模式
stealth=True,
# 自定义请求头
headers={"User-Agent": "Custom Bot 1.0"}
)Browser Actions
浏览器交互操作
Perform interactions before scraping:
python
doc = app.scrape(
url="https://example.com",
actions=[
{"type": "click", "selector": "button.load-more"},
{"type": "wait", "milliseconds": 2000},
{"type": "scroll", "direction": "down"},
{"type": "write", "selector": "input#search", "text": "query"},
{"type": "press", "key": "Enter"},
{"type": "screenshot"} # Capture state mid-action
]
)在抓取前执行页面交互:
python
doc = app.scrape(
url="https://example.com",
actions=[
{"type": "click", "selector": "button.load-more"},
{"type": "wait", "milliseconds": 2000},
{"type": "scroll", "direction": "down"},
{"type": "write", "selector": "input#search", "text": "query"},
{"type": "press", "key": "Enter"},
{"type": "screenshot"} # 捕获交互过程中的页面状态
]
)JSON Mode (Structured Extraction)
JSON模式(结构化提取)
python
undefinedpython
undefinedWith schema
带Schema的提取
doc = app.scrape(
url="https://example.com/product",
formats=["json"],
json_options={
"schema": {
"type": "object",
"properties": {
"title": {"type": "string"},
"price": {"type": "number"},
"in_stock": {"type": "boolean"}
}
}
}
)
doc = app.scrape(
url="https://example.com/product",
formats=["json"],
json_options={
"schema": {
"type": "object",
"properties": {
"title": {"type": "string"},
"price": {"type": "number"},
"in_stock": {"type": "boolean"}
}
}
}
)
Without schema (prompt-only)
不带Schema(仅用提示词)
doc = app.scrape(
url="https://example.com/product",
formats=["json"],
json_options={
"prompt": "Extract the product name, price, and availability"
}
)
undefineddoc = app.scrape(
url="https://example.com/product",
formats=["json"],
json_options={
"prompt": "Extract the product name, price, and availability"
}
)
undefinedBranding Extraction
品牌信息提取
Extract design system and brand identity:
python
doc = app.scrape(
url="https://example.com",
formats=["branding"]
)提取设计系统与品牌标识:
python
doc = app.scrape(
url="https://example.com",
formats=["branding"]
)Returns:
返回内容包括:
- Color schemes and palettes
- 配色方案与调色板
- Typography (fonts, sizes, weights)
- 排版信息(字体、字号、字重)
- Spacing and layout metrics
- 间距与布局规范
- UI component styles
- UI组件样式
- Logo and imagery URLs
- Logo与图片URL
- Brand personality traits
- 品牌个性特征
---
---2. Crawl Endpoint (/v2/crawl
)
/v2/crawl2. 全站爬取接口(/v2/crawl
)
/v2/crawlCrawls all accessible pages from a starting URL.
python
result = app.crawl(
url="https://docs.example.com",
limit=100,
max_depth=3,
allowed_domains=["docs.example.com"],
exclude_paths=["/api/*", "/admin/*"],
scrape_options={
"formats": ["markdown"],
"only_main_content": True
}
)
for page in result.data:
print(f"Scraped: {page.metadata.source_url}")
print(f"Content: {page.markdown[:200]}...")从起始URL开始爬取所有可访问的页面。
python
result = app.crawl(
url="https://docs.example.com",
limit=100,
max_depth=3,
allowed_domains=["docs.example.com"],
exclude_paths=["/api/*", "/admin/*"],
scrape_options={
"formats": ["markdown"],
"only_main_content": True
}
)
for page in result.data:
print(f"已抓取: {page.metadata.source_url}")
print(f"内容: {page.markdown[:200]}...")Async Crawl with Webhooks
异步爬取与Webhook
python
undefinedpython
undefinedStart crawl (returns immediately)
启动爬取(立即返回)
job = app.start_crawl(
url="https://docs.example.com",
limit=1000,
webhook="https://your-domain.com/webhook"
)
print(f"Job ID: {job.id}")
job = app.start_crawl(
url="https://docs.example.com",
limit=1000,
webhook="https://your-domain.com/webhook"
)
print(f"任务ID: {job.id}")
Or poll for status
或轮询任务状态
status = app.check_crawl_status(job.id)
---status = app.check_crawl_status(job.id)
---3. Map Endpoint (/v2/map
)
/v2/map3. URL映射接口(/v2/map
)
/v2/mapRapidly discover all URLs on a website without scraping content.
python
urls = app.map(url="https://example.com")
print(f"Found {len(urls)} pages")
for url in urls[:10]:
print(url)Use for: sitemap discovery, crawl planning, website audits.
快速发现网站上的所有URL,无需抓取内容。
python
urls = app.map(url="https://example.com")
print(f"发现 {len(urls)} 个页面")
for url in urls[:10]:
print(url)适用场景:站点地图发现、爬取规划、网站审计。
4. Search Endpoint (/search
) - NEW
/search4. 搜索接口(/search
)- 新增功能
/searchPerform web searches and optionally scrape the results in one operation.
python
undefined执行网页搜索并可选择直接抓取搜索结果。
python
undefinedBasic search
基础搜索
results = app.search(
query="best practices for React server components",
limit=10
)
for result in results:
print(f"{result.title}: {result.url}")
results = app.search(
query="best practices for React server components",
limit=10
)
for result in results:
print(f"{result.title}: {result.url}")
Search + scrape results
搜索+抓取结果
results = app.search(
query="React server components tutorial",
limit=5,
scrape_options={
"formats": ["markdown"],
"only_main_content": True
}
)
for result in results:
print(f"{result.title}")
print(result.markdown[:500])
undefinedresults = app.search(
query="React server components tutorial",
limit=5,
scrape_options={
"formats": ["markdown"],
"only_main_content": True
}
)
for result in results:
print(f"{result.title}")
print(result.markdown[:500])
undefinedSearch Options
搜索配置
python
results = app.search(
query="machine learning papers",
limit=20,
# Filter by source type
sources=["web", "news", "images"],
# Filter by category
categories=["github", "research", "pdf"],
# Location
location={"country": "US"},
# Time filter
tbs="qdr:m", # Past month (qdr:h=hour, qdr:d=day, qdr:w=week, qdr:y=year)
timeout=30000
)Cost: 2 credits per 10 results + scraping costs if enabled.
python
results = app.search(
query="machine learning papers",
limit=20,
# 按来源类型过滤
sources=["web", "news", "images"],
# 按分类过滤
categories=["github", "research", "pdf"],
# 地区配置
location={"country": "US"},
# 时间过滤
tbs="qdr:m", # 过去一个月(qdr:h=小时, qdr:d=天, qdr:w=周, qdr:y=年)
timeout=30000
)成本:每10条搜索结果消耗2积分,若启用抓取则额外收取抓取费用。
5. Extract Endpoint (/v2/extract
)
/v2/extract5. 结构化提取接口(/v2/extract
)
/v2/extractAI-powered structured data extraction from single pages, multiple pages, or entire domains.
基于AI的结构化数据提取,支持单页面、多页面或整个域名。
Single Page
单页面提取
python
from pydantic import BaseModel
class Product(BaseModel):
name: str
price: float
description: str
in_stock: bool
result = app.extract(
urls=["https://example.com/product"],
schema=Product,
system_prompt="Extract product information"
)
print(result.data)python
from pydantic import BaseModel
class Product(BaseModel):
name: str
price: float
description: str
in_stock: bool
result = app.extract(
urls=["https://example.com/product"],
schema=Product,
system_prompt="Extract product information"
)
print(result.data)Multi-Page / Domain Extraction
多页面/域名提取
python
undefinedpython
undefinedExtract from entire domain using wildcard
通配符匹配整个域名
result = app.extract(
urls=["example.com/*"], # All pages on domain
schema=Product,
system_prompt="Extract all products"
)
result = app.extract(
urls=["example.com/*"], # 域名下所有页面
schema=Product,
system_prompt="Extract all products"
)
Enable web search for additional context
启用网页搜索获取额外上下文
result = app.extract(
urls=["example.com/products"],
schema=Product,
enable_web_search=True # Follow external links
)
undefinedresult = app.extract(
urls=["example.com/products"],
schema=Product,
enable_web_search=True # 跟随外部链接
)
undefinedPrompt-Only Extraction (No Schema)
仅用提示词提取(无Schema)
python
result = app.extract(
urls=["https://example.com/about"],
prompt="Extract the company name, founding year, and key executives"
)python
result = app.extract(
urls=["https://example.com/about"],
prompt="Extract the company name, founding year, and key executives"
)LLM determines output structure
由LLM自动决定输出结构
---
---6. Agent Endpoint (/agent
) - NEW
/agent6. Agent接口(/agent
)- 新增功能
/agentAutonomous web data gathering without requiring specific URLs. The agent searches, navigates, and gathers data using natural language prompts.
python
undefined无需指定具体URL,实现自主网页数据收集。Agent通过自然语言提示词完成搜索、导航与数据收集。
python
undefinedBasic agent usage
基础Agent用法
result = app.agent(
prompt="Find the pricing plans for the top 3 headless CMS platforms and compare their features"
)
print(result.data)
result = app.agent(
prompt="Find the pricing plans for the top 3 headless CMS platforms and compare their features"
)
print(result.data)
With schema for structured output
带Schema的结构化输出
from pydantic import BaseModel
from typing import List
class CMSPricing(BaseModel):
name: str
free_tier: bool
starter_price: float
features: List[str]
result = app.agent(
prompt="Find pricing for Contentful, Sanity, and Strapi",
schema=CMSPricing
)
from pydantic import BaseModel
from typing import List
class CMSPricing(BaseModel):
name: str
free_tier: bool
starter_price: float
features: List[str]
result = app.agent(
prompt="Find pricing for Contentful, Sanity, and Strapi",
schema=CMSPricing
)
Optional: focus on specific URLs
可选:限定在指定URL范围内
result = app.agent(
prompt="Extract the enterprise pricing details",
urls=["https://contentful.com/pricing", "https://sanity.io/pricing"]
)
undefinedresult = app.agent(
prompt="Extract the enterprise pricing details",
urls=["https://contentful.com/pricing", "https://sanity.io/pricing"]
)
undefinedAgent Models
Agent模型
| Model | Best For | Cost |
|---|---|---|
| Simple extractions, high volume | Standard |
| Complex analysis, ambiguous data | 60% more |
python
result = app.agent(
prompt="Analyze competitive positioning...",
model="spark-1-pro" # For complex tasks
)| 模型 | 适用场景 | 成本 |
|---|---|---|
| 简单提取、高并发场景 | 标准定价 |
| 复杂分析、模糊数据处理 | 高出60% |
python
result = app.agent(
prompt="Analyze competitive positioning...",
model="spark-1-pro" # 适用于复杂任务
)Async Agent
异步Agent
python
undefinedpython
undefinedStart agent (returns immediately)
启动Agent任务(立即返回)
job = app.start_agent(
prompt="Research market trends..."
)
job = app.start_agent(
prompt="Research market trends..."
)
Poll for results
轮询结果
status = app.check_agent_status(job.id)
if status.status == "completed":
print(status.data)
**Note**: Agent is in Research Preview. 5 free daily requests, then credit-based billing.
---status = app.check_agent_status(job.id)
if status.status == "completed":
print(status.data)
**注意**:Agent功能处于研究预览阶段。每日免费5次请求,超出后按积分计费。
---7. Batch Scrape - NEW
7. 批量抓取 - 新增功能
Process multiple URLs efficiently in a single operation.
在单个操作中高效处理多个URL。
Synchronous (waits for completion)
同步模式(等待完成)
python
results = app.batch_scrape(
urls=[
"https://example.com/page1",
"https://example.com/page2",
"https://example.com/page3"
],
formats=["markdown"],
only_main_content=True
)
for page in results.data:
print(f"{page.metadata.source_url}: {len(page.markdown)} chars")python
results = app.batch_scrape(
urls=[
"https://example.com/page1",
"https://example.com/page2",
"https://example.com/page3"
],
formats=["markdown"],
only_main_content=True
)
for page in results.data:
print(f"{page.metadata.source_url}: {len(page.markdown)} 字符")Asynchronous (with webhooks)
异步模式(带Webhook)
python
job = app.start_batch_scrape(
urls=url_list,
formats=["markdown"],
webhook="https://your-domain.com/webhook"
)python
job = app.start_batch_scrape(
urls=url_list,
formats=["markdown"],
webhook="https://your-domain.com/webhook"
)Webhook receives events: started, page, completed, failed
Webhook接收事件:started, page, completed, failed
```typescript
const job = await app.startBatchScrape(urls, {
formats: ['markdown'],
webhook: 'https://your-domain.com/webhook'
});
// Poll for status
const status = await app.checkBatchScrapeStatus(job.id);
```typescript
const job = await app.startBatchScrape(urls, {
formats: ['markdown'],
webhook: 'https://your-domain.com/webhook'
});
// 轮询任务状态
const status = await app.checkBatchScrapeStatus(job.id);8. Change Tracking - NEW
8. 变更追踪 - 新增功能
Monitor content changes over time by comparing scrapes.
python
undefined通过对比多次抓取结果,监控内容随时间的变化。
python
undefinedEnable change tracking
启用变更追踪
doc = app.scrape(
url="https://example.com/pricing",
formats=["markdown", "changeTracking"]
)
doc = app.scrape(
url="https://example.com/pricing",
formats=["markdown", "changeTracking"]
)
Response includes:
响应包含:
print(doc.change_tracking.status) # new, same, changed, removed
print(doc.change_tracking.previous_scrape_at)
print(doc.change_tracking.visibility) # visible, hidden
undefinedprint(doc.change_tracking.status) # new, same, changed, removed
print(doc.change_tracking.previous_scrape_at)
print(doc.change_tracking.visibility) # visible, hidden
undefinedComparison Modes
对比模式
python
undefinedpython
undefinedGit-diff mode (default)
Git差异模式(默认)
doc = app.scrape(
url="https://example.com/docs",
formats=["markdown", "changeTracking"],
change_tracking_options={
"mode": "diff"
}
)
print(doc.change_tracking.diff) # Line-by-line changes
doc = app.scrape(
url="https://example.com/docs",
formats=["markdown", "changeTracking"],
change_tracking_options={
"mode": "diff"
}
)
print(doc.change_tracking.diff) # 逐行变更对比
JSON mode (structured comparison)
JSON模式(结构化对比)
doc = app.scrape(
url="https://example.com/pricing",
formats=["markdown", "changeTracking"],
change_tracking_options={
"mode": "json",
"schema": {"type": "object", "properties": {"price": {"type": "number"}}}
}
)
doc = app.scrape(
url="https://example.com/pricing",
formats=["markdown", "changeTracking"],
change_tracking_options={
"mode": "json",
"schema": {"type": "object", "properties": {"price": {"type": "number"}}}
}
)
Costs 5 credits per page
每页消耗5积分
**Change States**:
- `new` - Page not seen before
- `same` - No changes since last scrape
- `changed` - Content modified
- `removed` - Page no longer accessible
---
**变更状态**:
- `new` - 页面首次被抓取
- `same` - 自上次抓取后无变更
- `changed` - 内容已修改
- `removed` - 页面已无法访问
---Authentication
身份验证
bash
undefinedbash
undefinedGet API key from https://www.firecrawl.dev/app
Store in environment
存储到环境变量中
FIRECRAWL_API_KEY=fc-your-api-key-here
**Never hardcode API keys!**
---FIRECRAWL_API_KEY=fc-your-api-key-here
**切勿硬编码API密钥!**
---Cloudflare Workers Integration
Cloudflare Workers集成
The Firecrawl SDK cannot run in Cloudflare Workers (requires Node.js). Use the REST API directly:
typescript
interface Env {
FIRECRAWL_API_KEY: string;
}
export default {
async fetch(request: Request, env: Env): Promise<Response> {
const { url } = await request.json<{ url: string }>();
const response = await fetch('https://api.firecrawl.dev/v2/scrape', {
method: 'POST',
headers: {
'Authorization': `Bearer ${env.FIRECRAWL_API_KEY}`,
'Content-Type': 'application/json',
},
body: JSON.stringify({
url,
formats: ['markdown'],
onlyMainContent: true
})
});
const result = await response.json();
return Response.json(result);
}
};Firecrawl SDK无法在Cloudflare Workers中运行(依赖Node.js环境)。请直接使用REST API:
typescript
interface Env {
FIRECRAWL_API_KEY: string;
}
export default {
async fetch(request: Request, env: Env): Promise<Response> {
const { url } = await request.json<{ url: string }>();
const response = await fetch('https://api.firecrawl.dev/v2/scrape', {
method: 'POST',
headers: {
'Authorization': `Bearer ${env.FIRECRAWL_API_KEY}`,
'Content-Type': 'application/json',
},
body: JSON.stringify({
url,
formats: ['markdown'],
onlyMainContent: true
})
});
const result = await response.json();
return Response.json(result);
}
};Rate Limits & Pricing
速率限制与定价
Warning: Stealth Mode Pricing Change (May 2025)
警告:隐身模式定价变更(2025年5月)
Stealth mode now costs 5 credits per request when actively used. Default behavior uses "auto" mode which only charges stealth credits if basic fails.
Recommended pattern:
python
undefined隐身模式当前采用主动使用时每请求5积分的计费方式。默认的"auto"模式仅在基础抓取失败时才会启用隐身模式并收取对应积分。
推荐用法:
python
undefinedUse auto mode (default) - only charges 5 credits if stealth is needed
推荐:使用默认的auto模式 - 仅在基础抓取失败时才会启用隐身模式并收取5积分
doc = app.scrape(url, formats=["markdown"])
doc = app.scrape(url, formats=["markdown"])
Or conditionally enable stealth for specific errors
或根据错误状态条件启用
if error_status_code in [401, 403, 500]:
doc = app.scrape(url, formats=["markdown"], proxy="stealth")
undefinedif error_status_code in [401, 403, 500]:
doc = app.scrape(url, formats=["markdown"], proxy="stealth")
undefinedUnified Billing (November 2025)
统一计费(2025年11月)
Credits and tokens merged into single system. Extract endpoint uses credits (15 tokens = 1 credit).
积分与令牌已合并为单一系统。提取接口使用积分(15令牌=1积分)。
Pricing Tiers
定价套餐
| Tier | Credits/Month | Notes |
|---|---|---|
| Free | 500 | Good for testing |
| Hobby | 3,000 | $19/month |
| Standard | 100,000 | $99/month |
| Growth | 500,000 | $399/month |
Credit Costs:
- Scrape: 1 credit (basic), 5 credits (stealth)
- Crawl: 1 credit per page
- Search: 2 credits per 10 results
- Extract: 5 credits per page (changed from tokens in v2.6.0)
- Agent: Dynamic (complexity-based)
- Change Tracking JSON mode: +5 credits
| 套餐 | 每月积分 | 说明 |
|---|---|---|
| 免费版 | 500 | 适用于测试 |
| 爱好者版 | 3,000 | 19美元/月 |
| 标准版 | 100,000 | 99美元/月 |
| 成长版 | 500,000 | 399美元/月 |
积分消耗:
- 抓取:1积分(基础模式),5积分(隐身模式)
- 全站爬取:每页1积分
- 搜索:每10条结果2积分
- 提取:每页5积分(v2.6.0起从令牌改为积分)
- Agent:动态计费(基于任务复杂度)
- 变更追踪JSON模式:额外+5积分
Common Issues & Solutions
常见问题与解决方案
| Issue | Cause | Solution |
|---|---|---|
| Empty content | JS not loaded | Add |
| Rate limit exceeded | Over quota | Check dashboard, upgrade plan |
| Timeout error | Slow page | Increase |
| Bot detection | Anti-scraping | Use |
| Invalid API key | Wrong format | Must start with |
| 问题 | 原因 | 解决方案 |
|---|---|---|
| 内容为空 | JavaScript未加载 | 添加 |
| 超出速率限制 | 超出配额 | 查看控制台,升级套餐 |
| 超时错误 | 页面加载缓慢 | 增加 |
| 机器人检测 | 反爬机制拦截 | 使用 |
| API密钥无效 | 格式错误 | 密钥必须以 |
Known Issues Prevention
已知问题预防
This skill prevents 10 documented issues:
本技能可预防10种已记录的问题:
Issue #1: Stealth Mode Pricing Change (May 2025)
问题1:隐身模式定价变更(2025年5月)
Error: Unexpected credit costs when using stealth mode
Source: Stealth Mode Docs | Changelog
Why It Happens: Starting May 8th, 2025, Stealth Mode proxy requests cost 5 credits per request (previously included in standard pricing). This is a significant billing change.
Prevention: Use auto mode (default) which only charges stealth credits if basic fails
python
undefinedRECOMMENDED: Use auto mode (default)
推荐:使用默认的auto模式
doc = app.scrape(url, formats=['markdown'])
doc = app.scrape(url, formats=['markdown'])
Auto retries with stealth (5 credits) only if basic fails
自动重试并仅在基础抓取失败时启用隐身模式(消耗5积分)
Or conditionally enable based on error status
或根据错误状态条件启用
try:
doc = app.scrape(url, formats=['markdown'], proxy='basic')
except Exception as e:
if e.status_code in [401, 403, 500]:
doc = app.scrape(url, formats=['markdown'], proxy='stealth')
**Stealth Mode Options**:
- `auto` (default): Charges 5 credits only if stealth succeeds after basic fails
- `basic`: Standard proxies, 1 credit cost
- `stealth`: 5 credits per request when actively used
---try:
doc = app.scrape(url, formats=['markdown'], proxy='basic')
except Exception as e:
if e.status_code in [401, 403, 500]:
doc = app.scrape(url, formats=['markdown'], proxy='stealth')
**隐身模式选项**:
- `auto`(默认):仅在基础抓取失败且隐身模式成功时收取5积分
- `basic`:标准代理,每请求1积分
- `stealth`:主动使用时每请求5积分
---Issue #2: v2.0.0 Breaking Changes - Method Renames
问题2:v2.0.0破坏性变更 - 方法重命名
Error:
Source: v2.0.0 Release | Migration Guide
Why It Happens: v2.0.0 (August 2025) renamed SDK methods across all languages
Prevention: Use new method names
AttributeError: 'FirecrawlApp' object has no attribute 'scrape_url'JavaScript/TypeScript:
- →
scrapeUrl()scrape() - →
crawlUrl()orcrawl()startCrawl() - →
asyncCrawlUrl()startCrawl() - →
checkCrawlStatus()getCrawlStatus()
Python:
- →
scrape_url()scrape() - →
crawl_url()orcrawl()start_crawl()
python
undefined错误:
来源:v2.0.0版本发布 | 迁移指南
原因:v2.0.0(2025年8月)对所有语言的SDK方法进行了重命名
预防方案:使用新的方法名
AttributeError: 'FirecrawlApp' object has no attribute 'scrape_url'JavaScript/TypeScript:
- →
scrapeUrl()scrape() - →
crawlUrl()或crawl()startCrawl() - →
asyncCrawlUrl()startCrawl() - →
checkCrawlStatus()getCrawlStatus()
Python:
- →
scrape_url()scrape() - →
crawl_url()或crawl()start_crawl()
python
undefinedOLD (v1)
旧版(v1)
doc = app.scrape_url("https://example.com")
doc = app.scrape_url("https://example.com")
NEW (v2)
新版(v2)
doc = app.scrape("https://example.com")
---doc = app.scrape("https://example.com")
---Issue #3: v2.0.0 Breaking Changes - Format Changes
问题3:v2.0.0破坏性变更 - 格式变更
Error:
Source: v2.0.0 Release
Why It Happens: Old format renamed to in v2.0.0
Prevention: Use new object format for JSON extraction
'extract' is not a valid format"extract""json"python
undefined错误:
来源:v2.0.0版本发布
原因:旧版的格式在v2.0.0中重命名为
预防方案:使用JSON提取的新对象格式
'extract' is not a valid format"extract""json"python
undefinedOLD (v1)
旧版(v1)
doc = app.scrape_url(
url="https://example.com",
params={
"formats": ["extract"],
"extract": {"prompt": "Extract title"}
}
)
doc = app.scrape_url(
url="https://example.com",
params={
"formats": ["extract"],
"extract": {"prompt": "Extract title"}
}
)
NEW (v2)
新版(v2)
doc = app.scrape(
url="https://example.com",
formats=[{"type": "json", "prompt": "Extract title"}]
)
doc = app.scrape(
url="https://example.com",
formats=[{"type": "json", "prompt": "Extract title"}]
)
With schema
带Schema的提取
doc = app.scrape(
url="https://example.com",
formats=[{
"type": "json",
"prompt": "Extract product info",
"schema": {
"type": "object",
"properties": {
"title": {"type": "string"},
"price": {"type": "number"}
}
}
}]
)
**Screenshot format also changed**:
```pythondoc = app.scrape(
url="https://example.com",
formats=[{
"type": "json",
"prompt": "Extract product info",
"schema": {
"type": "object",
"properties": {
"title": {"type": "string"},
"price": {"type": "number"}
}
}
}]
)
**截图格式也已变更**:
```pythonNEW: Screenshot as object
新版:截图配置为对象
formats=[{
"type": "screenshot",
"fullPage": True,
"quality": 80,
"viewport": {"width": 1920, "height": 1080}
}]
---formats=[{
"type": "screenshot",
"fullPage": True,
"quality": 80,
"viewport": {"width": 1920, "height": 1080}
}]
---Issue #4: v2.0.0 Breaking Changes - Crawl Options
问题4:v2.0.0破坏性变更 - 爬取配置项
Error:
Source: v2.0.0 Release
Why It Happens: Several crawl parameters renamed or removed in v2.0.0
Prevention: Use new parameter names
'allowBackwardCrawling' is not a valid parameterParameter Changes:
- → Use
allowBackwardCrawlinginsteadcrawlEntireDomain - → Use
maxDepthinsteadmaxDiscoveryDepth - (bool) →
ignoreSitemap("only", "skip", "include")sitemap
python
undefined错误:
来源:v2.0.0版本发布
原因:v2.0.0中多个爬取配置项被重命名或移除
预防方案:使用新的配置项名称
'allowBackwardCrawling' is not a valid parameter配置项变更:
- → 改用
allowBackwardCrawlingcrawlEntireDomain - → 改用
maxDepthmaxDiscoveryDepth - (布尔值)→
ignoreSitemap(可选值:"only", "skip", "include")sitemap
python
undefinedOLD (v1)
旧版(v1)
app.crawl_url(
url="https://docs.example.com",
params={
"allowBackwardCrawling": True,
"maxDepth": 3,
"ignoreSitemap": False
}
)
app.crawl_url(
url="https://docs.example.com",
params={
"allowBackwardCrawling": True,
"maxDepth": 3,
"ignoreSitemap": False
}
)
NEW (v2)
新版(v2)
app.crawl(
url="https://docs.example.com",
crawl_entire_domain=True,
max_discovery_depth=3,
sitemap="include" # "only", "skip", or "include"
)
---app.crawl(
url="https://docs.example.com",
crawl_entire_domain=True,
max_discovery_depth=3,
sitemap="include" # 可选值:"only", "skip", "include"
)
---Issue #5: v2.0.0 Default Behavior Changes
问题5:v2.0.0默认行为变更
Error: Stale cached content returned unexpectedly
Source: v2.0.0 Release
Why It Happens: v2.0.0 changed several defaults
Prevention: Be aware of new defaults
Default Changes:
- now defaults to 2 days (cached by default)
maxAge - ,
blockAds,skipTlsVerificationenabled by defaultremoveBase64Images
python
undefined错误:意外返回缓存的旧内容
来源:v2.0.0版本发布
原因:v2.0.0修改了多项默认配置
预防方案:了解新的默认配置
默认配置变更:
- 现在默认值为2天(默认启用缓存)
maxAge - ,
blockAds,skipTlsVerification默认启用removeBase64Images
python
undefinedForce fresh data if needed
如需强制获取最新数据
doc = app.scrape(url, formats=['markdown'], max_age=0)
doc = app.scrape(url, formats=['markdown'], max_age=0)
Disable cache entirely
完全禁用缓存
doc = app.scrape(url, formats=['markdown'], store_in_cache=False)
---doc = app.scrape(url, formats=['markdown'], store_in_cache=False)
---Issue #6: Job Status Race Condition
问题6:任务状态竞态条件
Error: when checking crawl status immediately after creation
Source: GitHub Issue #2662
Why It Happens: Database replication delay between job creation and status endpoint availability
Prevention: Wait 1-3 seconds before first status check, or implement retry logic
"Job not found"python
import time错误:创建爬取任务后立即检查状态时出现
来源:GitHub Issue #2662
原因:任务创建与状态接口可用之间存在数据库复制延迟
预防方案:首次状态检查前等待1-3秒,或实现重试逻辑
"Job not found"python
import timeStart crawl
启动爬取任务
job = app.start_crawl(url="https://docs.example.com")
print(f"Job ID: {job.id}")
job = app.start_crawl(url="https://docs.example.com")
print(f"任务ID: {job.id}")
REQUIRED: Wait before first status check
必须:首次状态检查前等待
time.sleep(2) # 1-3 seconds recommended
time.sleep(2) # 推荐等待1-3秒
Now status check succeeds
此时状态检查可成功
status = app.get_crawl_status(job.id)
status = app.get_crawl_status(job.id)
Or implement retry logic
或实现重试逻辑
def get_status_with_retry(job_id, max_retries=3, delay=1):
for attempt in range(max_retries):
try:
return app.get_crawl_status(job_id)
except Exception as e:
if "Job not found" in str(e) and attempt < max_retries - 1:
time.sleep(delay)
continue
raise
status = get_status_with_retry(job.id)
---def get_status_with_retry(job_id, max_retries=3, delay=1):
for attempt in range(max_retries):
try:
return app.get_crawl_status(job_id)
except Exception as e:
if "Job not found" in str(e) and attempt < max_retries - 1:
time.sleep(delay)
continue
raise
status = get_status_with_retry(job.id)
---Issue #7: DNS Errors Return HTTP 200
问题7:DNS错误返回HTTP 200状态码
Error: DNS resolution failures return with HTTP 200 status instead of 4xx
Source: GitHub Issue #2402 | Fixed in v2.7.0
Why It Happens: Changed in v2.7.0 for consistent error handling
Prevention: Check field and field, don't rely on HTTP status alone
success: falsesuccesscodetypescript
const result = await app.scrape('https://nonexistent-domain-xyz.com');
// DON'T rely on HTTP status code
// Response: HTTP 200 with { success: false, code: "SCRAPE_DNS_RESOLUTION_ERROR" }
// DO check success field
if (!result.success) {
if (result.code === 'SCRAPE_DNS_RESOLUTION_ERROR') {
console.error('DNS resolution failed');
}
throw new Error(result.error);
}Note: DNS resolution errors still charge 1 credit despite failure.
错误:DNS解析失败时返回但HTTP状态码为200而非4xx
来源:GitHub Issue #2402 | v2.7.0中已修复
原因:v2.7.0中为了统一错误处理逻辑而修改
预防方案:检查字段与字段,不要仅依赖HTTP状态码
success: falsesuccesscodetypescript
const result = await app.scrape('https://nonexistent-domain-xyz.com');
// 不要仅依赖HTTP状态码
// 响应:HTTP 200,内容为{ success: false, code: "SCRAPE_DNS_RESOLUTION_ERROR" }
// 应检查success字段
if (!result.success) {
if (result.code === 'SCRAPE_DNS_RESOLUTION_ERROR') {
console.error('DNS解析失败');
}
throw new Error(result.error);
}注意:即使DNS解析失败,仍会扣除1积分。
Issue #8: Bot Detection Still Charges Credits
问题8:机器人检测仍会扣除积分
Error: Cloudflare error page returned as "successful" scrape, credits charged
Source: GitHub Issue #2413
Why It Happens: Fire-1 engine charges credits even when bot detection prevents access
Prevention: Validate content isn't an error page before processing; use stealth mode for protected sites
python
undefined错误:返回Cloudflare错误页面却被标记为“成功”抓取,扣除积分
来源:GitHub Issue #2413
原因:Fire-1引擎即使在被机器人检测拦截时仍会扣除积分
预防方案:处理前验证内容是否为错误页面;对受保护网站使用隐身模式
python
undefinedFirst attempt without stealth
首次尝试不使用隐身模式
doc = app.scrape(url="https://protected-site.com", formats=["markdown"])
doc = app.scrape(url="https://protected-site.com", formats=["markdown"])
Validate content isn't an error page
验证内容是否为错误页面
if "cloudflare" in doc.markdown.lower() or "access denied" in doc.markdown.lower():
# Retry with stealth (costs 5 credits if successful)
doc = app.scrape(url, formats=["markdown"], stealth=True)
**Cost Impact**: Basic scrape charges 1 credit even on failure, stealth retry charges additional 5 credits.
---if "cloudflare" in doc.markdown.lower() or "access denied" in doc.markdown.lower():
# 重试时启用隐身模式(成功则消耗5积分)
doc = app.scrape(url, formats=["markdown"], stealth=True)
**成本影响**:基础抓取即使失败也会扣除1积分,隐身模式重试会额外扣除5积分。
---Issue #9: Self-Hosted Anti-Bot Fingerprinting Weakness
问题9:自部署版本反机器人指纹识别缺陷
Error: (SCRAPE_ALL_ENGINES_FAILED) on sites with anti-bot measures
Source: GitHub Issue #2257
Why It Happens: Self-hosted Firecrawl lacks advanced anti-fingerprinting techniques present in cloud service
Prevention: Use Firecrawl cloud service for sites with strong anti-bot measures, or configure proxy
"All scraping engines failed!"bash
undefined错误:在有反机器人措施的网站上出现(SCRAPE_ALL_ENGINES_FAILED)
来源:GitHub Issue #2257
原因:自部署的Firecrawl缺少云服务中具备的高级反指纹识别技术
预防方案:对有强反机器人措施的网站使用Firecrawl云服务,或配置代理
"All scraping engines failed!"bash
undefinedSelf-hosted fails on Cloudflare-protected sites
自部署版本在Cloudflare保护的网站上失败
curl -X POST 'http://localhost:3002/v2/scrape'
-H 'Authorization: Bearer YOUR_API_KEY'
-d '{ "url": "https://www.example.com/", "pageOptions": { "engine": "playwright" } }'
-H 'Authorization: Bearer YOUR_API_KEY'
-d '{ "url": "https://www.example.com/", "pageOptions": { "engine": "playwright" } }'
curl -X POST 'http://localhost:3002/v2/scrape'
-H 'Authorization: Bearer YOUR_API_KEY'
-d '{ "url": "https://www.example.com/", "pageOptions": { "engine": "playwright" } }'
-H 'Authorization: Bearer YOUR_API_KEY'
-d '{ "url": "https://www.example.com/", "pageOptions": { "engine": "playwright" } }'
Error: "All scraping engines failed!"
错误:"All scraping engines failed!"
Workaround: Use cloud service instead
解决方案:改用云服务
Cloud service has better anti-fingerprinting
云服务具备更完善的反指纹识别能力
**Note**: This affects self-hosted v2.3.0+ with default docker-compose setup. Warning present: "⚠️ WARNING: No proxy server provided. Your IP address may be blocked."
---
**注意**:此问题影响默认docker-compose部署的自部署v2.3.0+版本。部署时会显示警告:"⚠️ WARNING: No proxy server provided. Your IP address may be blocked."
---Issue #10: Cache Performance Best Practices (Community-sourced)
问题10:缓存性能最佳实践(社区贡献)
Suboptimal: Not leveraging cache can make requests 500% slower
Source: Fast Scraping Docs | Blog Post
Why It Matters: Default is 2 days in v2+, but many use cases need different strategies
Prevention: Use appropriate cache strategy for your content type
maxAgepython
undefinedFresh data (real-time pricing, stock prices)
实时数据(如实时定价、库存)
doc = app.scrape(url, formats=["markdown"], max_age=0)
doc = app.scrape(url, formats=["markdown"], max_age=0)
10-minute cache (news, blogs)
10分钟缓存(如新闻、博客)
doc = app.scrape(url, formats=["markdown"], max_age=600000) # milliseconds
doc = app.scrape(url, formats=["markdown"], max_age=600000) # 毫秒
Use default cache (2 days) for static content
静态内容使用默认缓存(2天)
doc = app.scrape(url, formats=["markdown"]) # maxAge defaults to 172800000
doc = app.scrape(url, formats=["markdown"]) # maxAge默认值为172800000
Don't store in cache (one-time scrape)
单次抓取,不存储到缓存
doc = app.scrape(url, formats=["markdown"], store_in_cache=False)
doc = app.scrape(url, formats=["markdown"], store_in_cache=False)
Require minimum age before re-scraping (v2.7.0+)
重新抓取前需满足最小间隔(v2.7.0+)
doc = app.scrape(url, formats=["markdown"], min_age=3600000) # 1 hour minimum
**Performance Impact**:
- Cached response: Milliseconds
- Fresh scrape: Seconds
- Speed difference: **Up to 500%**
---doc = app.scrape(url, formats=["markdown"], min_age=3600000) # 最小间隔1小时
**性能影响**:
- 缓存响应:毫秒级
- 全新抓取:秒级
- 速度差异:**最高500%**
---Package Versions
包版本
| Package | Version | Last Checked |
|---|---|---|
| firecrawl-py | 4.13.0+ | 2026-01-20 |
| @mendable/firecrawl-js | 4.11.1+ | 2026-01-20 |
| API Version | v2 | Current |
| 包 | 版本 | 最后检查时间 |
|---|---|---|
| firecrawl-py | 4.13.0+ | 2026-01-20 |
| @mendable/firecrawl-js | 4.11.1+ | 2026-01-20 |
| API版本 | v2 | 当前版本 |
Official Documentation
官方文档
- Docs: https://docs.firecrawl.dev
- Python SDK: https://docs.firecrawl.dev/sdks/python
- Node.js SDK: https://docs.firecrawl.dev/sdks/node
- API Reference: https://docs.firecrawl.dev/api-reference
- GitHub: https://github.com/mendableai/firecrawl
- Dashboard: https://www.firecrawl.dev/app
Token Savings: ~65% vs manual integration
Error Prevention: 10 documented issues (v2 migration, stealth pricing, job status race, DNS errors, bot detection billing, self-hosted limitations, cache optimization)
Production Ready: Yes
Last verified: 2026-01-21 | Skill version: 2.0.0 | Changes: Added Known Issues Prevention section with 10 documented errors from TIER 1-2 research findings; added v2 migration guidance; documented stealth mode pricing change and unified billing model
- 文档:https://docs.firecrawl.dev
- Python SDK:https://docs.firecrawl.dev/sdks/python
- Node.js SDK:https://docs.firecrawl.dev/sdks/node
- API参考:https://docs.firecrawl.dev/api-reference
- GitHub:https://github.com/mendableai/firecrawl
- 控制台:https://www.firecrawl.dev/app
令牌节省:相比手动集成节省约65%
错误预防:10种已记录的问题(v2版本迁移、隐身模式定价、任务状态竞态、DNS错误、机器人检测计费、自部署限制、缓存优化)
生产就绪:是
最后验证:2026-01-21 | 技能版本:2.0.0 | 变更:新增已知问题预防章节,包含10种来自TIER 1-2研究的已记录错误;新增v2版本迁移指南;文档化隐身模式定价变更与统一计费模型