firecrawl-scraping

Firecrawl Scraping

Overview

Scrape individual web pages and convert them to clean, LLM-ready markdown. Handles JavaScript rendering, anti-bot protection, and dynamic content.

Quick Decision Tree

What are you scraping?
│
├── Single page (article, blog, docs)
│   ├── references/single-page.md
│   └── Script: scripts/firecrawl_scrape.py
│
└── Entire website (multiple pages, crawling)
    ├── references/website-crawler.md
    └── (Use Apify Website Content Crawler for multi-page)

Environment Setup

Required in .env

FIRECRAWL_API_KEY=fc-your-api-key-here


Get your API key: https://firecrawl.dev/app/api-keys
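A script can pick up the key without hardcoding it. The following is a minimal sketch, not the actual contents of scripts/firecrawl_scrape.py: the helper names `parse_env_text` and `get_firecrawl_key` are hypothetical, and the parser only handles plain `KEY=VALUE` lines.

```python
import os
from pathlib import Path


def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_firecrawl_key(env_path: str = ".env") -> str:
    """Prefer the process environment, then fall back to the .env file."""
    key = os.environ.get("FIRECRAWL_API_KEY")
    if not key and Path(env_path).is_file():
        text = Path(env_path).read_text(encoding="utf-8")
        key = parse_env_text(text).get("FIRECRAWL_API_KEY")
    if not key:
        raise RuntimeError("FIRECRAWL_API_KEY is not set")
    return key
```

Raising early when the key is missing keeps the failure local instead of surfacing later as an authentication error from the API.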

Common Usage

Simple Scrape

```bash
python scripts/firecrawl_scrape.py "https://example.com/article"
```

With Options

```bash
python scripts/firecrawl_scrape.py "https://wsj.com/article" \
  --proxy stealth \
  --format markdown summary \
  --timeout 60000
```
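To make the options concrete, here is a rough sketch of how they might translate into a request body, assuming the script wraps Firecrawl's HTTP scrape endpoint. The field names (`formats`, `proxy`, `timeout`) reflect my understanding of that API and should be checked against the current Firecrawl documentation; `build_scrape_payload` and its 30000 ms default are hypothetical and not taken from the script.

```python
import json

ALLOWED_PROXIES = {"basic", "stealth", "auto"}


def build_scrape_payload(url, formats=("markdown",), proxy="auto", timeout_ms=30000):
    """Assemble a JSON-serializable request body (field names are assumed)."""
    if proxy not in ALLOWED_PROXIES:
        raise ValueError(f"unknown proxy mode: {proxy!r}")
    if timeout_ms <= 0:
        raise ValueError("timeout must be a positive number of milliseconds")
    return {
        "url": url,
        "formats": list(formats),
        "proxy": proxy,
        "timeout": timeout_ms,
    }


payload = build_scrape_payload(
    "https://wsj.com/article",
    formats=("markdown", "summary"),
    proxy="stealth",
    timeout_ms=60000,
)
print(json.dumps(payload, indent=2))
```

Validating the proxy mode locally catches typos before a request consumes credits.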

Proxy Modes

| Mode | Use Case |
| --- | --- |
| `basic` | Standard sites, fastest |
| `stealth` | Anti-bot protection, premium content (WSJ, NYT) |
| `auto` | Let Firecrawl decide (recommended) |

Output Formats

`markdown` - Clean markdown content (default)
`html` - Raw HTML
`summary` - AI-generated summary
`screenshot` - Page screenshot
`links` - All links on page

Cost

~1 credit per page. Stealth proxy may use additional credits.
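A back-of-the-envelope estimate can help when budgeting a batch. This sketch takes the roughly 1 credit per page figure from above; the stealth surcharge is not specified here, so it is a required argument, and the value 4 in the example is purely illustrative.

```python
def estimate_credits(pages: int, stealth_pages: int, stealth_extra_per_page: int) -> int:
    """Estimate credits: 1 per page plus an assumed surcharge per stealth page."""
    if stealth_pages > pages:
        raise ValueError("stealth_pages cannot exceed pages")
    return pages + stealth_pages * stealth_extra_per_page


# 100 pages, 20 of them via stealth, with a made-up surcharge of 4 credits each.
print(estimate_credits(100, 20, 4))  # 100 + 20 * 4 = 180
```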

Security Notes

Credential Handling

Store `FIRECRAWL_API_KEY` in a `.env` file (never commit it to git)
API keys can be regenerated at https://firecrawl.dev/app/api-keys
Never log or print API keys in script output
Use environment variables, not hardcoded values
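If a script needs to confirm which key it loaded, one option is to log a masked form rather than the key itself. A small sketch; `mask_secret` is a hypothetical helper, not part of the script.

```python
def mask_secret(secret: str, visible: int = 4) -> str:
    """Replace all but the last `visible` characters with asterisks."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]


# Only the final four characters of the placeholder key remain readable.
print(mask_secret("fc-your-api-key-here"))
```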

Data Privacy

Only scrapes publicly accessible web pages
Scraped content is processed by Firecrawl servers temporarily
Markdown output is stored locally in the `.tmp/` directory
Screenshots (if requested) are stored locally
No persistent data retention by Firecrawl after request
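As an illustration of local storage, the sketch below writes scraped markdown under `.tmp/` using a filename derived from the URL. The naming scheme (URL slug plus a short hash) is an assumption made for this example; the actual script may name files differently.

```python
import hashlib
import re
from pathlib import Path
from urllib.parse import urlparse


def output_path_for(url: str, out_dir: str = ".tmp") -> Path:
    """Derive a filesystem-safe, collision-resistant filename from a URL."""
    parsed = urlparse(url)
    slug = re.sub(r"[^a-zA-Z0-9]+", "-", f"{parsed.netloc}{parsed.path}")
    slug = slug.strip("-").lower()[:80]
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:8]
    return Path(out_dir) / f"{slug}-{digest}.md"


def save_markdown(url: str, markdown: str, out_dir: str = ".tmp") -> Path:
    """Write the markdown to disk, creating the output directory if needed."""
    path = output_path_for(url, out_dir)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(markdown, encoding="utf-8")
    return path
```

Because `.tmp/` can hold scraped content that includes personal data, it is worth keeping the directory out of version control alongside `.env`.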

Access Scopes

API key provides full access to scraping features
No granular permission scopes available
Monitor usage via Firecrawl dashboard

Compliance Considerations

Robots.txt: Firecrawl respects robots.txt by default
Public Content Only: Only scrape publicly accessible pages
Terms of Service: Respect target site ToS
Rate Limiting: Built-in rate limiting prevents abuse
Stealth Proxy: Use stealth mode only when necessary (paywalled news, not auth bypass)
GDPR: Scraped content may contain PII - handle accordingly
Copyright: Respect intellectual property rights of scraped content
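For a local pre-check that does not depend on Firecrawl's own robots.txt handling, Python's standard library can evaluate rules directly. This is a sketch of that idea only; `is_allowed` is a hypothetical helper, and the rules shown are an invented example rather than any real site's policy.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Check a URL against robots.txt rules supplied as text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


rules = "User-agent: *\nDisallow: /private/\n"
print(is_allowed(rules, "https://example.com/article"))
print(is_allowed(rules, "https://example.com/private/report"))
```

Parsing from text keeps the check testable offline; in practice the rules would first be fetched from the target site's `/robots.txt`.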

Troubleshooting

Common Issues

Issue: Credits exhausted

Symptoms: API returns "insufficient credits" or quota exceeded error
Cause: Account credits depleted
Solution:

Check credit balance at https://firecrawl.dev/app
Upgrade plan or purchase additional credits
Reduce scraping frequency
Use `basic` proxy mode to conserve credits

Issue: Page not rendering correctly

Symptoms: Empty content or partial HTML returned
Cause: JavaScript-heavy page not fully loading
Solution:

Enable JavaScript rendering with the `--js-render` flag
Increase timeout with `--timeout 60000` (60 seconds)
Try `stealth` proxy mode for protected sites
Wait for specific elements with a `--wait-for` selector

Issue: 403 Forbidden error

Symptoms: Script returns 403 status code
Cause: Site blocking automated access
Solution:

Enable `stealth` proxy mode
Add delay between requests
Try at different times (some sites rate limit by time)
Check if site requires login (not supported)
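The first two steps can be combined into a simple escalation loop: try a cheaper proxy mode, pause, then retry with `stealth` only if the site answers 403. This is a sketch under assumptions: `scrape` is a hypothetical callable that takes `(url, proxy_mode)` and returns `(status_code, body)`, standing in for whatever function actually performs the request.

```python
import time
from typing import Callable


def scrape_with_escalation(
    scrape: Callable[[str, str], tuple[int, str]],
    url: str,
    modes: tuple[str, ...] = ("basic", "stealth"),
    delay_s: float = 2.0,
) -> tuple[str, int, str]:
    """Try each proxy mode in order, stopping at the first non-403 response."""
    for attempt, mode in enumerate(modes):
        if attempt > 0:
            time.sleep(delay_s)  # add delay between requests
        status, body = scrape(url, mode)
        if status != 403:
            return mode, status, body
    raise PermissionError(f"{url} returned 403 for all modes: {list(modes)}")
```

Starting with `basic` keeps credit usage low for sites that do not block automated access at all.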

Issue: Empty markdown output

Symptoms: Scrape succeeds but markdown is empty or malformed
Cause: Dynamic content loaded after page load, or unusual page structure
Solution:

Increase wait time for JavaScript to execute
Use `--wait-for` to wait for specific content
Try `html` format to see raw content
Check if content is in an iframe (not always supported)

Issue: Timeout errors

Symptoms: Request times out before completion
Cause: Slow page load or large page content
Solution:

Increase timeout value (up to 120000ms)
Use `basic` proxy for faster response
Target specific page sections if possible
Check if site is experiencing issues
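When retrying after a timeout, the value can be raised step by step instead of jumping straight to the maximum. A minimal sketch, assuming a doubling schedule capped at the 120000 ms limit mentioned above; the 30000 ms starting point and the helper name `timeout_schedule` are illustrative choices.

```python
MAX_TIMEOUT_MS = 120_000  # upper bound stated in the troubleshooting steps


def timeout_schedule(start_ms: int = 30_000, factor: float = 2.0, attempts: int = 3) -> list[int]:
    """Return one timeout per attempt, growing by `factor` and capped at the maximum."""
    timeouts = []
    current = float(start_ms)
    for _ in range(attempts):
        timeouts.append(int(min(current, MAX_TIMEOUT_MS)))
        current *= factor
    return timeouts


print(timeout_schedule())  # [30000, 60000, 120000]
```

Each value would be passed as `--timeout` on the corresponding retry.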

Resources

references/single-page.md - Single page scraping details
references/website-crawler.md - Multi-page website crawling

Integration Patterns

Scrape and Analyze

Skills: firecrawl-scraping → parallel-research
Use case: Scrape competitor pages, then analyze content strategy
Flow:

Scrape competitor website pages with Firecrawl
Convert to clean markdown
Use parallel-research to analyze positioning, messaging, features

Scrape and Document

Skills: firecrawl-scraping → content-generation
Use case: Create summary documents from web research
Flow:

Scrape multiple article pages on a topic
Combine markdown content
Generate summary document via content-generation
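The "combine markdown content" step in this flow might look like the sketch below, assuming each scraped article was saved as its own `.md` file. The separator and the source comment are arbitrary choices for illustration; `combine_markdown` is a hypothetical helper, not part of either skill.

```python
from pathlib import Path


def combine_markdown(paths, separator: str = "\n\n---\n\n") -> str:
    """Join non-empty markdown files into one document, tagging each source."""
    sections = []
    for path in paths:
        path = Path(path)
        text = path.read_text(encoding="utf-8").strip()
        if text:  # skip empty scrapes so they do not add blank sections
            sections.append(f"<!-- source: {path.name} -->\n{text}")
    return separator.join(sections) + "\n"
```

A typical call would be `combine_markdown(sorted(Path(".tmp").glob("*.md")))`, with the result handed to content-generation.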

Scrape and Enrich CRM

Skills: firecrawl-scraping → attio-crm
Use case: Enrich company records with website data
Flow:

Scrape company website (about page, team page, product pages)
Extract key information (funding, team size, products)
Update company record in Attio CRM with enriched data
