firecrawl-scraping

Firecrawl Scraping

Overview

Scrape individual web pages and convert them to clean, LLM-ready markdown. Handles JavaScript rendering, anti-bot protection, and dynamic content.

Quick Decision Tree

What are you scraping?
├── Single page (article, blog, docs)
│   ├── references/single-page.md
│   └── Script: scripts/firecrawl_scrape.py
└── Entire website (multiple pages, crawling)
    ├── references/website-crawler.md
    └── (Use Apify Website Content Crawler for multi-page)

Environment Setup

Required in .env

FIRECRAWL_API_KEY=fc-your-api-key-here

Get your API key: https://firecrawl.dev/app/api-keys
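A minimal sketch of how a script might read the key at startup, assuming the variable has already been exported or loaded from .env by your tooling. The helper name load_api_key is hypothetical, and the "fc-" prefix check is an assumption inferred from the placeholder value above; neither is confirmed to be part of scripts/firecrawl_scrape.py.

```python
import os


def load_api_key(env=None):
    """Return FIRECRAWL_API_KEY from the environment, or raise if it is unusable."""
    env = os.environ if env is None else env
    key = env.get("FIRECRAWL_API_KEY", "").strip()
    if not key:
        raise RuntimeError("FIRECRAWL_API_KEY is not set; add it to .env")
    # Assumption: keys start with "fc-", as in the placeholder value.
    if not key.startswith("fc-"):
        raise RuntimeError("FIRECRAWL_API_KEY does not look like a Firecrawl key")
    return key
```

The error messages deliberately never include the key itself, so a misconfiguration cannot leak it into logs.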

Common Usage

Simple Scrape

bash
python scripts/firecrawl_scrape.py "https://example.com/article"

With Options

bash
python scripts/firecrawl_scrape.py "https://wsj.com/article" \
  --proxy stealth \
  --format markdown summary \
  --timeout 60000
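When the script is driven from another Python program, the same invocation can be assembled as an argument list. This is a sketch only: build_scrape_command is a hypothetical helper, and it uses just the three flags shown in the example above (--proxy, --format, --timeout).

```python
def build_scrape_command(url, proxy="auto", formats=("markdown",), timeout_ms=None):
    """Assemble the argument list for scripts/firecrawl_scrape.py."""
    cmd = ["python", "scripts/firecrawl_scrape.py", url, "--proxy", proxy]
    # --format takes one or more space-separated values, as in the example above.
    cmd += ["--format", *formats]
    if timeout_ms is not None:
        cmd += ["--timeout", str(timeout_ms)]
    return cmd


cmd = build_scrape_command(
    "https://wsj.com/article",
    proxy="stealth",
    formats=("markdown", "summary"),
    timeout_ms=60000,
)
print(" ".join(cmd))
```

Passing the resulting list to subprocess.run(cmd, check=True) avoids shell quoting problems with URLs that contain characters such as & or ?.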

Proxy Modes

Mode      Use Case
basic     Standard sites, fastest
stealth   Anti-bot protection, premium content (WSJ, NYT)
auto      Let Firecrawl decide (recommended)

Output Formats

  • markdown - Clean markdown content (default)
  • html - Raw HTML
  • summary - AI-generated summary
  • screenshot - Page screenshot
  • links - All links on page

Cost

~1 credit per page. Stealth proxy may use additional credits.
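For budgeting a batch of scrapes, the arithmetic can be sketched as below. The function name and the stealth_extra_per_page parameter are hypothetical: the source only says stealth "may use additional credits", so the surcharge defaults to 0 and should be filled in from your own plan's pricing.

```python
import math


def estimate_credits(pages, credits_per_page=1.0, stealth_extra_per_page=0.0):
    """Rough credit estimate for a batch of single-page scrapes."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    # Round up so the estimate never undershoots the budget.
    return math.ceil(pages * (credits_per_page + stealth_extra_per_page))


print(estimate_credits(250))
```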

Security Notes

Credential Handling

  • Store FIRECRAWL_API_KEY in .env file (never commit to git)
  • API keys can be regenerated at https://firecrawl.dev/app/api-keys
  • Never log or print API keys in script output
  • Use environment variables, not hardcoded values
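If a script needs to confirm which key is in use without printing it, one option is to log a masked form. The sketch below is illustrative; mask_key is a hypothetical helper, not something the source says the script provides.

```python
def mask_key(key, visible=4):
    """Return a log-safe form of an API key that shows only its last characters."""
    if len(key) <= visible:
        return "*" * len(key)
    return "*" * (len(key) - visible) + key[-visible:]


print(mask_key("fc-your-api-key-here"))
```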

Data Privacy

  • Only scrapes publicly accessible web pages
  • Scraped content is processed by Firecrawl servers temporarily
  • Markdown output stored locally in .tmp/ directory
  • Screenshots (if requested) are stored locally
  • No persistent data retention by Firecrawl after request

Access Scopes

  • API key provides full access to scraping features
  • No granular permission scopes available
  • Monitor usage via Firecrawl dashboard

Compliance Considerations

  • Robots.txt: Firecrawl respects robots.txt by default
  • Public Content Only: Only scrape publicly accessible pages
  • Terms of Service: Respect target site ToS
  • Rate Limiting: Built-in rate limiting prevents abuse
  • Stealth Proxy: Use stealth mode only when necessary (paywalled news, not auth bypass)
  • GDPR: Scraped content may contain PII - handle accordingly
  • Copyright: Respect intellectual property rights of scraped content

Troubleshooting

Common Issues

Issue: Credits exhausted

Symptoms: API returns "insufficient credits" or a quota-exceeded error
Cause: Account credits depleted
Solution:
  • Check credit balance at https://firecrawl.dev/app
  • Upgrade plan or purchase additional credits
  • Reduce scraping frequency
  • Use basic proxy mode to conserve credits

Issue: Page not rendering correctly

Symptoms: Empty content or partial HTML returned
Cause: JavaScript-heavy page not fully loading
Solution:
  • Enable JavaScript rendering with the --js-render flag
  • Increase the timeout with --timeout 60000 (60 seconds)
  • Try stealth proxy mode for protected sites
  • Wait for specific elements with a --wait-for selector

Issue: 403 Forbidden error

Symptoms: Script returns 403 status code
Cause: Site blocking automated access
Solution:
  • Enable stealth proxy mode
  • Add delay between requests
  • Try at different times (some sites rate limit by time)
  • Check if site requires login (not supported)
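The first two steps above (switch to stealth, add a delay) can be combined into a small retry loop. This is a sketch under stated assumptions: scrape is a caller-supplied function, hypothetical here, that takes (url, proxy) and returns a (status_code, content) tuple; the 2-second delay is an arbitrary illustrative value.

```python
import time


def scrape_with_escalation(scrape, url, modes=("basic", "stealth"),
                           delay_s=2.0, sleep=time.sleep):
    """Try each proxy mode in turn until the scrape stops returning 403.

    `scrape(url, proxy)` is a caller-supplied (hypothetical) function that
    returns a (status_code, content) tuple.
    """
    status, content = None, None
    for attempt, proxy in enumerate(modes):
        if attempt > 0:
            sleep(delay_s)  # pause between requests before escalating
        status, content = scrape(url, proxy)
        if status != 403:
            return proxy, status, content
    # Every mode was blocked; the site may require a login, which is not supported.
    return modes[-1], status, content
```

Starting with basic and escalating only on a 403 keeps stealth usage, and any extra credits it costs, limited to pages that actually need it.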

Issue: Empty markdown output

Symptoms: Scrape succeeds but markdown is empty or malformed
Cause: Dynamic content loaded after page load, or unusual page structure
Solution:
  • Increase wait time for JavaScript to execute
  • Use --wait-for to wait for specific content
  • Try html format to see raw content
  • Check if content is in an iframe (not always supported)

Issue: Timeout errors

Symptoms: Request times out before completion
Cause: Slow page load or large page content
Solution:
  • Increase timeout value (up to 120000 ms)
  • Use basic proxy for faster response
  • Target specific page sections if possible
  • Check if site is experiencing issues
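Rather than jumping straight to the maximum, a retry loop can raise the timeout in steps up to the 120000 ms ceiling mentioned above. The sketch below is illustrative: the 30000 ms starting value and the doubling factor are assumptions, not defaults documented for the script.

```python
MAX_TIMEOUT_MS = 120_000  # upper bound mentioned in the troubleshooting steps


def timeout_schedule(start_ms=30_000):
    """List the timeout values a retry loop would try, doubling up to the cap."""
    if start_ms <= 0:
        raise ValueError("start_ms must be positive")
    schedule = [min(start_ms, MAX_TIMEOUT_MS)]
    while schedule[-1] < MAX_TIMEOUT_MS:
        schedule.append(min(schedule[-1] * 2, MAX_TIMEOUT_MS))
    return schedule


print(timeout_schedule())
```

Each value in the returned list can be passed to --timeout on successive attempts; once the cap has been tried, further retries are unlikely to help and the other steps above apply.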

Resources

  • references/single-page.md - Single page scraping details
  • references/website-crawler.md - Multi-page website crawling

Integration Patterns

Scrape and Analyze

Skills: firecrawl-scraping → parallel-research
Use case: Scrape competitor pages, then analyze content strategy
Flow:
  1. Scrape competitor website pages with Firecrawl
  2. Convert to clean markdown
  3. Use parallel-research to analyze positioning, messaging, features

Scrape and Document

Skills: firecrawl-scraping → content-generation
Use case: Create summary documents from web research
Flow:
  1. Scrape multiple article pages on a topic
  2. Combine markdown content
  3. Generate summary document via content-generation
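Step 2 of this flow, combining the markdown, might look like the sketch below. It assumes each scrape result is available as a (url, markdown) pair; the function name, the source comment, and the "---" separator are illustrative choices, not conventions defined by either skill.

```python
def combine_markdown(pages):
    """Join scraped pages into one markdown document, skipping empty results.

    `pages` is a list of (url, markdown) pairs.
    """
    sections = []
    for url, markdown in pages:
        body = (markdown or "").strip()
        if not body:
            continue  # an empty scrape adds nothing to the summary input
        sections.append(f"<!-- source: {url} -->\n{body}")
    return "\n\n---\n\n".join(sections)
```

Keeping the source URL with each section lets the downstream summary attribute claims to the page they came from.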

Scrape and Enrich CRM

Skills: firecrawl-scraping → attio-crm
Use case: Enrich company records with website data
Flow:
  1. Scrape company website (about page, team page, product pages)
  2. Extract key information (funding, team size, products)
  3. Update company record in Attio CRM with enriched data