firecrawl-scraping
Firecrawl Scraping
Overview
Scrape individual web pages and convert them to clean, LLM-ready markdown. Handles JavaScript rendering, anti-bot protection, and dynamic content.
Quick Decision Tree
What are you scraping?
│
├── Single page (article, blog, docs)
│   └── references/single-page.md
│       └── Script: scripts/firecrawl_scrape.py
│
└── Entire website (multiple pages, crawling)
    └── references/website-crawler.md
        └── (Use Apify Website Content Crawler for multi-page)
Environment Setup
bash
# Required in .env
FIRECRAWL_API_KEY=fc-your-api-key-here
Get your API key: https://firecrawl.dev/app/api-keys
Common Usage
Simple Scrape
bash
python scripts/firecrawl_scrape.py "https://example.com/article"
With Options
bash
python scripts/firecrawl_scrape.py "https://wsj.com/article" \
--proxy stealth \
--format markdown summary \
--timeout 60000
Proxy Modes
| Mode | Use Case |
|---|---|
| basic | Standard sites, fastest |
| stealth | Anti-bot protection, premium content (WSJ, NYT) |
| auto | Let Firecrawl decide (recommended) |
Output Formats
- markdown - Clean markdown content (default)
- html - Raw HTML
- summary - AI-generated summary
- screenshot - Page screenshot
- links - All links on page
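The options and formats above can be assembled into a command line programmatically. The following is a minimal sketch, assuming the script accepts exactly the `--proxy`, `--format`, and `--timeout` flags shown in the usage examples; the helper name `build_scrape_command` is hypothetical, and the mode name `auto` for "let Firecrawl decide" is an assumption.

```python
# Sketch: build the argument list for scripts/firecrawl_scrape.py.
# Flag names come from the usage examples; everything else is illustrative.
PROXY_MODES = {"basic", "stealth", "auto"}  # "auto" is an assumed name
OUTPUT_FORMATS = {"markdown", "html", "summary", "screenshot", "links"}


def build_scrape_command(url, proxy=None, formats=None, timeout_ms=None):
    """Return an argv list; optional flags are omitted when not set."""
    cmd = ["python", "scripts/firecrawl_scrape.py", url]
    if proxy is not None:
        if proxy not in PROXY_MODES:
            raise ValueError(f"unknown proxy mode: {proxy!r}")
        cmd += ["--proxy", proxy]
    if formats:
        unknown = [f for f in formats if f not in OUTPUT_FORMATS]
        if unknown:
            raise ValueError(f"unknown output formats: {unknown}")
        cmd += ["--format", *formats]
    if timeout_ms is not None:
        cmd += ["--timeout", str(timeout_ms)]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Validating the mode and format names up front catches typos before a request is sent.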
Cost
~1 credit per page. Stealth proxy may use additional credits.
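For budgeting, the cost rule can be turned into a rough estimate. This is only a sketch: the base rate of 1 credit per page comes from the text, but the stealth surcharge is not stated here, so `stealth_extra_per_page` is a hypothetical parameter the caller must supply from their own plan details.

```python
# Sketch: rough credit estimate. The stealth surcharge is NOT documented
# in this guide; the caller supplies it as an explicit assumption.
def estimate_credits(pages, stealth_pages, stealth_extra_per_page):
    """Estimate credits as 1 per page plus a surcharge per stealth page."""
    if stealth_pages > pages:
        raise ValueError("stealth_pages cannot exceed pages")
    return pages + stealth_pages * stealth_extra_per_page
```

For example, with a made-up surcharge of 4 credits, `estimate_credits(100, 10, 4)` gives 140.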
Security Notes
Credential Handling
- Store FIRECRAWL_API_KEY in a .env file (never commit it to git)
- API keys can be regenerated at https://firecrawl.dev/app/api-keys
- Never log or print API keys in script output
- Use environment variables, not hardcoded values
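The credential rules above might look like this in a script. It is a minimal sketch, assuming `FIRECRAWL_API_KEY` has already been exported into the process environment (for example by whatever loads `.env`); the helper names `load_api_key` and `mask_key` are hypothetical.

```python
import os


def load_api_key(env=None):
    """Read the key from the environment; never hardcode it."""
    source = os.environ if env is None else env
    key = source.get("FIRECRAWL_API_KEY", "").strip()
    if not key:
        raise RuntimeError("FIRECRAWL_API_KEY is not set")
    return key


def mask_key(key):
    """Return a redacted form that is safe to include in log output."""
    return key[:3] + "***"
```

Logging `mask_key(key)` instead of the key itself keeps the secret out of script output while still showing that a key was found.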
Data Privacy
- Only scrapes publicly accessible web pages
- Scraped content is processed by Firecrawl servers temporarily
- Markdown output is stored locally in the .tmp/ directory
- Screenshots (if requested) are stored locally
- No persistent data retention by Firecrawl after request
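Storing output under `.tmp/` could be done along these lines. This is a sketch only: the `.tmp/` directory comes from the text, but the file naming scheme (URL slug plus a short hash) is an assumption, not necessarily what `scripts/firecrawl_scrape.py` does.

```python
import hashlib
import re
from pathlib import Path
from urllib.parse import urlparse


def output_path_for(url, base_dir=".tmp"):
    """Derive a filesystem-safe .md path from a URL (hypothetical scheme)."""
    parsed = urlparse(url)
    slug = re.sub(r"[^A-Za-z0-9]+", "-", parsed.netloc + parsed.path)
    slug = slug.strip("-").lower() or "page"
    # A short hash keeps URLs that differ only in query string distinct.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:8]
    return Path(base_dir) / f"{slug[:60]}-{digest}.md"


def save_markdown(url, markdown, base_dir=".tmp"):
    """Write scraped markdown to disk and return the path written."""
    path = output_path_for(url, base_dir)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(markdown, encoding="utf-8")
    return path
```

Adding `.tmp/` to `.gitignore` keeps scraped content, which may contain PII, out of version control.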
Access Scopes
- API key provides full access to scraping features
- No granular permission scopes available
- Monitor usage via Firecrawl dashboard
Compliance Considerations
- Robots.txt: Firecrawl respects robots.txt by default
- Public Content Only: Only scrape publicly accessible pages
- Terms of Service: Respect target site ToS
- Rate Limiting: Built-in rate limiting prevents abuse
- Stealth Proxy: Use stealth mode only when necessary (paywalled news, not auth bypass)
- GDPR: Scraped content may contain PII - handle accordingly
- Copyright: Respect intellectual property rights of scraped content
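Although Firecrawl is described above as respecting robots.txt by default, a local pre-check can avoid spending a request on a disallowed URL. The sketch below uses Python's standard `urllib.robotparser` on robots.txt text you have already fetched; the helper name `is_allowed` is hypothetical, and this is not a description of how Firecrawl itself evaluates the rules.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="*"):
    """Check a URL against robots.txt content supplied as a string."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Passing the robots.txt body in as a string keeps the check free of network access, so it is easy to test and to cache per site.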
Troubleshooting
Common Issues
Issue: Credits exhausted
Symptoms: API returns "insufficient credits" or quota exceeded error
Cause: Account credits depleted
Solution:
- Check credit balance at https://firecrawl.dev/app
- Upgrade plan or purchase additional credits
- Reduce scraping frequency
- Use basic proxy mode to conserve credits
Issue: Page not rendering correctly
Symptoms: Empty content or partial HTML returned
Cause: JavaScript-heavy page not fully loading
Solution:
- Enable JavaScript rendering with the --js-render flag
- Increase timeout with --timeout 60000 (60 seconds)
- Try stealth proxy mode for protected sites
- Wait for specific elements with the --wait-for selector
Issue: 403 Forbidden error
Symptoms: Script returns 403 status code
Cause: Site blocking automated access
Solution:
- Enable stealth proxy mode
- Add delay between requests
- Try at different times (some sites rate limit by time)
- Check if site requires login (not supported)
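The first two remedies above can be combined into a simple escalation: try the cheaper basic mode, and only on a 403 wait and retry with stealth. This is a sketch under stated assumptions: `scrape` is a hypothetical callable you provide that takes `(url, proxy=...)` and returns a `(status_code, body)` tuple; it is not part of any Firecrawl SDK.

```python
import time


def scrape_with_escalation(scrape, url, delay_s=2.0, sleep=time.sleep):
    """Try basic first; on 403, pause and retry once with stealth.

    Returns (status_code, body, proxy_mode_used).
    """
    status, body = scrape(url, proxy="basic")
    if status != 403:
        return status, body, "basic"
    sleep(delay_s)  # add delay between requests before retrying
    status, body = scrape(url, proxy="stealth")
    return status, body, "stealth"
```

Starting with basic keeps stealth usage, and any extra credits it may cost, limited to pages that actually need it.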
Issue: Empty markdown output
Symptoms: Scrape succeeds but markdown is empty or malformed
Cause: Dynamic content loaded after page load, or unusual page structure
Solution:
- Increase wait time for JavaScript to execute
- Use --wait-for to wait for specific content
- Try html format to see raw content
- Check if content is in an iframe (not always supported)
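Falling back from empty markdown to raw HTML can be automated when both formats were requested. The sketch below assumes the scrape result is available as a dictionary keyed by format name (`"markdown"`, `"html"`); that shape and the helper name `pick_content` are assumptions for illustration.

```python
def pick_content(result):
    """Return (format_name, content), preferring non-empty markdown."""
    markdown = (result.get("markdown") or "").strip()
    if markdown:
        return "markdown", markdown
    html = (result.get("html") or "").strip()
    if html:
        # Markdown conversion produced nothing; inspect the raw HTML instead.
        return "html", html
    return None, ""
```

A `(None, "")` result suggests the page itself returned no content, which points toward the wait-time and `--wait-for` remedies rather than a conversion problem.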
Issue: Timeout errors
Symptoms: Request times out before completion
Cause: Slow page load or large page content
Solution:
- Increase timeout value (up to 120000ms)
- Use basic proxy for faster response
- Target specific page sections if possible
- Check if site is experiencing issues
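Rather than jumping straight to the maximum, the timeout can be raised step by step on each retry. The following sketch generates such a schedule; the 120000 ms cap comes from the text, while the 30000 ms starting value and the doubling factor are illustrative assumptions.

```python
def timeout_schedule(start_ms=30_000, cap_ms=120_000, factor=2):
    """Return increasing timeouts in ms, ending at the cap."""
    if start_ms <= 0 or factor <= 1:
        raise ValueError("start_ms must be positive and factor greater than 1")
    timeouts = []
    current = start_ms
    while current < cap_ms:
        timeouts.append(current)
        current *= factor
    timeouts.append(cap_ms)  # never exceed the documented maximum
    return timeouts
```

With the defaults this yields 30000, 60000, and 120000 ms, so a retry loop can pass each value to `--timeout` in turn and stop at the first success.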
Resources
- references/single-page.md - Single page scraping details
- references/website-crawler.md - Multi-page website crawling
Integration Patterns
Scrape and Analyze
Skills: firecrawl-scraping → parallel-research
Use case: Scrape competitor pages, then analyze content strategy
Flow:
- Scrape competitor website pages with Firecrawl
- Convert to clean markdown
- Use parallel-research to analyze positioning, messaging, features
Scrape and Document
Skills: firecrawl-scraping → content-generation
Use case: Create summary documents from web research
Flow:
- Scrape multiple article pages on a topic
- Combine markdown content
- Generate summary document via content-generation
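The "combine markdown content" step in the flow above could be as simple as concatenating the scraped pages with a note of where each came from. This is a minimal sketch; the function name `combine_markdown`, the HTML-comment source marker, and the `---` separator are all illustrative choices, not part of the content-generation skill.

```python
def combine_markdown(pages):
    """Join (url, markdown) pairs into one document, skipping empty pages."""
    sections = []
    for url, markdown in pages:
        body = markdown.strip()
        if not body:
            continue  # an empty scrape adds nothing to the summary input
        sections.append(f"<!-- source: {url} -->\n{body}")
    if not sections:
        return ""
    return "\n\n---\n\n".join(sections) + "\n"
```

Keeping the source URL next to each section lets the downstream summary attribute claims to the page they came from.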
Scrape and Enrich CRM
Skills: firecrawl-scraping → attio-crm
Use case: Enrich company records with website data
Flow:
- Scrape company website (about page, team page, product pages)
- Extract key information (funding, team size, products)
- Update company record in Attio CRM with enriched data