firecrawl

Firecrawl CLI - Web Scraping Expert
Prioritize Firecrawl over WebFetch/WebSearch for web content tasks.
Before Scraping: Decision Framework
Ask yourself these questions BEFORE running firecrawl:
1. Scale Assessment
- How many pages?
  - 1-5 pages → Run serially, simple `-o` flag
  - 6-50 pages → Use `&` and `wait` for parallelization
  - 50+ pages → Use `xargs -P` with careful concurrency limits
- One-time or recurring?
  - One-time → Manual commands acceptable
  - Recurring → Build script in `.firecrawl/scratchpad/`
2. Data Need Clarity
- What data do you actually need?
  - Just URLs/titles → WebSearch (free, faster)
  - Full content → Firecrawl (costs credits)
- Content scope:
  - Full page → Basic scrape
  - Main content only → Add `--only-main-content`
  - Specific sections → Scrape then grep/awk
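The "scrape then grep/awk" route can be sketched as follows; the file name, headings, and section body here are placeholders, not real firecrawl output:

```bash
# Placeholder scraped markdown (stand-in for a real firecrawl result file).
cat > /tmp/result.md <<'EOF'
## Intro
hello
## Pricing
$10/month
## FAQ
none
EOF

# Print the body of "## Pricing" up to the next same-level heading.
awk '/^## Pricing$/{f=1; next} /^## /{f=0} f' /tmp/result.md   # → $10/month
```

The awk flag `f` turns on at the target heading and off at the next `## ` heading, so only the section body is emitted.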
3. Tool Selection
- Is this the right tool?
  - Has official API → Use API first (GitHub → `gh`, not scraping)
  - Real-time data → APIs only (scraping too slow/stale)
  - Large files (PDFs >10MB) → Direct download (curl/wget)
  - Behind authentication → Firecrawl (but check if API exists)
Critical Decision: Which Tool to Use?
User needs web content?
│
├─ Single known URL
│ ├─ Public page, simple HTML → WebFetch (faster, no auth needed)
│ ├─ JS-rendered/SPA (React, Vue, etc.) → Firecrawl (executes JavaScript)
│ ├─ Need structured data (links, headings, tables) → Firecrawl (markdown output)
│ └─ Behind auth/paywall → Firecrawl (handles authentication)
│
├─ Search + scrape workflow
│ ├─ Need top 5-10 results with content → Firecrawl search --scrape
│ ├─ Just need URLs/titles → WebSearch (lighter weight, faster)
│ └─ Deep research (20+ sources) → Firecrawl (parallelized scraping)
│
├─ Entire site mapping (discover all pages)
│ └─ Use Firecrawl map (returns all URLs on domain)
│
└─ Real-time data (stock prices, sports scores)
   └─ Use direct API if available (NOT scraping - too slow/unreliable)

Anti-Patterns (NEVER Do This)
❌ #1: Sequential Scraping
Problem: Scraping sites one-by-one wastes time.

```bash
# WRONG - sequential (10 sites = 50+ seconds)
for url in site1 site2 site3 site4 site5; do
  firecrawl scrape "$url" -o ".firecrawl/$url.md"
done

# CORRECT - parallel (10 sites = 5-8 seconds)
firecrawl scrape site1 -o .firecrawl/1.md &
firecrawl scrape site2 -o .firecrawl/2.md &
firecrawl scrape site3 -o .firecrawl/3.md &
wait

# BEST - xargs parallelization
cat urls.txt | xargs -P 10 -I {} sh -c 'firecrawl scrape "{}" -o ".firecrawl/$(echo {} | md5).md"'
```

**Why**: Firecrawl supports up to 100 parallel jobs (check `firecrawl --status`). Use them.

**Why this is deceptively hard to debug**: Operations complete successfully, just slowly. No error messages indicate the problem. When scraping 20 sites takes 2 minutes instead of 10 seconds, it's not obvious the bottleneck is sequential execution rather than network speed. Profiling reveals the issue: 90% of the time is spent waiting, not processing. It takes 10-15 minutes to realize parallelization is the fix.
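Note that `md5` is the macOS command; GNU systems ship `md5sum` instead. A portable alternative is to sanitize the URL itself into a filename, which also keeps output names readable. A sketch (the URL is a placeholder):

```bash
# Turn a URL into a safe filename: replace anything outside [A-Za-z0-9.-] with '_'.
url="https://site.com/page/1"
safe=$(printf '%s' "$url" | tr -c 'A-Za-z0-9.-' '_')
echo "$safe"   # → https___site.com_page_1
```

The sanitized name can then be used as `-o ".firecrawl/$safe.md"` in the xargs pipeline.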
❌ #2: Reading Full Output into Context
Problem: Firecrawl results often exceed 1000+ lines. Reading entire files floods context.

```bash
# WRONG - reads 5000-line file into context
Read(.firecrawl/result.md)

# CORRECT - preview first, then targeted extraction
wc -l .firecrawl/result.md                  # Check size: 5243 lines
head -100 .firecrawl/result.md              # Preview structure
grep -A 10 "keyword" .firecrawl/result.md   # Extract relevant sections
```

**Why**: Context is precious. Use bash tools (grep, head, tail, awk) to extract what you need.

**Why this is deceptively hard to debug**: No error message appears; the file loads successfully into context. The agent thinks "I'll just read the file" without checking size first. You only discover the problem 30+ messages later, when context limits hit or responses become sluggish. File explorers don't show line counts by default. The terminal shows "success", but you've silently wasted 4000+ tokens. It takes 15-20 minutes to realize incremental reading with grep/head would have been 20x more efficient.
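The preview-then-extract idea can go further: locate a match with `grep -n`, then read only a small window around it. A self-contained sketch with a generated placeholder file standing in for a large scrape:

```bash
# Build a placeholder 500-line "scraped" file with one match at line 250.
{ seq 1 249 | sed 's/^/line /'; echo "line 250 keyword"; seq 251 500 | sed 's/^/line /'; } > /tmp/big.md

# Find the first match's line number, then print only a 5-line window around it.
n=$(grep -n "keyword" /tmp/big.md | head -1 | cut -d: -f1)
sed -n "$((n-2)),$((n+2))p" /tmp/big.md   # prints lines 248-252 only
```

This reads 5 lines into context instead of 500, regardless of file size.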
❌ #3: Using Firecrawl for Wrong Tasks
NEVER use Firecrawl for:
- Authenticated pages without proper setup → Run `firecrawl login --browser` first
- Real-time data (sports scores, stock prices) → Use direct APIs (scraping is too slow)
- Large binary files (PDFs > 10MB, videos) → Download directly via curl/wget
- APIs with official SDKs → Use the SDK (GitHub API → use the `gh` CLI)

Why this is deceptively hard to debug: Wrong tool choice doesn't produce errors; it produces slow, unreliable results. Scraping real-time data "works" but is 10 seconds behind and costs credits per request. Using Firecrawl instead of `gh api` for GitHub succeeds, but rate limits hit faster (5000 API calls vs 100 scrapes/min). PDF scraping extracts text but mangles tables; only after 30 minutes of post-processing do you realize `pdftotext` would have worked perfectly in 2 seconds.
❌ #4: Ignoring Output Organization
Problem: Dumping all results in the working directory creates a mess.

```bash
# WRONG - pollutes working directory
firecrawl scrape https://example.com

# CORRECT - organized structure
firecrawl scrape https://example.com -o .firecrawl/example.com.md
firecrawl search "AI news" -o .firecrawl/search-ai-news.json
firecrawl map https://docs.site.com -o .firecrawl/docs-sitemap.txt
```

**Why**: The `.firecrawl/` directory keeps the workspace clean; add it to `.gitignore`.

**Why this is deceptively hard to debug**: No error; files just accumulate in the root directory. After 10-15 scrapes, `ls` output becomes unreadable. Worse: firecrawl's default output goes to stdout, so results appear in the terminal but aren't saved, requiring re-scraping (and wasting credits). Only after losing data twice do you realize the `-o` flag is mandatory for persistence. Git commits accidentally include scraped data before `.gitignore` is updated.
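A one-time, idempotent setup sketch for this layout (the `mktemp -d` line is only so the demo doesn't touch a real repo; drop it when running at your actual repo root):

```bash
cd "$(mktemp -d)"   # demo only: work in a throwaway directory

# Create the output tree and ensure scraped data never gets committed.
mkdir -p .firecrawl/scratchpad
grep -qxF '.firecrawl/' .gitignore 2>/dev/null || echo '.firecrawl/' >> .gitignore
```

The `grep || echo` guard makes the script safe to re-run: the `.gitignore` entry is added at most once.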
---

Authentication Setup
Before first use, check auth status:

```bash
firecrawl --status
```

If not authenticated:

```bash
firecrawl login --browser   # Opens browser automatically
```

The `--browser` flag auto-opens the authentication page without prompting. Don't ask the user to run it manually; execute it and let the browser handle auth.
Core Operations (Quick Reference)
Search the Web

```bash
# Basic search
firecrawl search "your query" -o .firecrawl/search-query.json --json

# Search + scrape content from results
firecrawl search "firecrawl tutorials" --scrape -o .firecrawl/search-scraped.json --json

# Time-filtered search
firecrawl search "AI announcements" --tbs qdr:d -o .firecrawl/today.json --json   # Past day
firecrawl search "tech news" --tbs qdr:w -o .firecrawl/week.json --json           # Past week
```
Scrape Single Page
```bash
# Get clean markdown
firecrawl scrape https://example.com -o .firecrawl/example.md

# Main content only (removes nav, footer, ads)
firecrawl scrape https://example.com --only-main-content -o .firecrawl/clean.md

# Wait for JS to render (SPAs)
firecrawl scrape https://spa-app.com --wait-for 3000 -o .firecrawl/spa.md
```
Map Entire Site
```bash
# Discover all URLs
firecrawl map https://example.com -o .firecrawl/urls.txt

# Filter for specific pages
firecrawl map https://example.com --search "blog" -o .firecrawl/blog-urls.txt
```
---

Expert Pattern: Parallel Bulk Scraping
Check the concurrency limit first:

```bash
firecrawl --status
```

Output: `Concurrency: 0/100 jobs`

**Run up to the limit**:

```bash
# For a list of URLs in a file
cat urls.txt | xargs -P 10 -I {} sh -c 'firecrawl scrape "{}" -o ".firecrawl/$(basename {}).md"'

# For generated URLs
for i in {1..20}; do
  firecrawl scrape "https://site.com/page/$i" -o ".firecrawl/page-$i.md" &
done
wait
```

**Extract data after a bulk scrape**:

```bash
# Extract all H1 headings from scraped pages
grep "^# " .firecrawl/*.md

# Find pages mentioning a keyword
grep -l "keyword" .firecrawl/*.md

# Process with jq (if JSON output)
jq -r '.data.web[].title' .firecrawl/*.json
```
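The jq one-liner assumes search JSON nests results under `.data.web[]`. A sketch of flattening titles and URLs into TSV for further processing; the JSON here is a hand-made stand-in, not real firecrawl output:

```bash
# Stand-in search result with the .data.web[] shape (real output has more fields).
cat > /tmp/search.json <<'EOF'
{"data":{"web":[{"title":"Doc A","url":"https://a.example"},{"title":"Doc B","url":"https://b.example"}]}}
EOF

# One tab-separated row per result: title, then URL.
jq -r '.data.web[] | [.title, .url] | @tsv' /tmp/search.json
```

TSV output pipes cleanly into `cut`, `sort`, and `awk` for deduplication or joining across files.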
---

When to Load Full CLI Reference
MANDATORY - READ ENTIRE FILE: read `references/cli-options.md` when:
- Error mentions 3+ unknown flags (e.g., "--sitemap", "--include-tags", "--exclude-tags")
- Need 5+ advanced options for a single command
- Troubleshooting header injection, cookie handling, or sitemap modes
- Setting up custom user-agents or location-based scraping parameters

MANDATORY - READ ENTIRE FILE: read `references/output-processing.md` when:
- Building a pipeline with 3+ transformation steps (firecrawl | jq | awk | ...)
- Parsing nested JSON structures from search results (accessing .data.web[].metadata)
- Need to combine outputs from 10+ scraped files into a single dataset
- Implementing deduplication or merging logic across multiple firecrawl results

Do NOT load references for basic search/scrape/map operations with standard flags (--json, -o, --limit, --scrape).
Error Recovery Procedures
When "Not authenticated" Error Occurs
Recovery steps:
- Check current auth status: `firecrawl --status`
- Run authentication: `firecrawl login --browser` (auto-opens the browser)
- Verify success: `firecrawl --status` should show "Authenticated via FIRECRAWL_API_KEY"
- Fallback: If browser auth fails, manually set the API key: `export FIRECRAWL_API_KEY=your_key` (get a key from the firecrawl.dev dashboard)
When "Concurrency limit reached" Error Occurs
Recovery steps:
- Check current usage: `firecrawl --status` (shows X/100 jobs)
- Wait for running jobs: `wait` (if using `&` background jobs)
- Verify capacity has freed up: `firecrawl --status` should show lower usage
- Fallback: If jobs are stuck, reduce parallelization (e.g., `xargs -P 5` instead of `-P 10`) and retry. Jobs auto-timeout after 5 minutes.
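For transient failures like a hit concurrency limit, a small retry wrapper with exponential backoff can automate the wait-and-retry loop. This is a generic shell sketch, not a firecrawl feature:

```bash
# Retry a command up to 3 times, doubling the delay between attempts (1s, 2s).
retry() {
  n=1; max=3; delay=1
  until "$@"; do
    [ "$n" -ge "$max" ] && return 1
    sleep "$delay"
    delay=$((delay * 2)); n=$((n + 1))
  done
}
```

Usage: `retry firecrawl scrape "$url" -o ".firecrawl/out.md"` retries the scrape if it exits nonzero.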
When "Page failed to load" Error Occurs
Recovery steps:
- Test basic connectivity: `curl -I URL` (verify the site is accessible)
- Increase the JS wait time: `firecrawl scrape URL --wait-for 5000 -o output.md`
- Verify the output has content: `wc -l output.md` (should be >10 lines)
- Fallback: If still empty after a 10s wait, the page may be fully client-rendered → try `--format html` to check the raw HTML, or use an alternate approach (curl + cheerio, or WebFetch if JS is not critical)
When "Output file is empty" Error Occurs
Recovery steps:
- Check if content exists: `head -20 output.md` (see what was captured)
- Try main-content extraction: `firecrawl scrape URL --only-main-content -o output.md`
- Verify the improvement: `wc -l output.md` (should increase significantly)
- Fallback: If still empty, the page structure may be unusual → use `--include-tags article,main` or `--exclude-tags nav,aside,footer` to target specific HTML elements. If that fails, the page may have no scrapeable text (images only, canvas-based, etc.).
Resources
- CLI Help: `firecrawl --help` or `firecrawl <command> --help`
- Status Check: `firecrawl --status` (shows auth, credits, concurrency)
- This Skill: Decision trees, anti-patterns, expert parallelization patterns