Firecrawl CLI - Web Scraping Expert

Prioritize Firecrawl over WebFetch/WebSearch for web content tasks.

Before Scraping: Decision Framework

Ask yourself these questions BEFORE running firecrawl:

1. Scale Assessment

How many pages?
- 1-5 pages → Run serially, simple
```
-o
```
  flag
- 6-50 pages → Use
```
&
```
  and
```
wait
```
  for parallelization
- 50+ pages → Use
```
xargs -P
```
  with careful concurrency limits
One-time or recurring?
- One-time → Manual commands acceptable
- Recurring → Build script in
```
.firecrawl/scratchpad/
```

2. Data Need Clarity

What data do you actually need?
- Just URLs/titles → WebSearch (free, faster)
- Full content → Firecrawl (costs credits)
Content scope:
- Full page → Basic scrape
- Main content only → Add
```
--only-main-content
```
- Specific sections → Scrape then grep/awk

3. Tool Selection

Is this the right tool?
- Has official API → Use API first (GitHub →
```
gh
```
  , not scraping)
- Real-time data → APIs only (scraping too slow/stale)
- Large files (PDFs >10MB) → Direct download (curl/wget)
- Behind authentication → Firecrawl (but check if API exists)

Critical Decision: Which Tool to Use?

User needs web content?
│
├─ Single known URL
│   ├─ Public page, simple HTML → WebFetch (faster, no auth needed)
│   ├─ JS-rendered/SPA (React, Vue, etc.) → Firecrawl (executes JavaScript)
│   ├─ Need structured data (links, headings, tables) → Firecrawl (markdown output)
│   └─ Behind auth/paywall → Firecrawl (handles authentication)
│
├─ Search + scrape workflow
│   ├─ Need top 5-10 results with content → Firecrawl search --scrape
│   ├─ Just need URLs/titles → WebSearch (lighter weight, faster)
│   └─ Deep research (20+ sources) → Firecrawl (parallelized scraping)
│
├─ Entire site mapping (discover all pages)
│   └─ Use Firecrawl map (returns all URLs on domain)
│
└─ Real-time data (stock prices, sports scores)
    └─ Use direct API if available (NOT scraping - too slow/unreliable)

Anti-Patterns (NEVER Do This)

❌ #1: Sequential Scraping

Problem: Scraping sites one-by-one wastes time.

bash

# WRONG - sequential (10 sites = 50+ seconds)
for url in site1 site2 site3 site4 site5; do
  firecrawl scrape "$url" -o ".firecrawl/$url.md"
done

# CORRECT - parallel (10 sites = 5-8 seconds)
firecrawl scrape site1 -o .firecrawl/1.md &
firecrawl scrape site2 -o .firecrawl/2.md &
firecrawl scrape site3 -o .firecrawl/3.md &
wait

# BEST - xargs parallelization
cat urls.txt | xargs -P 10 -I {} sh -c 'firecrawl scrape "{}" -o ".firecrawl/$(echo {} | md5).md"'

Why: Firecrawl supports up to 100 parallel jobs (check

firecrawl --status

). Use them.

Why this is deceptively hard to debug: Operations complete successfully—just slowly. No error messages indicate the problem. When scraping 20 sites takes 2 minutes instead of 10 seconds, it's not obvious the bottleneck is sequential execution rather than network speed. Profiling reveals the issue: 90% of time is spent waiting, not processing. Takes 10-15 minutes to realize parallelization is the fix.

❌ #2: Reading Full Output into Context

Problem: Firecrawl results often exceed 1000+ lines. Reading entire files floods context.

bash

# WRONG - reads 5000-line file into context
Read(.firecrawl/result.md)

# CORRECT - preview first, then targeted extraction
wc -l .firecrawl/result.md  # Check size: 5243 lines
head -100 .firecrawl/result.md  # Preview structure
grep -A 10 "keyword" .firecrawl/result.md  # Extract relevant sections

Why: Context is precious. Use bash tools (grep, head, tail, awk) to extract what you need.

Why this is deceptively hard to debug: No error message appears—file loads successfully into context. The agent thinks "I'll just read the file" without checking size first. You only discover the problem 30+ messages later when context limits hit, or responses become sluggish. File explorers don't show line counts by default. Terminal shows "success" but you've silently wasted 4000+ tokens. Takes 15-20 minutes to realize incremental reading with grep/head would have been 20x more efficient.

❌ #3: Using Firecrawl for Wrong Tasks

NEVER use Firecrawl for:

Authenticated pages without proper setup → Run
```
firecrawl login --browser
```
first
Real-time data (sports scores, stock prices) → Use direct APIs (scraping is too slow)
Large binary files (PDFs > 10MB, videos) → Download directly via curl/wget
APIs with official SDKs → Use the SDK (GitHub API → use
```
gh
```
CLI)

Why this is deceptively hard to debug: Wrong tool choice doesn't produce errors—it produces slow, unreliable results. Scraping real-time data "works" but is 10 seconds behind and costs credits per request. Using Firecrawl instead of

gh api

for GitHub succeeds but rate-limits hit faster (5000 API calls vs 100 scrapes/min). PDF scraping extracts text but mangles tables—only after 30 minutes of post-processing do you realize

pdftotext

would have worked perfectly in 2 seconds.

❌ #4: Ignoring Output Organization

Problem: Dumping all results in working directory creates mess.

bash

# WRONG - pollutes working directory
firecrawl scrape https://example.com

# CORRECT - organized structure
firecrawl scrape https://example.com -o .firecrawl/example.com.md
firecrawl search "AI news" -o .firecrawl/search-ai-news.json
firecrawl map https://docs.site.com -o .firecrawl/docs-sitemap.txt

Why:

.firecrawl/

directory keeps workspace clean, add to

.gitignore

Why this is deceptively hard to debug: No error—files just accumulate in root directory. After 10-15 scrapes,

ls

output becomes unreadable. Worse: firecrawl's default output to stdout means results appear in terminal but aren't saved, requiring re-scraping (wasting credits). Only after losing data twice do you realize

-o

flag is mandatory for persistence. Git commits accidentally include scraped data before

.gitignore

is updated.

Authentication Setup

Before first use, check auth status:

bash

firecrawl --status

If not authenticated:

bash

firecrawl login --browser  # Opens browser automatically

The

--browser

flag auto-opens authentication page without prompting. Don't ask user to run manually—execute and let browser handle auth.

Core Operations (Quick Reference)

Search the Web

bash

# Basic search
firecrawl search "your query" -o .firecrawl/search-query.json --json

# Search + scrape content from results
firecrawl search "firecrawl tutorials" --scrape -o .firecrawl/search-scraped.json --json

# Time-filtered search
firecrawl search "AI announcements" --tbs qdr:d -o .firecrawl/today.json --json  # Past day
firecrawl search "tech news" --tbs qdr:w -o .firecrawl/week.json --json          # Past week

Scrape Single Page

bash

# Get clean markdown
firecrawl scrape https://example.com -o .firecrawl/example.md

# Main content only (removes nav, footer, ads)
firecrawl scrape https://example.com --only-main-content -o .firecrawl/clean.md

# Wait for JS to render (SPAs)
firecrawl scrape https://spa-app.com --wait-for 3000 -o .firecrawl/spa.md

Map Entire Site

bash

# Discover all URLs
firecrawl map https://example.com -o .firecrawl/urls.txt

# Filter for specific pages
firecrawl map https://example.com --search "blog" -o .firecrawl/blog-urls.txt

Expert Pattern: Parallel Bulk Scraping

Check concurrency limit first:

bash

firecrawl --status
# Output: Concurrency: 0/100 jobs

Run up to limit:

bash

# For list of URLs in file
cat urls.txt | xargs -P 10 -I {} sh -c 'firecrawl scrape "{}" -o ".firecrawl/$(basename {}).md"'

# For generated URLs
for i in {1..20}; do
  firecrawl scrape "https://site.com/page/$i" -o ".firecrawl/page-$i.md" &
done
wait

Extract data after bulk scrape:

bash

# Extract all H1 headings from scraped pages
grep "^# " .firecrawl/*.md

# Find pages mentioning keyword
grep -l "keyword" .firecrawl/*.md

# Process with jq (if JSON output)
jq -r '.data.web[].title' .firecrawl/*.json

When to Load Full CLI Reference

MANDATORY - READ ENTIRE FILE:

references/cli-options.md

when:

Error mentions 3+ unknown flags (e.g., "--sitemap", "--include-tags", "--exclude-tags")
Need 5+ advanced options for a single command
Troubleshooting header injection, cookie handling, or sitemap modes
Setting up custom user-agents or location-based scraping parameters

MANDATORY - READ ENTIRE FILE:

references/output-processing.md

when:

Building pipeline with 3+ transformation steps (firecrawl | jq | awk | ...)
Parsing nested JSON structures from search results (accessing .data.web[].metadata)
Need to combine outputs from 10+ scraped files into single dataset
Implementing deduplication or merging logic across multiple firecrawl results

Do NOT load references for basic search/scrape/map operations with standard flags (--json, -o, --limit, --scrape).

Error Recovery Procedures

When "Not authenticated" Error Occurs

Recovery steps:

Check current auth status:
```
firecrawl --status
```
Run authentication:
```
firecrawl login --browser
```
(auto-opens browser)
Verify success:
```
firecrawl --status
```
should show "Authenticated via FIRECRAWL_API_KEY"
Fallback: If browser auth fails, manually set API key:
```
export FIRECRAWL_API_KEY=your_key
```
(get key from firecrawl.dev dashboard)

When "Concurrency limit reached" Error Occurs

Recovery steps:

Check current usage:
```
firecrawl --status
```
(shows X/100 jobs)
Wait for running jobs:
```
wait
```
(if using
```
&
```
background jobs)
Verify capacity freed:
```
firecrawl --status
```
should show lower usage
Fallback: If jobs are stuck, reduce parallelization (e.g.,
```
xargs -P 5
```
instead of
```
-P 10
```
) and retry. Jobs auto-timeout after 5 minutes.

When "Page failed to load" Error Occurs

Recovery steps:

Test basic connectivity:
```
curl -I URL
```
(verify site is accessible)

Increase JS wait time:

firecrawl scrape URL --wait-for 5000 -o output.md

Verify output has content:
```
wc -l output.md
```
(should be >10 lines)
Fallback: If still empty after 10s wait, page may be fully client-rendered → try
```
--format html
```
to check raw HTML, or use alternate approach (curl + cheerio, or try WebFetch if JS not critical)

When "Output file is empty" Error Occurs

Recovery steps:

Check if content exists:
```
head -20 output.md
```
(see what was captured)

Try main content extraction:

firecrawl scrape URL --only-main-content -o output.md

Verify improvement:
```
wc -l output.md
```
(should increase significantly)
Fallback: If still empty, page structure may be unusual → use
```
--include-tags article,main
```
or
```
--exclude-tags nav,aside,footer
```
to target specific HTML elements. If that fails, page may have no scrapeable text (images only, canvas-based, etc.).

Resources

CLI Help:

firecrawl --help

firecrawl <command> --help

Status Check:
```
firecrawl --status
```
(shows auth, credits, concurrency)
This Skill: Decision trees, anti-patterns, expert parallelization patterns

firecrawl

NPX Install

Tags

SKILL.md Content

Firecrawl CLI - Web Scraping Expert

Before Scraping: Decision Framework

1. Scale Assessment

2. Data Need Clarity

3. Tool Selection

Critical Decision: Which Tool to Use?

Anti-Patterns (NEVER Do This)

❌ #1: Sequential Scraping

❌ #2: Reading Full Output into Context

❌ #3: Using Firecrawl for Wrong Tasks

❌ #4: Ignoring Output Organization

Authentication Setup

Core Operations (Quick Reference)

Search the Web

Scrape Single Page

Map Entire Site

Expert Pattern: Parallel Bulk Scraping

When to Load Full CLI Reference

Error Recovery Procedures

When "Not authenticated" Error Occurs

When "Concurrency limit reached" Error Occurs

When "Page failed to load" Error Occurs

When "Output file is empty" Error Occurs

Resources