Firecrawl & Jina Web Scraping
Firecrawl vs WebFetch
Prefer `firecrawl scrape URL --only-main-content` over the WebFetch tool: it produces cleaner markdown, handles JavaScript-heavy pages, and avoids content truncation (>80% benchmark coverage). WebFetch is acceptable as a fallback when Firecrawl is unavailable.
Token-Efficient Scraping
Inspired by Anthropic's dynamic filtering: always filter before reasoning. This reduced input tokens by ~24% and improved accuracy by ~11% in their benchmarks.
The Principle: Search → Filter → Scrape → Filter → Reason
**DO:** Search (titles/URLs only) → Evaluate relevance → Scrape top hits → Filter by section → Reason

**DON'T:** Search → Scrape everything → Reason over all of it
Step-by-Step Efficient Workflow
Step 1: Search — get titles/URLs only (cheap)

```bash
firecrawl search "query" --limit 20
```
Step 2: Evaluate results, pick 3-5 best URLs
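Step 2 has no command because it is a relevance judgment; as a minimal sketch of automating it (a hypothetical helper, not part of this skill), hits can be ranked by keyword overlap in their titles:

```python
def pick_best(results, query_terms, k=5):
    """Rank search hits by keyword overlap in the title; keep the top k URLs.

    results: list of {'title': str, 'url': str} dicts from the search step.
    Purely illustrative scoring -- real relevance judgment is usually manual.
    """
    def score(r):
        title = r["title"].lower()
        return sum(term.lower() in title for term in query_terms)

    ranked = sorted(results, key=score, reverse=True)
    return [r["url"] for r in ranked[:k] if score(r) > 0]
```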
Step 3: Scrape only those, filter to relevant sections

```bash
firecrawl scrape URL1 --only-main-content |
  python3 ~/.claude/skills/firecrawl/scripts/filter_web_results.py \
    --sections "API,Authentication" --max-chars 5000
```
Post-Processing with filter_web_results.py
Pipe any Firecrawl or Exa output through this script to reduce context before reasoning:
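To make the idea concrete, here is a minimal sketch of section filtering (this is not the bundled `filter_web_results.py`, just an illustration): keep only the markdown sections whose heading mentions a requested name.

```python
import re

def filter_sections(markdown: str, wanted: list[str]) -> str:
    """Keep only markdown sections whose heading mentions a wanted name."""
    out, keep = [], False
    for line in markdown.splitlines():
        if re.match(r"#{1,6}\s", line):  # each heading starts a new section
            keep = any(w.lower() in line.lower() for w in wanted)
        if keep:
            out.append(line)
    return "\n".join(out)
```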
Extract only matching sections from scraped page:

```bash
firecrawl scrape URL --only-main-content |
  python3 ~/.claude/skills/firecrawl/scripts/filter_web_results.py --sections "Pricing,Plans"
```
Keep only paragraphs with keywords:

```bash
firecrawl search "query" --scrape --pretty |
  python3 ~/.claude/skills/firecrawl/scripts/filter_web_results.py --keywords "pricing,cost" --max-chars 5000
```
Extract specific JSON fields from API output:

```bash
python3 ~/.claude/skills/exa-search/scripts/exa_search.py "query" --json |
  python3 ~/.claude/skills/firecrawl/scripts/filter_web_results.py --fields "title,url,text" --max-chars 3000
```
Combine filters with stats:

```bash
firecrawl scrape URL --only-main-content |
  python3 ~/.claude/skills/firecrawl/scripts/filter_web_results.py --sections "API" --keywords "endpoint" --compact --stats
```

**Full path:** `python3 ~/.claude/skills/firecrawl/scripts/filter_web_results.py`
**Flags:** `--sections`, `--keywords`, `--max-chars`, `--max-lines`, `--fields` (JSON), `--strip-links`, `--strip-images`, `--compact`, `--stats`
Other Token-Saving Patterns
- Use `--only-main-content` to strip navigation and footer boilerplate, reducing token consumption. Omit it only when nav/footer content is specifically needed.
- Use `firecrawl map URL --search "topic"` first to find relevant subpages before scraping
- Use `firecrawl search` (titles/URLs only, no `--scrape`) first to get a URL list, evaluate, then scrape selectively
- Use the filter script with `--max-chars` to cap extraction length
- Use summary output from the Python API script (e.g. `--formats summary`) over full text when you need the gist, not raw content
Claude API Native Tools (for API Agent Builders)
Anthropic's API now offers built-in dynamic filtering tools: `web_search_20260209` / `web_fetch_20260209`, enabled with the header `anthropic-beta: code-execution-web-tools-2026-02-09`. These have built-in dynamic filtering via code execution. Use them when building Claude API agents directly. Use Firecrawl/Exa when you need autonomous agents, batch scraping, structured extraction, domain-specific crawling, or when not on the Claude API.
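A hedged sketch of what enabling these tools in a Messages API request might look like — only the tool `type` strings and the beta header come from this document; the model name and tool `name` fields are placeholders, so verify the exact payload shape against Anthropic's documentation:

```python
# Assumed payload shape for the native web tools. Only the versioned tool
# type strings and the beta header are taken from the text above.
payload = {
    "model": "claude-sonnet-4-5",  # placeholder model name
    "max_tokens": 1024,
    "tools": [
        {"type": "web_search_20260209", "name": "web_search"},
        {"type": "web_fetch_20260209", "name": "web_fetch"},
    ],
    "messages": [
        {"role": "user", "content": "Summarize the Firecrawl pricing page"}
    ],
}
headers = {"anthropic-beta": "code-execution-web-tools-2026-02-09"}
```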
1. Official Firecrawl CLI (`firecrawl`) — Primary
Setup: `npm install -g firecrawl-cli && firecrawl login --api-key $FIRECRAWL_API_KEY`
| Command | Purpose | Quick Example |
|---|---|---|
| `scrape` | Single page → markdown | `firecrawl scrape URL --only-main-content` |
| `crawl` | Entire site with progress | `firecrawl crawl URL --wait --progress --limit 50` |
| `map` | Discover all URLs on a site | `firecrawl map URL --search "API"` |
| `search` | Web search (+ optional scrape) | `firecrawl search "query" --limit 10` |

Full CLI reference: `references/cli-reference.md`
2. Auto-Save Alias (`fc-save`) — Shell Alias
Requires shell alias setup (not bundled with this skill).
→ Saves to `~/Desktop/Screencaps & Chats/Web-Scrapes/docs-example-com-api.md`
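Since the alias is not bundled, one possible definition — an assumed sketch, with the slug rule inferred from the example filename above:

```shell
# Hypothetical fc-save: scrape a URL and save markdown under a slug derived
# from the URL (e.g. https://docs.example.com/api -> docs-example-com-api.md).
fc-save() {
  local url="$1"
  local dir="$HOME/Desktop/Screencaps & Chats/Web-Scrapes"
  local slug
  # Strip the scheme, collapse non-alphanumeric runs to '-', trim trailing '-'
  slug=$(printf '%s' "$url" | sed -E 's#^https?://##; s#[^A-Za-z0-9]+#-#g; s#-+$##')
  mkdir -p "$dir"
  firecrawl scrape "$url" --only-main-content -o "$dir/$slug.md"
}
```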
3. Python API Script (`firecrawl_api.py`) — Advanced Features
Command: `python3 ~/.claude/skills/firecrawl/scripts/firecrawl_api.py <command>`
Requires: `FIRECRAWL_API_KEY` env var, `pip install firecrawl-py requests`
| Command | Purpose | Quick Example |
|---|---|---|
| `search` | Web search with scraping | `firecrawl_api.py search "query" -n 10` |
| `scrape` | Single URL with page actions | `firecrawl_api.py scrape URL --formats markdown summary` |
| `batch-scrape` | Multiple URLs concurrently | `firecrawl_api.py batch-scrape URL1 URL2 URL3` |
| `crawl` | Website crawling | `firecrawl_api.py crawl URL --limit 20` |
| `map` | URL discovery | `firecrawl_api.py map URL --search "query"` |
| `extract` | LLM-powered structured extraction | `firecrawl_api.py extract URL --prompt "Find pricing"` |
| `agent` | Autonomous extraction (no URLs needed) | `firecrawl_api.py agent "Find YC W24 AI startups"` |
| `parallel-agent` | Bulk agent queries (v2.8.0+) | `firecrawl_api.py parallel-agent "Q1" "Q2" "Q3"` |

Agent models: (10 credits, simple), (default), (thorough)

Full Python API reference: `references/python-api-reference.md`
4. DeepWiki — GitHub Repo Documentation
```bash
~/.claude/skills/firecrawl/scripts/deepwiki.sh <owner/repo> [section] [options]
```

AI-generated wiki for any public GitHub repo. No API key required.

```bash
# Repo overview
~/.claude/skills/firecrawl/scripts/deepwiki.sh karpathy/nanochat

# Table of contents
~/.claude/skills/firecrawl/scripts/deepwiki.sh langchain-ai/langchain --toc

# Specific section
~/.claude/skills/firecrawl/scripts/deepwiki.sh karpathy/nanochat 4.1-gpt-transformer-implementation

# Full dump for RAG
~/.claude/skills/firecrawl/scripts/deepwiki.sh openai/openai-python --all --save
```
5. Jina Reader (`jina`) — Fallback
Use when Firecrawl fails or for Twitter/X URLs (Firecrawl blocks Twitter, Jina works).
```bash
jina https://x.com/username/status/123456
```
Firecrawl vs Exa vs Native Claude Tools
| Need | Best Tool | Why |
|---|---|---|
| Single page → markdown | `firecrawl scrape --only-main-content` | Cleanest output |
| Search + scrape in one shot | `firecrawl search --scrape` | Combined operation |
| Crawl entire site | `firecrawl crawl --wait --progress` | Link following + progress |
| Autonomous data finding | `firecrawl_api.py agent` | No URLs needed |
| Semantic/neural search | Exa | AI-powered relevance |
| Find research papers | Exa `--category "research paper"` | Academic index |
| Quick research answer | Exa | Citations + synthesis |
| Find similar pages | Exa | Competitive analysis |
| Claude API agent building | Native `web_search` / `web_fetch` | Built-in dynamic filtering |
| Twitter/X content | Jina | Only tool that works |
| GitHub repo docs | DeepWiki | AI-generated wiki |
| Anti-bot / Cloudflare bypass | stealth fetch | Local Turnstile solver |
| Element-level extraction | CSS selectors | Precision targeting, adaptive tracking |
| No API key scraping | HTTP fetch | 100% local, no credentials |
| Site redesign resilience | adaptive mode | SQLite similarity matching |
```bash
firecrawl scrape https://example.com/page --only-main-content
```

Or auto-save: `fc-save URL`
Or to file: `firecrawl scrape URL --only-main-content -o page.md`
Documentation Crawling
Map first, then crawl relevant paths

```bash
firecrawl search "machine learning best practices 2026" --scrape --scrape-formats markdown
```
Agent-Powered Research (No URLs Needed)
```bash
python3 ~/.claude/skills/firecrawl/scripts/firecrawl_api.py agent \
  "Compare pricing tiers for Firecrawl, Apify, and ScrapingBee"
```
Check status and credits:

```bash
firecrawl --status && firecrawl credit-usage
```

Re-authenticate if needed:

```bash
firecrawl logout && firecrawl login --api-key $FIRECRAWL_API_KEY
```
- **Check the key is set:** `echo $FIRECRAWL_API_KEY`
- **Scrape fails:** Try `jina URL`, or add `--wait-for 3000` for JS-heavy sites
- **Async job stuck:** Check with `crawl-status`/`batch-status`, cancel with `crawl-cancel`/`batch-cancel`
- **Disable telemetry:** `export FIRECRAWL_NO_TELEMETRY=1`
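The scrape-failure fixes above can be combined into a small wrapper — a hedged sketch, assuming `jina` accepts a bare URL as shown earlier:

```shell
# Try Firecrawl with a JS wait; on any failure, fall back to Jina Reader.
scrape_with_fallback() {
  local url="$1"
  firecrawl scrape "$url" --only-main-content --wait-for 3000 2>/dev/null \
    || jina "$url"
}
```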
---
Reference Documentation
| File | Contents |
|---|---|
| `references/cli-reference.md` | Full CLI parameter reference (scrape, crawl, map, search, fc-save, jina, deepwiki) |
| `references/python-api-reference.md` | Full Python API script reference (all commands, SDK examples) |
| `references/firecrawl-api.md` | Firecrawl Search API reference |
| `references/firecrawl-agent-api.md` | Agent API (spark models, parallel agents, webhooks) |
| `references/actions-reference.md` | Page actions for dynamic content (click, write, wait, scroll) |
| `references/branding-format.md` | Brand identity extraction (colors, fonts, UI) |
```bash
python3 ~/.claude/skills/firecrawl/scripts/test_firecrawl.py --quick         # Quick validation
python3 ~/.claude/skills/firecrawl/scripts/test_firecrawl.py                 # Full suite
python3 ~/.claude/skills/firecrawl/scripts/test_firecrawl.py --test scrape   # Specific test
```