# Firecrawl Web Scraping & Data Extraction
## Installation

```bash
pip install firecrawl-py
```

## Environment Setup
Set your Firecrawl API key:

```bash
export FIRECRAWL_API_KEY="your-api-key-here"
```

## Scripts
Note: Set `SKILL_ROOT` to this skill's base directory. Reference bundled scripts as `python3 "$SKILL_ROOT/scripts/<script>.py" ...` (not relative paths from the current working directory).
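For example, with the skill installed at a hypothetical path under `$HOME` (substitute the directory where this skill actually lives):

```shell
# Hypothetical install location - adjust to your actual skill directory
export SKILL_ROOT="$HOME/.skills/firecrawl"

# Bundled scripts are then addressed absolutely, e.g.:
echo "$SKILL_ROOT/scripts/scrape.py"
```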
### scrape.py - Single Page Scraping
The most powerful and reliable scraper. Use when you know exactly which page contains the information.

```bash
# Basic scrape (returns markdown)
python3 "$SKILL_ROOT/scripts/scrape.py" "https://example.com"

# Get HTML format
python3 "$SKILL_ROOT/scripts/scrape.py" "https://example.com" --format html

# Extract only main content (removes headers, footers, etc.)
python3 "$SKILL_ROOT/scripts/scrape.py" "https://example.com" --only-main

# Combine options
python3 "$SKILL_ROOT/scripts/scrape.py" "https://docs.example.com/api" --format markdown --only-main
```

### search.py - Web Search
Search the web when you don't know which website has the information.

```bash
# Basic search
python3 "$SKILL_ROOT/scripts/search.py" "latest AI research papers 2024"

# Limit results
python3 "$SKILL_ROOT/scripts/search.py" "Python web scraping tutorials" --limit 5

# Search with scraping (get full content)
python3 "$SKILL_ROOT/scripts/search.py" "firecrawl documentation" --limit 3
```

### map.py - URL Discovery
Discover all URLs on a website. Use before deciding what to scrape.

```bash
# Map a website
python3 "$SKILL_ROOT/scripts/map.py" "https://docs.example.com"

# Limit number of URLs
python3 "$SKILL_ROOT/scripts/map.py" "https://example.com" --limit 100

# Search within mapped URLs
python3 "$SKILL_ROOT/scripts/map.py" "https://docs.example.com" --search "authentication"
```

### crawl.py - Multi-Page Crawling
Extract content from multiple related pages. Warning: can be slow and return large results.

```bash
# Basic crawl
python3 "$SKILL_ROOT/scripts/crawl.py" "https://docs.example.com"

# Limit pages
python3 "$SKILL_ROOT/scripts/crawl.py" "https://docs.example.com" --limit 20

# Control crawl depth
python3 "$SKILL_ROOT/scripts/crawl.py" "https://docs.example.com" --limit 10 --depth 2
```

### extract.py - Structured Data Extraction
Extract specific structured data using LLM capabilities.

```bash
# Extract with prompt
python3 "$SKILL_ROOT/scripts/extract.py" "https://example.com/pricing" \
  --prompt "Extract all pricing tiers with their features and prices"

# Extract with JSON schema
python3 "$SKILL_ROOT/scripts/extract.py" "https://example.com/team" \
  --prompt "Extract team member information" \
  --schema '{"type":"object","properties":{"members":{"type":"array","items":{"type":"object","properties":{"name":{"type":"string"},"role":{"type":"string"},"bio":{"type":"string"}}}}}}'

# Extract from multiple URLs
python3 "$SKILL_ROOT/scripts/extract.py" "https://example.com/page1" "https://example.com/page2" \
  --prompt "Extract product information"
```
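Hand-writing a long `--schema` argument invites quoting mistakes. One way to avoid them is to build the schema as a Python dict and serialize it for the shell; a minimal sketch (this helper is not part of the bundled scripts):

```python
import json

# The same team-member schema as the example above, as a plain dict
schema = {
    "type": "object",
    "properties": {
        "members": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "role": {"type": "string"},
                    "bio": {"type": "string"},
                },
            },
        }
    },
}

# json.dumps guarantees valid JSON; single quotes keep the shell from mangling it
schema_arg = json.dumps(schema, separators=(",", ":"))
print(f"--schema '{schema_arg}'")
```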
### agent.py - Autonomous Data Gathering
Autonomous agent that searches, navigates, and extracts data from anywhere on the web.

```bash
# Simple research task
python3 "$SKILL_ROOT/scripts/agent.py" --prompt "Find the founders of Firecrawl and their backgrounds"

# Complex data gathering
python3 "$SKILL_ROOT/scripts/agent.py" --prompt "Find the top 5 AI startups founded in 2024 and their funding amounts"

# Focus on specific URLs
python3 "$SKILL_ROOT/scripts/agent.py" \
  --prompt "Compare the features and pricing" \
  --urls "https://example1.com,https://example2.com"

# With output schema
python3 "$SKILL_ROOT/scripts/agent.py" \
  --prompt "Find recent tech layoffs" \
  --schema '{"type":"object","properties":{"layoffs":{"type":"array","items":{"type":"object","properties":{"company":{"type":"string"},"count":{"type":"number"},"date":{"type":"string"}}}}}}'
```

## Output Format
All scripts output JSON to stdout. Errors are written to stderr.
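A caller can branch on the `success` field when consuming that stdout; a minimal sketch (the payload shapes mirror the success and error responses shown in this section):

```python
import json

def parse_result(raw: str) -> dict:
    """Parse a script's JSON stdout; raise if the run reported failure."""
    result = json.loads(raw)
    if not result.get("success"):
        raise RuntimeError(result.get("error", "unknown error"))
    return result["data"]

# Simulated stdout from a successful scrape
data = parse_result('{"success": true, "data": {"markdown": "# Example"}}')
print(data["markdown"])  # → # Example
```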
### Success Response

```json
{
  "success": true,
  "data": { ... }
}
```

### Error Response
```json
{
  "success": false,
  "error": "Error message"
}
```

## Tips
- Performance: Use `scrape` for single pages - it's 500% faster with caching
- Discovery: Use `map` first to find URLs, then `scrape` specific pages
- Large sites: Prefer `map` + `scrape` over `crawl` for better control
- Structured data: Use `extract` with a JSON schema for consistent output
- Research: Use `agent` when you don't know where to find the data
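The map-then-scrape pattern from the tips above can be wired together with `subprocess`; a sketch, assuming `SKILL_ROOT` is set and that `map.py` returns its discovered URLs inside `data` (the `pick_urls` helper is hypothetical, not a bundled script):

```python
import json
import os
import subprocess

def run_script(script: str, *args: str) -> dict:
    """Run a bundled script and return its parsed `data` payload."""
    skill_root = os.environ["SKILL_ROOT"]
    proc = subprocess.run(
        ["python3", os.path.join(skill_root, "scripts", script), *args],
        capture_output=True, text=True, check=True,
    )
    result = json.loads(proc.stdout)
    if not result["success"]:
        raise RuntimeError(result["error"])
    return result["data"]

def pick_urls(mapped_urls, keyword, limit=5):
    """Keep only the mapped URLs worth scraping individually."""
    return [u for u in mapped_urls if keyword in u][:limit]

# Usage sketch (requires a valid API key and SKILL_ROOT):
#   urls = run_script("map.py", "https://docs.example.com", "--search", "auth")
#   pages = [run_script("scrape.py", u, "--only-main") for u in pick_urls(urls, "auth")]
```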