# Firecrawl Web Scraping & Data Extraction
## Installation

```bash
pip install firecrawl-py
```

## Environment Setup
Set your Firecrawl API key:

```bash
export FIRECRAWL_API_KEY="your-api-key-here"
```

## Scripts
Note: Set `SKILL_ROOT` to this skill's base directory. Reference bundled scripts as `python3 "$SKILL_ROOT/scripts/<script>.py" ...` (not relative paths from the current working directory).
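For example, with the skill installed at a hypothetical path under `$HOME` (substitute the directory where this skill actually lives):

```shell
# Hypothetical install location - adjust to your actual skill directory
export SKILL_ROOT="$HOME/.skills/firecrawl"

# Bundled scripts are then addressed absolutely, e.g.:
echo "$SKILL_ROOT/scripts/scrape.py"
```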
### scrape.py - Single Page Scraping
The most powerful and reliable scraper. Use when you know exactly which page contains the information.

```bash
# Basic scrape (returns markdown)
python3 "$SKILL_ROOT/scripts/scrape.py" "https://example.com"

# Get HTML format
python3 "$SKILL_ROOT/scripts/scrape.py" "https://example.com" --format html

# Extract only main content (removes headers, footers, etc.)
python3 "$SKILL_ROOT/scripts/scrape.py" "https://example.com" --only-main

# Combine options
python3 "$SKILL_ROOT/scripts/scrape.py" "https://docs.example.com/api" --format markdown --only-main
```

### search.py - Web Search
Search the web when you don't know which website has the information.

```bash
# Basic search
python3 "$SKILL_ROOT/scripts/search.py" "latest AI research papers 2024"

# Limit results
python3 "$SKILL_ROOT/scripts/search.py" "Python web scraping tutorials" --limit 5

# Search with scraping (get full content)
python3 "$SKILL_ROOT/scripts/search.py" "firecrawl documentation" --limit 3
```

### map.py - URL Discovery
Discover all URLs on a website. Use before deciding what to scrape.

```bash
# Map a website
python3 "$SKILL_ROOT/scripts/map.py" "https://docs.example.com"

# Limit number of URLs
python3 "$SKILL_ROOT/scripts/map.py" "https://example.com" --limit 100

# Search within mapped URLs
python3 "$SKILL_ROOT/scripts/map.py" "https://docs.example.com" --search "authentication"
```

### crawl.py - Multi-Page Crawling
Extract content from multiple related pages. Warning: can be slow and return large results.

```bash
# Basic crawl
python3 "$SKILL_ROOT/scripts/crawl.py" "https://docs.example.com"

# Limit pages
python3 "$SKILL_ROOT/scripts/crawl.py" "https://docs.example.com" --limit 20

# Control crawl depth
python3 "$SKILL_ROOT/scripts/crawl.py" "https://docs.example.com" --limit 10 --depth 2
```

### extract.py - Structured Data Extraction
Extract specific structured data using LLM capabilities.

```bash
# Extract with prompt
python3 "$SKILL_ROOT/scripts/extract.py" "https://example.com/pricing" \
  --prompt "Extract all pricing tiers with their features and prices"

# Extract with JSON schema
python3 "$SKILL_ROOT/scripts/extract.py" "https://example.com/team" \
  --prompt "Extract team member information" \
  --schema '{"type":"object","properties":{"members":{"type":"array","items":{"type":"object","properties":{"name":{"type":"string"},"role":{"type":"string"},"bio":{"type":"string"}}}}}}'

# Extract from multiple URLs
python3 "$SKILL_ROOT/scripts/extract.py" "https://example.com/page1" "https://example.com/page2" \
  --prompt "Extract product information"
```
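Hand-writing a long `--schema` argument invites quoting mistakes. One way to avoid them is to build the schema as a Python dict and serialize it for the shell; a minimal sketch (this helper is not part of the bundled scripts):

```python
import json

# The same team-member schema as the example above, as a plain dict
schema = {
    "type": "object",
    "properties": {
        "members": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "role": {"type": "string"},
                    "bio": {"type": "string"},
                },
            },
        }
    },
}

# json.dumps guarantees valid JSON; single quotes keep the shell from mangling it
schema_arg = json.dumps(schema, separators=(",", ":"))
print(f"--schema '{schema_arg}'")
```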
### agent.py - Autonomous Data Gathering
Autonomous agent that searches, navigates, and extracts data from anywhere on the web.

```bash
# Simple research task
python3 "$SKILL_ROOT/scripts/agent.py" --prompt "Find the founders of Firecrawl and their backgrounds"

# Complex data gathering
python3 "$SKILL_ROOT/scripts/agent.py" --prompt "Find the top 5 AI startups founded in 2024 and their funding amounts"

# Focus on specific URLs
python3 "$SKILL_ROOT/scripts/agent.py" \
  --prompt "Compare the features and pricing" \
  --urls "https://example1.com,https://example2.com"

# With output schema
python3 "$SKILL_ROOT/scripts/agent.py" \
  --prompt "Find recent tech layoffs" \
  --schema '{"type":"object","properties":{"layoffs":{"type":"array","items":{"type":"object","properties":{"company":{"type":"string"},"count":{"type":"number"},"date":{"type":"string"}}}}}}'
```

## Output Format
All scripts output JSON to stdout. Errors are written to stderr.
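A caller can branch on the `success` field when consuming that stdout; a minimal sketch (the payload shapes mirror the success and error responses shown in this section):

```python
import json

def parse_result(raw: str) -> dict:
    """Parse a script's JSON stdout; raise if the run reported failure."""
    result = json.loads(raw)
    if not result.get("success"):
        raise RuntimeError(result.get("error", "unknown error"))
    return result["data"]

# Simulated stdout from a successful scrape
data = parse_result('{"success": true, "data": {"markdown": "# Example"}}')
print(data["markdown"])  # → # Example
```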
### Success Response

```json
{
  "success": true,
  "data": { ... }
}
```

### Error Response
```json
{
  "success": false,
  "error": "Error message"
}
```

## Tips
- Performance: Use `scrape` for single pages - it's 500% faster with caching
- Discovery: Use `map` first to find URLs, then `scrape` specific pages
- Large sites: Prefer `map` + `scrape` over `crawl` for better control
- Structured data: Use `extract` with a JSON schema for consistent output
- Research: Use `agent` when you don't know where to find the data
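The map-then-scrape pattern from the tips above can be wired together with `subprocess`; a sketch, assuming `SKILL_ROOT` is set and that `map.py` returns its discovered URLs inside `data` (the `pick_urls` helper is hypothetical, not a bundled script):

```python
import json
import os
import subprocess

def run_script(script: str, *args: str) -> dict:
    """Run a bundled script and return its parsed `data` payload."""
    skill_root = os.environ["SKILL_ROOT"]
    proc = subprocess.run(
        ["python3", os.path.join(skill_root, "scripts", script), *args],
        capture_output=True, text=True, check=True,
    )
    result = json.loads(proc.stdout)
    if not result["success"]:
        raise RuntimeError(result["error"])
    return result["data"]

def pick_urls(mapped_urls, keyword, limit=5):
    """Keep only the mapped URLs worth scraping individually."""
    return [u for u in mapped_urls if keyword in u][:limit]

# Usage sketch (requires a valid API key and SKILL_ROOT):
#   urls = run_script("map.py", "https://docs.example.com", "--search", "auth")
#   pages = [run_script("scrape.py", u, "--only-main") for u in pick_urls(urls, "auth")]
```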