# Apify
Web scraping and automation platform. Run pre-built Actors (scrapers) or create your own. Access thousands of ready-to-use scrapers for popular websites.
Official docs: https://docs.apify.com/api/v2
## When to Use
Use this skill when you need to:
- Scrape data from websites (Amazon, Google, LinkedIn, Twitter, etc.)
- Run pre-built web scrapers without coding
- Extract structured data from any website
- Automate web tasks at scale
- Store and retrieve scraped data
## Prerequisites
- Create an account at https://apify.com/
- Get your API token from https://console.apify.com/account#/integrations
Set environment variable:
```bash
export APIFY_API_TOKEN="apify_api_xxxxxxxxxxxxxxxxxxxxxxxx"
```

**Important:** When using `bash` in a command that pipes to another command, wrap the command containing `$VAR` in `bash -c '...'`. Due to a Claude Code bug, environment variables are silently cleared when pipes are used directly.

```bash
bash -c 'curl -s "https://api.example.com" -H "Authorization: Bearer $API_KEY"'
```
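The `bash -c` workaround can be demonstrated locally (the variable name and value here are dummies, not a real token):

```shell
# Dummy token for demonstration only
export API_KEY="demo-token"

# Wrapping the command that reads $API_KEY in bash -c keeps the
# variable visible even when the output is piped onward.
bash -c 'printf "Authorization: Bearer %s\n" "$API_KEY"' | head -n 1
```

Running this prints `Authorization: Bearer demo-token`, confirming the variable survived the pipe.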
## How to Use
### 1. Run an Actor (Async)
Start an Actor run asynchronously:
Write to `/tmp/apify_request.json`:

```json
{
"startUrls": [{"url": "https://example.com"}],
"maxPagesPerCrawl": 10,
"pageFunction": "async function pageFunction(context) { const { request, log, jQuery } = context; const $ = jQuery; const title = $(\"title\").text(); return { url: request.url, title }; }"
}
```

Then run:

```bash
bash -c 'curl -s -X POST "https://api.apify.com/v2/acts/apify~web-scraper/runs" --header "Authorization: Bearer ${APIFY_API_TOKEN}" --header "Content-Type: application/json" -d @/tmp/apify_request.json'
```

The response contains `id` (the run ID) and `defaultDatasetId` for fetching results.
### 2. Run Actor Synchronously
Wait for completion and get results directly (max 5 min):
Write to `/tmp/apify_request.json`:

```json
{
"startUrls": [{"url": "https://news.ycombinator.com"}],
"maxPagesPerCrawl": 1,
"pageFunction": "async function pageFunction(context) { const { request, log, jQuery } = context; const $ = jQuery; const title = $(\"title\").text(); return { url: request.url, title }; }"
}
```

Then run:

```bash
bash -c 'curl -s -X POST "https://api.apify.com/v2/acts/apify~web-scraper/run-sync-get-dataset-items" --header "Authorization: Bearer ${APIFY_API_TOKEN}" --header "Content-Type: application/json" -d @/tmp/apify_request.json'
```

### 3. Check Run Status
⚠️ **Important:** The `{runId}` below is a placeholder; replace it with the actual run ID from your async run response (found in `.data.id`). See the complete workflow example below.

Poll the run status:

```bash
# Replace {runId} with an actual ID like "HG7ML7M8z78YcAPEB"
bash -c 'curl -s "https://api.apify.com/v2/actor-runs/{runId}" --header "Authorization: Bearer ${APIFY_API_TOKEN}"' | jq -r '.data.status'
```
**Complete workflow example** (capture run ID and check status):

Write to `/tmp/apify_request.json`:

```json
{
"startUrls": [{"url": "https://example.com"}],
"maxPagesPerCrawl": 10
}
```

Then run:

```bash
# Step 1: Start an async run and capture the run ID
RUN_ID=$(bash -c 'curl -s -X POST "https://api.apify.com/v2/acts/apify~web-scraper/runs" --header "Authorization: Bearer ${APIFY_API_TOKEN}" --header "Content-Type: application/json" -d @/tmp/apify_request.json' | jq -r '.data.id')

# Step 2: Check the run status
bash -c "curl -s \"https://api.apify.com/v2/actor-runs/${RUN_ID}\" --header \"Authorization: Bearer ${APIFY_API_TOKEN}\"" | jq -r '.data.status'
```

**Statuses**: `READY`, `RUNNING`, `SUCCEEDED`, `FAILED`, `ABORTED`, `TIMED-OUT`
### 4. Get Dataset Items
⚠️ **Important:** The `{datasetId}` below is a placeholder; do not use it literally. You must replace it with the actual dataset ID from your run response (found in `.data.defaultDatasetId`). See the complete workflow example below for how to capture and use the real ID.

Fetch results from a completed run:

```bash
# Replace {datasetId} with an actual ID like "WkzbQMuFYuamGv3YF"
bash -c 'curl -s "https://api.apify.com/v2/datasets/{datasetId}/items" --header "Authorization: Bearer ${APIFY_API_TOKEN}"'
```
**Complete workflow example** (run async, wait, and fetch results):

Write to `/tmp/apify_request.json`:

```json
{
"startUrls": [{"url": "https://example.com"}],
"maxPagesPerCrawl": 10
}
```

Then run:

```bash
# Step 1: Start async run and capture IDs
RESPONSE=$(bash -c 'curl -s -X POST "https://api.apify.com/v2/acts/apify~web-scraper/runs" --header "Authorization: Bearer ${APIFY_API_TOKEN}" --header "Content-Type: application/json" -d @/tmp/apify_request.json')
RUN_ID=$(echo "$RESPONSE" | jq -r '.data.id')
DATASET_ID=$(echo "$RESPONSE" | jq -r '.data.defaultDatasetId')

# Step 2: Wait for completion (poll status)
while true; do
  STATUS=$(bash -c "curl -s \"https://api.apify.com/v2/actor-runs/${RUN_ID}\" --header \"Authorization: Bearer ${APIFY_API_TOKEN}\"" | jq -r '.data.status')
  echo "Status: $STATUS"
  [[ "$STATUS" == "SUCCEEDED" ]] && break
  [[ "$STATUS" == "FAILED" || "$STATUS" == "ABORTED" ]] && exit 1
  sleep 5
done

# Step 3: Fetch the dataset items
bash -c "curl -s \"https://api.apify.com/v2/datasets/${DATASET_ID}/items\" --header \"Authorization: Bearer ${APIFY_API_TOKEN}\""
```
**With pagination:**

```bash
# Replace {datasetId} with an actual ID
bash -c 'curl -s "https://api.apify.com/v2/datasets/{datasetId}/items?limit=100&offset=0" --header "Authorization: Bearer ${APIFY_API_TOKEN}"'
```
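The `limit`/`offset` pattern can be looped to page through a large dataset. A local sketch of the loop logic (no network access; `TOTAL` simulates the dataset size, and the real `curl` call is shown as a comment):

```shell
# Page through a dataset 100 items at a time; stop when a page
# comes back short. TOTAL stands in for the real dataset size.
TOTAL=250
LIMIT=100
OFFSET=0
FETCHED=0
while true; do
  remaining=$((TOTAL - OFFSET))
  count=$(( remaining < LIMIT ? remaining : LIMIT ))
  # Real call for each page:
  # bash -c 'curl -s "https://api.apify.com/v2/datasets/{datasetId}/items?limit='"${LIMIT}"'&offset='"${OFFSET}"'" --header "Authorization: Bearer ${APIFY_API_TOKEN}"'
  FETCHED=$((FETCHED + count))
  OFFSET=$((OFFSET + count))
  [ "$count" -lt "$LIMIT" ] && break
done
echo "fetched $FETCHED items"
```

With `TOTAL=250` the loop makes three passes (100, 100, 50 items) and prints `fetched 250 items`.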
### 5. Popular Actors

**Google Search Scraper**

Write to `/tmp/apify_request.json`:

```json
{
"queries": "web scraping tools",
"maxPagesPerQuery": 1,
"resultsPerPage": 10
}
```

Then run:

```bash
bash -c 'curl -s -X POST "https://api.apify.com/v2/acts/apify~google-search-scraper/run-sync-get-dataset-items?timeout=120" --header "Authorization: Bearer ${APIFY_API_TOKEN}" --header "Content-Type: application/json" -d @/tmp/apify_request.json'
```

**Website Content Crawler**

Write to `/tmp/apify_request.json`:

```json
{
"startUrls": [{"url": "https://docs.example.com"}],
"maxCrawlPages": 10,
"crawlerType": "cheerio"
}
```

Then run:

```bash
bash -c 'curl -s -X POST "https://api.apify.com/v2/acts/apify~website-content-crawler/run-sync-get-dataset-items?timeout=300" --header "Authorization: Bearer ${APIFY_API_TOKEN}" --header "Content-Type: application/json" -d @/tmp/apify_request.json'
```

**Instagram Scraper**

Write to `/tmp/apify_request.json`:

```json
{
"directUrls": ["https://www.instagram.com/apaborotnikov/"],
"resultsType": "posts",
"resultsLimit": 10
}
```

Then run:

```bash
bash -c 'curl -s -X POST "https://api.apify.com/v2/acts/apify~instagram-scraper/runs" --header "Authorization: Bearer ${APIFY_API_TOKEN}" --header "Content-Type: application/json" -d @/tmp/apify_request.json'
```

**Amazon Product Scraper**

Write to `/tmp/apify_request.json`:

```json
{
"categoryOrProductUrls": [{"url": "https://www.amazon.com/dp/B0BSHF7WHW"}],
"maxItemsPerStartUrl": 1
}
```

Then run:

```bash
bash -c 'curl -s -X POST "https://api.apify.com/v2/acts/junglee~amazon-crawler/runs" --header "Authorization: Bearer ${APIFY_API_TOKEN}" --header "Content-Type: application/json" -d @/tmp/apify_request.json'
```

### 6. List Your Runs
Get recent Actor runs:
```bash
bash -c 'curl -s "https://api.apify.com/v2/actor-runs?limit=10&desc=true" --header "Authorization: Bearer ${APIFY_API_TOKEN}"' | jq '.data.items[] | {id, actId, status, startedAt}'
```

### 7. Abort a Run
⚠️ **Important:** The `{runId}` below is a placeholder; replace it with the actual run ID. See the complete workflow example below.

Stop a running Actor:

```bash
# Replace {runId} with an actual ID like "HG7ML7M8z78YcAPEB"
bash -c 'curl -s -X POST "https://api.apify.com/v2/actor-runs/{runId}/abort" --header "Authorization: Bearer ${APIFY_API_TOKEN}"'
```
**Complete workflow example** (start a run and abort it):

Write to `/tmp/apify_request.json`:

```json
{
"startUrls": [{"url": "https://example.com"}],
"maxPagesPerCrawl": 100
}
```

Then run:

```bash
# Step 1: Start an async run and capture the run ID
RUN_ID=$(bash -c 'curl -s -X POST "https://api.apify.com/v2/acts/apify~web-scraper/runs" --header "Authorization: Bearer ${APIFY_API_TOKEN}" --header "Content-Type: application/json" -d @/tmp/apify_request.json' | jq -r '.data.id')
echo "Started run: $RUN_ID"

# Step 2: Abort the run
bash -c "curl -s -X POST \"https://api.apify.com/v2/actor-runs/${RUN_ID}/abort\" --header \"Authorization: Bearer ${APIFY_API_TOKEN}\""
```
### 8. List Available Actors

Browse public Actors:

```bash
bash -c 'curl -s "https://api.apify.com/v2/store?limit=20&category=ECOMMERCE" --header "Authorization: Bearer ${APIFY_API_TOKEN}"' | jq '.data.items[] | {name, username, title}'
```

## Popular Actors Reference
| Actor ID | Description |
|---|---|
| `apify~web-scraper` | General web scraper |
| `apify~website-content-crawler` | Crawl entire websites |
| `apify~google-search-scraper` | Google search results |
| `apify~instagram-scraper` | Instagram posts/profiles |
| `junglee~amazon-crawler` | Amazon products |
| *(search the store)* | Twitter/X posts |
| *(search the store)* | YouTube videos |
| *(search the store)* | LinkedIn profiles |
| *(search the store)* | Google Maps places |

Find more at: https://apify.com/store
## Run Options

| Parameter | Type | Description |
|---|---|---|
| `timeout` | number | Run timeout in seconds |
| `memory` | number | Memory in MB (128, 256, 512, 1024, 2048, 4096) |
| `maxItems` | number | Max items to return (for sync endpoints) |
| `build` | string | Actor build tag (default: "latest") |
| `waitForFinish` | number | Wait time in seconds (for async runs) |
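These run options are passed as query parameters on the run endpoints. A sketch of composing such a URL (the actor name and option values are illustrative, and the parameter names are assumed from the table above):

```shell
# Compose a run URL with options as query parameters
ACTOR="apify~web-scraper"
OPTS="timeout=120&memory=1024&build=latest"
URL="https://api.apify.com/v2/acts/${ACTOR}/runs?${OPTS}"
echo "$URL"

# Then POST to it as in the examples above (requires APIFY_API_TOKEN):
# bash -c 'curl -s -X POST "'"${URL}"'" --header "Authorization: Bearer ${APIFY_API_TOKEN}" --header "Content-Type: application/json" -d @/tmp/apify_request.json'
```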
## Response Format

Run object:

```json
{
"data": {
"id": "HG7ML7M8z78YcAPEB",
"actId": "HDSasDasz78YcAPEB",
"status": "SUCCEEDED",
"startedAt": "2024-01-01T00:00:00.000Z",
"finishedAt": "2024-01-01T00:01:00.000Z",
"defaultDatasetId": "WkzbQMuFYuamGv3YF",
"defaultKeyValueStoreId": "tbhFDFDh78YcAPEB"
}
}
```
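To pull fields out of a run object like this one, pipe it through `jq` (a canned, truncated response is inlined here so the snippet runs without an API call):

```shell
# Sample run object standing in for a real API response
RESPONSE='{"data":{"id":"HG7ML7M8z78YcAPEB","status":"SUCCEEDED","defaultDatasetId":"WkzbQMuFYuamGv3YF"}}'

RUN_ID=$(echo "$RESPONSE" | jq -r '.data.id')
DATASET_ID=$(echo "$RESPONSE" | jq -r '.data.defaultDatasetId')
echo "run=${RUN_ID} dataset=${DATASET_ID}"
```

This prints `run=HG7ML7M8z78YcAPEB dataset=WkzbQMuFYuamGv3YF`; the same two extractions are used in the complete workflow examples above.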
## Guidelines
- **Sync vs Async**: Use `run-sync-get-dataset-items` for quick tasks (<5 min), async for longer jobs
- **Rate Limits**: 250,000 requests/min globally, 400/sec per resource
- **Memory**: Higher memory = faster execution but more credits
- **Timeouts**: Default varies by Actor; set an explicit timeout for sync calls
- **Pagination**: Use `limit` and `offset` for large datasets
- **Actor Input**: Each Actor has a different input schema; check the Actor's page for details
- **Credits**: Check usage at https://console.apify.com/billing
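If you hit the rate limits above, backing off and retrying is the usual remedy. A local sketch of exponential backoff (`call_api` is a hypothetical stand-in that succeeds on the third attempt; in real use it would be the `curl` call plus a check for a rate-limit response):

```shell
# Hypothetical stand-in for the real API call; fails twice, then succeeds
attempt=0
call_api() {
  attempt=$((attempt + 1))
  [ "$attempt" -ge 3 ]
}

delay=1
for try in 1 2 3 4 5; do
  if call_api; then
    echo "succeeded after $attempt attempts"
    break
  fi
  # In real use: sleep "$delay" before retrying
  delay=$((delay * 2))
done
```

Doubling the delay between attempts keeps a burst of retries from compounding the rate-limit problem.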