# Apify

Web scraping and automation platform. Run pre-built Actors (scrapers) or create your own. Access thousands of ready-to-use scrapers for popular websites.


## When to Use

Use this skill when you need to:

- Scrape data from websites (Amazon, Google, LinkedIn, Twitter, etc.)
- Run pre-built web scrapers without coding
- Extract structured data from any website
- Automate web tasks at scale
- Store and retrieve scraped data

## Prerequisites

1. Create an account at https://apify.com/
2. Get your API token from https://console.apify.com/account#/integrations
3. Set the environment variable:

```bash
export APIFY_API_TOKEN="apify_api_xxxxxxxxxxxxxxxxxxxxxxxx"
```

**Important**: When using `$VAR` in a command that pipes to another command, wrap the command containing `$VAR` in `bash -c '...'`. Due to a Claude Code bug, environment variables are silently cleared when pipes are used directly.

```bash
bash -c 'curl -s "https://api.example.com" -H "Authorization: Bearer $API_KEY"'
```

## How to Use

### 1. Run an Actor (Async)

Start an Actor run asynchronously.

Write to `/tmp/apify_request.json`:

```json
{
  "startUrls": [{"url": "https://example.com"}],
  "maxPagesPerCrawl": 10,
  "pageFunction": "async function pageFunction(context) { const { request, log, jQuery } = context; const $ = jQuery; const title = $(\"title\").text(); return { url: request.url, title }; }"
}
```

Then run:

```bash
bash -c 'curl -s -X POST "https://api.apify.com/v2/acts/apify~web-scraper/runs" --header "Authorization: Bearer ${APIFY_API_TOKEN}" --header "Content-Type: application/json" -d @/tmp/apify_request.json'
```

The response contains `id` (the run ID) and `defaultDatasetId` for fetching results.

### 2. Run an Actor Synchronously

Wait for completion and get the results directly (max 5 min).

Write to `/tmp/apify_request.json`:

```json
{
  "startUrls": [{"url": "https://news.ycombinator.com"}],
  "maxPagesPerCrawl": 1,
  "pageFunction": "async function pageFunction(context) { const { request, log, jQuery } = context; const $ = jQuery; const title = $(\"title\").text(); return { url: request.url, title }; }"
}
```

Then run:

```bash
bash -c 'curl -s -X POST "https://api.apify.com/v2/acts/apify~web-scraper/run-sync-get-dataset-items" --header "Authorization: Bearer ${APIFY_API_TOKEN}" --header "Content-Type: application/json" -d @/tmp/apify_request.json'
```

### 3. Check Run Status

⚠️ **Important**: The `{runId}` below is a placeholder. Replace it with the actual run ID from your async run response (found in `.data.id`). See the complete workflow example below.

Poll the run status:

```bash
# Replace {runId} with an actual ID like "HG7ML7M8z78YcAPEB"
bash -c 'curl -s "https://api.apify.com/v2/actor-runs/{runId}" --header "Authorization: Bearer ${APIFY_API_TOKEN}"' | jq -r '.data.status'
```

**Complete workflow example** (capture the run ID and check status):

Write to `/tmp/apify_request.json`:

```json
{
  "startUrls": [{"url": "https://example.com"}],
  "maxPagesPerCrawl": 10
}
```

Then run:

```bash
# Step 1: Start an async run and capture the run ID
RUN_ID=$(bash -c 'curl -s -X POST "https://api.apify.com/v2/acts/apify~web-scraper/runs" --header "Authorization: Bearer ${APIFY_API_TOKEN}" --header "Content-Type: application/json" -d @/tmp/apify_request.json' | jq -r '.data.id')

# Step 2: Check the run status
bash -c "curl -s \"https://api.apify.com/v2/actor-runs/${RUN_ID}\" --header \"Authorization: Bearer ${APIFY_API_TOKEN}\"" | jq -r '.data.status'
```

**Statuses**: `READY`, `RUNNING`, `SUCCEEDED`, `FAILED`, `ABORTED`, `TIMED-OUT`

### 4. Get Dataset Items

⚠️ **Important**: The `{datasetId}` below is a placeholder - do not use it literally! Replace it with the actual dataset ID from your run response (found in `.data.defaultDatasetId`). See the complete workflow example below for how to capture and use the real ID.

Fetch results from a completed run:

```bash
# Replace {datasetId} with an actual ID like "WkzbQMuFYuamGv3YF"
bash -c 'curl -s "https://api.apify.com/v2/datasets/{datasetId}/items" --header "Authorization: Bearer ${APIFY_API_TOKEN}"'
```

**Complete workflow example** (run async, wait, and fetch results):

Write to `/tmp/apify_request.json`:

```json
{
  "startUrls": [{"url": "https://example.com"}],
  "maxPagesPerCrawl": 10
}
```

Then run:

```bash
# Step 1: Start an async run and capture the IDs
RESPONSE=$(bash -c 'curl -s -X POST "https://api.apify.com/v2/acts/apify~web-scraper/runs" --header "Authorization: Bearer ${APIFY_API_TOKEN}" --header "Content-Type: application/json" -d @/tmp/apify_request.json')
RUN_ID=$(echo "$RESPONSE" | jq -r '.data.id')
DATASET_ID=$(echo "$RESPONSE" | jq -r '.data.defaultDatasetId')

# Step 2: Wait for completion (poll status)
while true; do
  STATUS=$(bash -c "curl -s \"https://api.apify.com/v2/actor-runs/${RUN_ID}\" --header \"Authorization: Bearer ${APIFY_API_TOKEN}\"" | jq -r '.data.status')
  echo "Status: $STATUS"
  [[ "$STATUS" == "SUCCEEDED" ]] && break
  [[ "$STATUS" == "FAILED" || "$STATUS" == "ABORTED" ]] && exit 1
  sleep 5
done

# Step 3: Fetch the dataset items
bash -c "curl -s \"https://api.apify.com/v2/datasets/${DATASET_ID}/items\" --header \"Authorization: Bearer ${APIFY_API_TOKEN}\""
```

**With pagination:**

```bash
# Replace {datasetId} with an actual ID
bash -c 'curl -s "https://api.apify.com/v2/datasets/{datasetId}/items?limit=100&offset=0" --header "Authorization: Bearer ${APIFY_API_TOKEN}"'
```

### 5. Popular Actors

#### Google Search Scraper

Write to `/tmp/apify_request.json`:

```json
{
  "queries": "web scraping tools",
  "maxPagesPerQuery": 1,
  "resultsPerPage": 10
}
```

Then run:

```bash
bash -c 'curl -s -X POST "https://api.apify.com/v2/acts/apify~google-search-scraper/run-sync-get-dataset-items?timeout=120" --header "Authorization: Bearer ${APIFY_API_TOKEN}" --header "Content-Type: application/json" -d @/tmp/apify_request.json'
```

#### Website Content Crawler

Write to `/tmp/apify_request.json`:

```json
{
  "startUrls": [{"url": "https://docs.example.com"}],
  "maxCrawlPages": 10,
  "crawlerType": "cheerio"
}
```

Then run:

```bash
bash -c 'curl -s -X POST "https://api.apify.com/v2/acts/apify~website-content-crawler/run-sync-get-dataset-items?timeout=300" --header "Authorization: Bearer ${APIFY_API_TOKEN}" --header "Content-Type: application/json" -d @/tmp/apify_request.json'
```

#### Instagram Scraper

Write to `/tmp/apify_request.json`:

```json
{
  "directUrls": ["https://www.instagram.com/apaborotnikov/"],
  "resultsType": "posts",
  "resultsLimit": 10
}
```

Then run:

```bash
bash -c 'curl -s -X POST "https://api.apify.com/v2/acts/apify~instagram-scraper/runs" --header "Authorization: Bearer ${APIFY_API_TOKEN}" --header "Content-Type: application/json" -d @/tmp/apify_request.json'
```

#### Amazon Product Scraper

Write to `/tmp/apify_request.json`:

```json
{
  "categoryOrProductUrls": [{"url": "https://www.amazon.com/dp/B0BSHF7WHW"}],
  "maxItemsPerStartUrl": 1
}
```

Then run:

```bash
bash -c 'curl -s -X POST "https://api.apify.com/v2/acts/junglee~amazon-crawler/runs" --header "Authorization: Bearer ${APIFY_API_TOKEN}" --header "Content-Type: application/json" -d @/tmp/apify_request.json'
```

### 6. List Your Runs

Get recent Actor runs:

```bash
bash -c 'curl -s "https://api.apify.com/v2/actor-runs?limit=10&desc=true" --header "Authorization: Bearer ${APIFY_API_TOKEN}"' | jq '.data.items[] | {id, actId, status, startedAt}'
```

### 7. Abort a Run

⚠️ **Important**: The `{runId}` below is a placeholder. Replace it with the actual run ID. See the complete workflow example below.

Stop a running Actor:

```bash
# Replace {runId} with an actual ID like "HG7ML7M8z78YcAPEB"
bash -c 'curl -s -X POST "https://api.apify.com/v2/actor-runs/{runId}/abort" --header "Authorization: Bearer ${APIFY_API_TOKEN}"'
```

**Complete workflow example** (start a run and abort it):

Write to `/tmp/apify_request.json`:

```json
{
  "startUrls": [{"url": "https://example.com"}],
  "maxPagesPerCrawl": 100
}
```

Then run:

```bash
# Step 1: Start an async run and capture the run ID
RUN_ID=$(bash -c 'curl -s -X POST "https://api.apify.com/v2/acts/apify~web-scraper/runs" --header "Authorization: Bearer ${APIFY_API_TOKEN}" --header "Content-Type: application/json" -d @/tmp/apify_request.json' | jq -r '.data.id')
echo "Started run: $RUN_ID"

# Step 2: Abort the run
bash -c "curl -s -X POST \"https://api.apify.com/v2/actor-runs/${RUN_ID}/abort\" --header \"Authorization: Bearer ${APIFY_API_TOKEN}\""
```

### 8. List Available Actors

Browse public Actors:

```bash
bash -c 'curl -s "https://api.apify.com/v2/store?limit=20&category=ECOMMERCE" --header "Authorization: Bearer ${APIFY_API_TOKEN}"' | jq '.data.items[] | {name, username, title}'
```

## Popular Actors Reference

| Actor ID | Description |
|----------|-------------|
| `apify/web-scraper` | General web scraper |
| `apify/website-content-crawler` | Crawl entire websites |
| `apify/google-search-scraper` | Google search results |
| `apify/instagram-scraper` | Instagram posts/profiles |
| `junglee/amazon-crawler` | Amazon products |
| `apify/twitter-scraper` | Twitter/X posts |
| `apify/youtube-scraper` | YouTube videos |
| `apify/linkedin-scraper` | LinkedIn profiles |
| `lukaskrivka/google-maps` | Google Maps places |

For more Actors, visit https://apify.com/store

## Run Options

| Parameter | Type | Description |
|-----------|------|-------------|
| `timeout` | number | Run timeout in seconds |
| `memory` | number | Memory in MB (128, 256, 512, 1024, 2048, 4096) |
| `maxItems` | number | Max items to return (for sync endpoints) |
| `build` | string | Actor build tag (default: "latest") |
| `waitForFinish` | number | Wait time in seconds (for async runs) |
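These options are passed as query-string parameters on the run endpoints. A minimal sketch of assembling such a URL (the option values here are illustrative, not recommendations):

```shell
# Attach run options to the run endpoint as query-string parameters.
ACTOR="apify~web-scraper"
OPTIONS="timeout=300&memory=1024&waitForFinish=60"  # illustrative values
RUN_URL="https://api.apify.com/v2/acts/${ACTOR}/runs?${OPTIONS}"
echo "$RUN_URL"
```

POST to this URL with the same `bash -c 'curl ...'` pattern shown above; with `waitForFinish` set, the call waits up to that many seconds before returning the run object.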

## Response Format

Run object:

```json
{
  "data": {
    "id": "HG7ML7M8z78YcAPEB",
    "actId": "HDSasDasz78YcAPEB",
    "status": "SUCCEEDED",
    "startedAt": "2024-01-01T00:00:00.000Z",
    "finishedAt": "2024-01-01T00:01:00.000Z",
    "defaultDatasetId": "WkzbQMuFYuamGv3YF",
    "defaultKeyValueStoreId": "tbhFDFDh78YcAPEB"
  }
}
```
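The fields used throughout this skill are `.data.id`, `.data.status`, and `.data.defaultDatasetId`. A minimal jq extraction against a saved copy of a run object (sample values from the object above):

```shell
# Pull the commonly used fields out of a saved run response with jq.
cat > /tmp/apify_run.json <<'EOF'
{"data": {"id": "HG7ML7M8z78YcAPEB", "status": "SUCCEEDED", "defaultDatasetId": "WkzbQMuFYuamGv3YF"}}
EOF
RUN_ID=$(jq -r '.data.id' /tmp/apify_run.json)
STATUS=$(jq -r '.data.status' /tmp/apify_run.json)
DATASET_ID=$(jq -r '.data.defaultDatasetId' /tmp/apify_run.json)
echo "$RUN_ID is $STATUS (dataset $DATASET_ID)"
```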

## Guidelines

1. **Sync vs. async**: Use `run-sync-get-dataset-items` for quick tasks (<5 min); use async runs for longer jobs.
2. **Rate limits**: 250,000 requests/min globally, 400/sec per resource.
3. **Memory**: Higher memory means faster execution but more credits.
4. **Timeouts**: The default varies by Actor; set an explicit timeout for sync calls.
5. **Pagination**: Use `limit` and `offset` for large datasets.
6. **Actor input**: Each Actor has a different input schema - check the Actor's page for details.
7. **Credits**: Check usage at https://console.apify.com/billing
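The pagination guideline amounts to a loop that advances `offset` by `limit` until a page comes back short. A sketch with the API call stubbed out (`fetch_page_count` is hypothetical, standing in for `curl ...items?limit=...&offset=... | jq 'length'`):

```shell
# Offset pagination: advance by LIMIT until a page comes back short.
LIMIT=100
fetch_page_count() {  # stub: pretend the dataset holds 250 items in total
  local remaining=$((250 - $1))
  if (( remaining >= LIMIT )); then echo "$LIMIT"; else echo $(( remaining > 0 ? remaining : 0 )); fi
}
OFFSET=0
TOTAL=0
while true; do
  COUNT=$(fetch_page_count "$OFFSET")  # real version: curl + jq 'length'
  TOTAL=$((TOTAL + COUNT))
  (( COUNT < LIMIT )) && break
  OFFSET=$((OFFSET + LIMIT))
done
echo "Fetched $TOTAL items"
```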