web-crawler

Web crawler

Use this only after the normal `web_fetch` path fails, returns unusable boilerplate, or the user specifically asks for YouTube video content/transcript. Prefer native tools first; paid fallback calls should be deliberate and scoped.

What each service is for


SerpApi is only for YouTube-related retrieval:
  • `engine=youtube` to search YouTube and discover candidate videos.
  • `engine=youtube_video` to fetch video metadata, description, chapters, related videos, and the transcript discovery link.
  • `engine=youtube_video_transcript` to fetch timestamped transcript segments for AI analysis.
Do not use SerpApi for Google/Bing/general SERP scraping, shopping, maps, news, or any non-YouTube engine. The transparent proxy blocks those engines.
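The engine allowlist above can be enforced with a small pre-flight guard so a disallowed engine fails locally instead of spending a paid call that the proxy will block anyway. This is a sketch, not project code: `check_serpapi_params` is a hypothetical helper name, and the allowed set is taken from the three engines listed above.

```python
# Only the three YouTube engines named above are permitted by the proxy policy.
ALLOWED_SERPAPI_ENGINES = {"youtube", "youtube_video", "youtube_video_transcript"}

def check_serpapi_params(params: dict) -> dict:
    """Raise before spending a paid call on an engine the proxy will block."""
    engine = params.get("engine")
    if engine not in ALLOWED_SERPAPI_ENGINES:
        raise ValueError(f"engine {engine!r} is outside the YouTube-only policy")
    return params
```

Calling this on every params dict before `proxied_get` keeps policy violations cheap and local.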
Firecrawl is only a fallback crawler for one web page when ordinary fetching fails. Use `POST /v2/scrape` with a single `url` and focused formats like `markdown`, `html`, `rawHtml`, `links`, `summary`, or constrained `json`/`question`/`highlights` extraction.
Do not use Firecrawl crawl/map/search/agent/browser endpoints for this fallback skill. Do not request screenshots, audio, branding, images, or browser actions unless the proxy policy is expanded later.

Access pattern through transparent proxy


Use Python scripts with `core.http_client.proxied_get`/`proxied_post`; include a typed `SC-CALLER-ID` header for cost tracking.
SerpApi examples:

```python
from core.http_client import proxied_get

headers = {"SC-CALLER-ID": "chat:youtube-transcript"}

search = proxied_get(
    "https://serpapi.com/search.json",
    params={"engine": "youtube", "search_query": "topic keywords"},
    headers=headers,
).json()

video = proxied_get(
    "https://serpapi.com/search.json",
    params={"engine": "youtube_video", "v": "VIDEO_ID"},
    headers=headers,
).json()

transcript = proxied_get(
    "https://serpapi.com/search.json",
    params={"engine": "youtube_video_transcript", "v": "VIDEO_ID", "language_code": "en"},
    headers=headers,
).json()
```
Firecrawl example:

```python
from core.http_client import proxied_post

headers = {"SC-CALLER-ID": "chat:web-crawl-fallback"}

page = proxied_post(
    "https://api.firecrawl.dev/v2/scrape",
    json={
        "url": "https://example.com/article",
        "formats": ["markdown", "links"],
        "onlyMainContent": True,
        "timeout": 60000,
    },
    headers=headers,
).json()
```

Decision rules


For a YouTube URL, extract the 11-character video id and call `youtube_video_transcript` first if the user's goal is content analysis, summarization, quote extraction, or topic mining. This is usually more useful than fetching the public watch page because it returns timestamped transcript segments directly.
If the transcript call returns empty, unavailable, or the wrong language, call `youtube_video` next to inspect title, description, channel, publication date, and any advertised transcript metadata. Try one obvious `language_code` change only when the desired language is clear (for example, `en` after a non-English request gives no transcript). If no transcript exists, summarize from metadata only and say the transcript was not available.
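The transcript-then-metadata order can be sketched with the fetch function injected, so the same logic works with `proxied_get` from the examples earlier. `fetch_transcript_or_metadata` is a hypothetical helper, and the `"transcript"` response key is an assumption about SerpApi's payload shape, not a confirmed field.

```python
from typing import Callable

def fetch_transcript_or_metadata(
    video_id: str,
    get_json: Callable[[dict], dict],
    language_code: str = "en",
) -> dict:
    """Try the transcript engine first; fall back to one metadata call.

    `get_json` wraps a proxied SerpApi call, e.g.
    lambda p: proxied_get("https://serpapi.com/search.json", params=p,
                          headers=headers).json()
    """
    transcript = get_json({"engine": "youtube_video_transcript",
                           "v": video_id, "language_code": language_code})
    # assumed shape: transcript segments live under a "transcript" key
    if transcript.get("transcript"):
        return {"kind": "transcript", "data": transcript}
    # one metadata call so the summary can still cite title/description/date
    video = get_json({"engine": "youtube_video", "v": video_id})
    return {"kind": "metadata_only", "data": video}
```

The caller can then report honestly: a `metadata_only` result means the transcript was not available.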
For a YouTube topic query, call `engine=youtube`, choose relevant `video_results`, then call metadata/transcript only for the videos needed. Avoid fetching many transcripts by default.
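Choosing a small shortlist before any per-video spending might look like the sketch below. `shortlist_videos` is a hypothetical helper; the `title`/`link` fields on each `video_results` entry are assumptions about the search response shape.

```python
def shortlist_videos(search_response: dict, limit: int = 3) -> list[dict]:
    """Pick a small shortlist from an engine=youtube response before any
    paid per-video metadata or transcript call."""
    results = search_response.get("video_results", [])
    return [
        {"title": r.get("title"), "link": r.get("link")}
        for r in results[:limit]
    ]
```

Keeping `limit` small is the point: each shortlisted video may cost one or two further SerpApi calls.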
For a blocked or JS-heavy web page, call Firecrawl once with `formats:["markdown","links"]` and `onlyMainContent:true`. Treat the returned Markdown as the extraction substrate, not as final truth: parse the title, price/value fields, specs, body description, image URLs, outbound links, and obvious contact/location hints from the page structure.
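Turning one scrape response into that structured summary can be sketched as below. The `data`/`metadata`/`sourceURL` keys are assumptions about the v2 scrape payload matching the formats requested above, and `summarize_scrape` is a hypothetical helper.

```python
def summarize_scrape(response: dict, max_links: int = 10) -> dict:
    """Reduce a Firecrawl v2 scrape response to the fields worth inspecting.

    Assumed shape: the payload nests under "data", with "metadata",
    "markdown", and "links" keys from the requested formats.
    """
    data = response.get("data", {})
    meta = data.get("metadata", {})
    return {
        "title": meta.get("title"),
        "source_url": meta.get("sourceURL"),
        "markdown": data.get("markdown", ""),
        "links": data.get("links", [])[:max_links],
    }
```

Downstream parsing (price fields, specs, contact hints) then works on `markdown` rather than on the raw response.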
General web-page extraction lessons:
  • Many listing/detail pages render important content with JavaScript, image galleries, hidden sections, or repeated UI labels. `web_fetch` may return boilerplate while Firecrawl can still recover the real main content.
  • Do not hard-code site-specific labels. Convert page text into a generic structured summary: what it is, where it is, key numbers, evidence snippets, media/links, and caveats.
  • Preserve source URLs for images and links when they help verify the page, but do not download or batch-process every media asset unless the user asks.
  • If Markdown misses important layout or structured fields, retry once with `rawHtml`; use `json`, `question`, or `highlights` only when the user asked for narrow extraction and the schema/prompt is specific.
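The retry-once rule in the last bullet can be sketched with the POST helper injected, so it composes with `proxied_post` from the example earlier. `scrape_with_retry` is hypothetical, and the empty-Markdown check assumes the same `data`/`markdown` response shape as above.

```python
def scrape_with_retry(post_json, url: str) -> dict:
    """One markdown-first scrape, at most one rawHtml retry, never more.

    `post_json` wraps the proxied call, e.g.
    lambda body: proxied_post("https://api.firecrawl.dev/v2/scrape",
                              json=body, headers=headers).json()
    """
    body = {"url": url, "formats": ["markdown", "links"], "onlyMainContent": True}
    first = post_json(body)
    if first.get("data", {}).get("markdown"):
        return first
    # single paid retry with rawHtml when Markdown came back empty
    return post_json({**body, "formats": ["rawHtml"]})
```

Capping the flow at two paid calls per page keeps the fallback within the cost discipline below.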

Cost discipline


SerpApi is billed per successful search-like call, regardless of result count. Firecrawl is billed per page plus expensive modifiers. Keep calls tight: one page, one video, or a small shortlist. Never batch-crawl whole websites with this skill.
If the proxy returns 403, the request is outside the allowed use case. Change the approach instead of retrying.
If the proxy returns 429, back off; do not parallelize around the limit.
If the upstream returns a failure, report the exact failure and avoid repeated paid retries unless one parameter change is clearly justified.
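The 403/429/upstream-failure rules above can be collapsed into one dispatcher. `handle_proxy_status` is a hypothetical helper, and the exponential backoff schedule is an assumption; the actions themselves follow the rules stated above.

```python
import time

def handle_proxy_status(status: int, attempt: int = 0) -> str:
    """Map a proxy response status to the action this skill should take."""
    if status == 403:
        return "change-approach"   # outside the allowed use case; do not retry
    if status == 429:
        time.sleep(min(2 ** attempt, 30))  # back off; no parallel retries
        return "backoff"
    if status >= 400:
        return "report-failure"    # surface the exact upstream error
    return "ok"
```

The string results are deliberately terminal: only `"backoff"` permits another attempt, and only after the sleep.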