web-crawler

Web crawler

Use this only after the normal `web_fetch` path fails, returns unusable boilerplate, or the user specifically asks for YouTube video content/transcript. Prefer native tools first; paid fallback calls should be deliberate and scoped.

What each service is for


SerpApi is only for YouTube-related retrieval:
  • `engine=youtube` to search YouTube and discover candidate videos.
  • `engine=youtube_video` to fetch video metadata, description, chapters, related videos, and the transcript discovery link.
  • `engine=youtube_video_transcript` to fetch timestamped transcript segments for AI analysis.
Do not use SerpApi for Google/Bing/general SERP scraping, shopping, maps, news, or any non-YouTube engine. The transparent proxy blocks those engines.
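The engine allowlist above can be enforced with a small pre-flight guard so a disallowed engine fails locally instead of spending a paid call that the proxy will block anyway. This is a sketch, not project code: `check_serpapi_params` is a hypothetical helper name, and the allowed set is taken from the three engines listed above.

```python
# Only the three YouTube engines named above are permitted by the proxy policy.
ALLOWED_SERPAPI_ENGINES = {"youtube", "youtube_video", "youtube_video_transcript"}

def check_serpapi_params(params: dict) -> dict:
    """Raise before spending a paid call on an engine the proxy will block."""
    engine = params.get("engine")
    if engine not in ALLOWED_SERPAPI_ENGINES:
        raise ValueError(f"engine {engine!r} is outside the YouTube-only policy")
    return params
```

Calling this on every params dict before `proxied_get` keeps policy violations cheap and local.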
Firecrawl is only a fallback crawler for one web page when ordinary fetching fails. Use `POST /v2/scrape` with a single `url` and focused formats like `markdown`, `html`, `rawHtml`, `links`, `summary`, or constrained `json`/`question`/`highlights` extraction.
Do not use Firecrawl crawl/map/search/agent/browser endpoints for this fallback skill. Do not request screenshots, audio, branding, images, or browser actions unless the proxy policy is expanded later.

Access pattern through transparent proxy


Use Python scripts with `core.http_client.proxied_get`/`proxied_post`; include a typed `SC-CALLER-ID` header for cost tracking.
SerpApi examples:

```python
from core.http_client import proxied_get

headers = {"SC-CALLER-ID": "chat:youtube-transcript"}

search = proxied_get(
    "https://serpapi.com/search.json",
    params={"engine": "youtube", "search_query": "topic keywords"},
    headers=headers,
).json()

video = proxied_get(
    "https://serpapi.com/search.json",
    params={"engine": "youtube_video", "v": "VIDEO_ID"},
    headers=headers,
).json()

transcript = proxied_get(
    "https://serpapi.com/search.json",
    params={"engine": "youtube_video_transcript", "v": "VIDEO_ID", "language_code": "en"},
    headers=headers,
).json()
```
Firecrawl example:

```python
from core.http_client import proxied_post

headers = {"SC-CALLER-ID": "chat:web-crawl-fallback"}

page = proxied_post(
    "https://api.firecrawl.dev/v2/scrape",
    json={
        "url": "https://example.com/article",
        "formats": ["markdown", "links"],
        "onlyMainContent": True,
        "timeout": 60000,
    },
    headers=headers,
).json()
```

Decision rules


For a YouTube URL, extract the 11-character video id and call `youtube_video_transcript` first if the user's goal is content analysis, summarization, quote extraction, or topic mining. This is usually more useful than fetching the public watch page because it returns timestamped transcript segments directly.
If the transcript call returns empty, unavailable, or the wrong language, call `youtube_video` next to inspect title, description, channel, publication date, and any advertised transcript metadata. Try one obvious `language_code` change only when the desired language is clear (for example, `en` after a non-English request gives no transcript). If no transcript exists, summarize from metadata only and say the transcript was not available.
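The transcript-then-metadata order can be sketched with the fetch function injected, so the same logic works with `proxied_get` from the examples earlier. `fetch_transcript_or_metadata` is a hypothetical helper, and the `"transcript"` response key is an assumption about SerpApi's payload shape, not a confirmed field.

```python
from typing import Callable

def fetch_transcript_or_metadata(
    video_id: str,
    get_json: Callable[[dict], dict],
    language_code: str = "en",
) -> dict:
    """Try the transcript engine first; fall back to one metadata call.

    `get_json` wraps a proxied SerpApi call, e.g.
    lambda p: proxied_get("https://serpapi.com/search.json", params=p,
                          headers=headers).json()
    """
    transcript = get_json({"engine": "youtube_video_transcript",
                           "v": video_id, "language_code": language_code})
    # assumed shape: transcript segments live under a "transcript" key
    if transcript.get("transcript"):
        return {"kind": "transcript", "data": transcript}
    # one metadata call so the summary can still cite title/description/date
    video = get_json({"engine": "youtube_video", "v": video_id})
    return {"kind": "metadata_only", "data": video}
```

The caller can then report honestly: a `metadata_only` result means the transcript was not available.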
For a YouTube topic query, call `engine=youtube`, choose relevant `video_results`, then call metadata/transcript only for the videos needed. Avoid fetching many transcripts by default.
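Choosing a small shortlist before any per-video spending might look like the sketch below. `shortlist_videos` is a hypothetical helper; the `title`/`link` fields on each `video_results` entry are assumptions about the search response shape.

```python
def shortlist_videos(search_response: dict, limit: int = 3) -> list[dict]:
    """Pick a small shortlist from an engine=youtube response before any
    paid per-video metadata or transcript call."""
    results = search_response.get("video_results", [])
    return [
        {"title": r.get("title"), "link": r.get("link")}
        for r in results[:limit]
    ]
```

Keeping `limit` small is the point: each shortlisted video may cost one or two further SerpApi calls.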
For a blocked or JS-heavy web page, call Firecrawl once with `formats:["markdown","links"]` and `onlyMainContent:true`. Treat the returned Markdown as the extraction substrate, not as final truth: parse the title, price/value fields, specs, body description, image URLs, outbound links, and obvious contact/location hints from the page structure.
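Turning one scrape response into that structured summary can be sketched as below. The `data`/`metadata`/`sourceURL` keys are assumptions about the v2 scrape payload matching the formats requested above, and `summarize_scrape` is a hypothetical helper.

```python
def summarize_scrape(response: dict, max_links: int = 10) -> dict:
    """Reduce a Firecrawl v2 scrape response to the fields worth inspecting.

    Assumed shape: the payload nests under "data", with "metadata",
    "markdown", and "links" keys from the requested formats.
    """
    data = response.get("data", {})
    meta = data.get("metadata", {})
    return {
        "title": meta.get("title"),
        "source_url": meta.get("sourceURL"),
        "markdown": data.get("markdown", ""),
        "links": data.get("links", [])[:max_links],
    }
```

Downstream parsing (price fields, specs, contact hints) then works on `markdown` rather than on the raw response.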
General web-page extraction lessons:
  • Many listing/detail pages render important content with JavaScript, image galleries, hidden sections, or repeated UI labels. `web_fetch` may return boilerplate while Firecrawl can still recover the real main content.
  • Do not hard-code site-specific labels. Convert page text into a generic structured summary: what it is, where it is, key numbers, evidence snippets, media/links, and caveats.
  • Preserve source URLs for images and links when they help verify the page, but do not download or batch-process every media asset unless the user asks.
  • If Markdown misses important layout or structured fields, retry once with `rawHtml`; use `json`, `question`, or `highlights` only when the user asked for narrow extraction and the schema/prompt is specific.
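The retry-once rule in the last bullet can be sketched with the POST helper injected, so it composes with `proxied_post` from the example earlier. `scrape_with_retry` is hypothetical, and the empty-Markdown check assumes the same `data`/`markdown` response shape as above.

```python
def scrape_with_retry(post_json, url: str) -> dict:
    """One markdown-first scrape, at most one rawHtml retry, never more.

    `post_json` wraps the proxied call, e.g.
    lambda body: proxied_post("https://api.firecrawl.dev/v2/scrape",
                              json=body, headers=headers).json()
    """
    body = {"url": url, "formats": ["markdown", "links"], "onlyMainContent": True}
    first = post_json(body)
    if first.get("data", {}).get("markdown"):
        return first
    # single paid retry with rawHtml when Markdown came back empty
    return post_json({**body, "formats": ["rawHtml"]})
```

Capping the flow at two paid calls per page keeps the fallback within the cost discipline below.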

Cost discipline


SerpApi is billed per successful search-like call, regardless of result count. Firecrawl is billed per page plus expensive modifiers. Keep calls tight: one page, one video, or a small shortlist. Never batch-crawl whole websites with this skill.
If the proxy returns 403, the request is outside the allowed use case. Change the approach instead of retrying.
If the proxy returns 429, back off; do not parallelize around the limit.
If the upstream returns a failure, report the exact failure and avoid repeated paid retries unless one parameter change is clearly justified.
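The 403/429/upstream-failure rules above can be collapsed into one dispatcher. `handle_proxy_status` is a hypothetical helper, and the exponential backoff schedule is an assumption; the actions themselves follow the rules stated above.

```python
import time

def handle_proxy_status(status: int, attempt: int = 0) -> str:
    """Map a proxy response status to the action this skill should take."""
    if status == 403:
        return "change-approach"   # outside the allowed use case; do not retry
    if status == 429:
        time.sleep(min(2 ** attempt, 30))  # back off; no parallel retries
        return "backoff"
    if status >= 400:
        return "report-failure"    # surface the exact upstream error
    return "ok"
```

The string results are deliberately terminal: only `"backoff"` permits another attempt, and only after the sleep.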