# web-crawler
Use this skill only after the normal `web_fetch` path fails, returns unusable boilerplate, or the user specifically asks for YouTube video content/transcript. Prefer native tools first; paid fallback calls should be deliberate and scoped.
## What each service is for
SerpApi is only for YouTube-related retrieval:

- `engine=youtube` to search YouTube and discover candidate videos.
- `engine=youtube_video` to fetch video metadata, description, chapters, related videos, and the transcript discovery link.
- `engine=youtube_video_transcript` to fetch timestamped transcript segments for AI analysis.
Do not use SerpApi for Google/Bing/general SERP scraping, shopping, maps, news, or any non-YouTube engine. The transparent proxy blocks those engines.
Firecrawl is only a fallback crawler for one web page when ordinary fetching fails. Use `POST /v2/scrape` with a single `url` and focused formats like `markdown`, `html`, `rawHtml`, `links`, `summary`, or constrained `json`/`question`/`highlights` extraction. Do not use Firecrawl crawl/map/search/agent/browser endpoints for this fallback skill. Do not request screenshots, audio, branding, images, or browser actions unless the proxy policy is expanded later.
## Access pattern through transparent proxy
Use Python scripts with `core.http_client.proxied_get` / `proxied_post`; include a typed `SC-CALLER-ID` header for cost tracking.

SerpApi examples:

```python
from core.http_client import proxied_get

headers = {"SC-CALLER-ID": "chat:youtube-transcript"}

search = proxied_get(
    "https://serpapi.com/search.json",
    params={"engine": "youtube", "search_query": "topic keywords"},
    headers=headers,
).json()

video = proxied_get(
    "https://serpapi.com/search.json",
    params={"engine": "youtube_video", "v": "VIDEO_ID"},
    headers=headers,
).json()

transcript = proxied_get(
    "https://serpapi.com/search.json",
    params={"engine": "youtube_video_transcript", "v": "VIDEO_ID", "language_code": "en"},
    headers=headers,
).json()
```

Firecrawl example:
```python
from core.http_client import proxied_post

headers = {"SC-CALLER-ID": "chat:web-crawl-fallback"}

page = proxied_post(
    "https://api.firecrawl.dev/v2/scrape",
    json={
        "url": "https://example.com/article",
        "formats": ["markdown", "links"],
        "onlyMainContent": True,
        "timeout": 60000,
    },
    headers=headers,
).json()
```
## Decision rules
For a YouTube URL, extract the 11-character video id and call `youtube_video_transcript` first if the user's goal is content analysis, summarization, quote extraction, or topic mining. This is usually more useful than fetching the public watch page because it returns timestamped transcript segments directly.

If the transcript call returns empty, unavailable, or the wrong language, call `youtube_video` next to inspect title, description, channel, publication date, and any advertised transcript metadata. Try one obvious `language_code` change only when the desired language is clear (for example `en` after a non-English request gives no transcript). If no transcript exists, summarize from metadata only and say the transcript was not available.

For a YouTube topic query, call `engine=youtube`, choose relevant `video_results`, then call metadata/transcript only for the videos needed. Avoid fetching many transcripts by default.
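The YouTube path above can be sketched as plain decision logic. Everything here is illustrative, not part of the skill's API: `fetch` is a hypothetical stand-in for a proxied SerpApi call, and the `transcript` response key is an assumption.

```python
import re

# Pull the 11-character video id out of common YouTube URL shapes.
_YT_ID = re.compile(r"(?:v=|youtu\.be/|/shorts/|/embed/)([A-Za-z0-9_-]{11})")

def extract_video_id(url: str):
    """Return the 11-character video id, or None if the URL has none."""
    m = _YT_ID.search(url)
    return m.group(1) if m else None

def transcript_or_metadata(video_id, fetch, language_code="en"):
    """Transcript first; fall back to one metadata call when no segments return.

    `fetch(engine, **params)` is a hypothetical stand-in for a proxied
    SerpApi call returning parsed JSON; the "transcript" key is an
    assumed response field.
    """
    tr = fetch("youtube_video_transcript", v=video_id, language_code=language_code)
    segments = tr.get("transcript") or []
    if segments:
        return {"source": "transcript", "segments": segments}
    meta = fetch("youtube_video", v=video_id)  # one metadata call, no paid retries
    return {"source": "metadata", "title": meta.get("title"), "segments": []}
```

The injected `fetch` keeps the paid call count explicit: at most one transcript call plus one metadata call per video.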
For a blocked or JS-heavy web page, call Firecrawl once with `formats: ["markdown", "links"]` and `onlyMainContent: true`. Treat the returned Markdown as the extraction substrate, not as final truth: parse the title, price/value fields, specs, body description, image URLs, outbound links, and obvious contact/location hints from the page structure.

General web-page extraction lessons:
- Many listing/detail pages render important content with JavaScript, image galleries, hidden sections, or repeated UI labels. `web_fetch` may return boilerplate while Firecrawl can still recover the real main content.
- Do not hard-code site-specific labels. Convert page text into a generic structured summary: what it is, where it is, key numbers, evidence snippets, media/links, and caveats.
- Preserve source URLs for images and links when they help verify the page, but do not download or batch-process every media asset unless the user asks.
- If Markdown misses important layout or structured fields, retry once with `rawHtml`; use `json`, `question`, or `highlights` only when the user asked for narrow extraction and the schema/prompt is specific.
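The lessons above can be sketched as one generic reducer over the scraped Markdown. The function name, field names, and heuristics are illustrative assumptions, not part of the Firecrawl response; nothing here is site-specific.

```python
import re

def summarize_page(markdown: str, links: list) -> dict:
    """Reduce Firecrawl markdown to a generic structured summary (heuristic sketch).

    First heading as title, numbers with common units as key figures,
    image URLs kept only for verification (never downloaded here).
    """
    title_m = re.search(r"^#+\s*(.+)$", markdown, flags=re.MULTILINE)
    key_numbers = re.findall(r"\d[\d,.]*\s?(?:USD|EUR|GBP|%|km|kg|m2)", markdown)
    image_urls = re.findall(r"!\[[^\]]*\]\(([^)\s]+)", markdown)
    return {
        "title": title_m.group(1).strip() if title_m else None,
        "key_numbers": key_numbers[:10],   # evidence snippets, capped
        "image_urls": image_urls,          # preserved for verification only
        "outbound_links": links[:20],
    }
```

Because the output shape is generic (what it is, key numbers, media, links), the same reducer works on any page the fallback scrapes.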
## Cost discipline
SerpApi is billed per successful search-like call, regardless of result count. Firecrawl is billed per page plus expensive modifiers. Keep calls tight: one page, one video, or a small shortlist. Never batch-crawl whole websites with this skill.
If the proxy returns 403, the request is outside the allowed use case. Change the approach instead of retrying.
If the proxy returns 429, back off; do not parallelize around the limit.
If the upstream returns a failure, report the exact failure and avoid repeated paid retries unless one parameter change is clearly justified.
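The three status rules can be condensed into one dispatcher. The function name and return values below are illustrative, not an existing API:

```python
def next_action(status: int, retried_once: bool = False) -> str:
    """Map a proxied response status to the cost-discipline action (sketch).

    403: outside the allowed use case, change approach instead of retrying.
    429: back off; never parallelize around the limit.
    Other failures: report them; at most one clearly justified retry.
    """
    if status == 403:
        return "change-approach"   # request is out of policy; retrying won't help
    if status == 429:
        return "back-off"          # respect the rate limit, no parallel workaround
    if 200 <= status < 300:
        return "proceed"
    # upstream failure: surface the exact error; one parameter change at most
    return "report-failure" if retried_once else "retry-once-if-justified"
```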