Web Scraper
Overview
Intelligent multi-strategy web scraping. Extracts structured data from web pages (tables, lists, prices). Supports pagination, monitoring, and CSV/JSON export.
When to Use This Skill
- When the user mentions "scraper" or related topics
- When the user mentions "scraping" or related topics
- When the user mentions "extrair dados web" or related topics
- When the user mentions "web scraping" or related topics
- When the user mentions "raspar dados" or related topics
- When the user mentions "coletar dados site" or related topics
Do Not Use This Skill When
- The task is unrelated to web scraping
- A simpler, more specific tool can handle the request
- The user needs general-purpose assistance without domain expertise
How It Works
Execute phases in strict order. Each phase feeds the next.
1. CLARIFY -> 2. RECON -> 3. STRATEGY -> 4. EXTRACT -> 5. TRANSFORM -> 6. VALIDATE -> 7. FORMAT
Never skip Phase 1 or Phase 2. They prevent wasted effort and failed extractions.
Fast path: If user provides URL + clear data target + the request is simple
(single page, one data type), compress Phases 1-3 into a single action:
fetch, classify, and extract in one WebFetch call. Still validate and format.
Capabilities
- Multi-strategy: WebFetch (static), Browser automation (JS-rendered), Bash/curl (APIs), WebSearch (discovery)
- Extraction modes: table, list, article, product, contact, FAQ, pricing, events, jobs, custom
- Output formats: Markdown tables (default), JSON, CSV
- Pagination: auto-detect and follow (page numbers, infinite scroll, load-more)
- Multi-URL: extract same structure across sources with comparison and diff
- Validation: confidence ratings (HIGH/MEDIUM/LOW) on every extraction
- Auto-escalation: WebFetch fails silently -> automatic Browser fallback
- Data transforms: cleaning, normalization, deduplication, enrichment
- Differential mode: detect changes between scraping runs
Web Scraper
Multi-strategy web data extraction with intelligent approach selection,
automatic fallback escalation, data transformation, and structured output.
Phase 1: Clarify
Establish extraction parameters before touching any URL.
Required Parameters
| Parameter | Resolve | Default |
|---|---|---|
| Target URL(s) | Which page(s) to scrape? | (required) |
| Data Target | What specific data to extract? | (required) |
| Output Format | Markdown table, JSON, CSV, or text? | Markdown table |
| Scope | Single page, paginated, or multi-URL? | Single page |
Optional Parameters
| Parameter | Resolve | Default |
|---|---|---|
| Pagination | Follow pagination? Max pages? | No, 1 page |
| Max Items | Maximum number of items to collect? | Unlimited |
| Filters | Data to exclude or include? | None |
| Sort Order | How to sort results? | Source order |
| Save Path | Save to file? Which path? | Display only |
| Language | Respond in which language? | User's lang |
| Diff Mode | Compare with previous run? | No |
Clarification Rules
- If user provides a URL and clear data target, proceed directly to Phase 2. Do NOT ask unnecessary questions.
- If request is ambiguous (e.g. "scrape this site"), ask ONLY: "What specific data do you want me to extract from this page?"
- Default to Markdown table output. Mention alternatives only if relevant.
- Accept requests in any language. Always respond in the user's language.
- If user says "everything" or "all data", perform recon first, then present what's available and let user choose.
Discovery Mode
When user has a topic but no specific URL:
- Use WebSearch to find the most relevant pages
- Present top 3-5 URLs with descriptions
- Let user choose which to scrape, or scrape all
- Proceed to Phase 2 with selected URL(s)
Example: "find and extract pricing data for CRM tools"
-> WebSearch("CRM tools pricing comparison 2026")
-> Present top results -> User selects -> Extract
Phase 2: Reconnaissance
Analyze the target page before extraction.
Step 2.1: Initial Fetch
Use WebFetch to retrieve and analyze the page structure:
WebFetch(
url = TARGET_URL,
prompt = "Analyze this page structure and report:
1. Page type: article, product listing, search results, data table,
directory, dashboard, API docs, FAQ, pricing page, job board, events, or other
2. Main content structure: tables, ordered/unordered lists, card grid, free-form text,
accordion/collapsible sections, tabs
3. Approximate number of distinct data items visible
4. JavaScript rendering indicators: empty containers, loading spinners,
SPA framework markers (React root, Vue app, Angular), minimal HTML with heavy JS
5. Pagination: next/prev links, page numbers, load-more buttons,
infinite scroll indicators, total results count
6. Data density: how much structured, extractable data exists
7. List the main data fields/columns available for extraction
8. Embedded structured data: JSON-LD, microdata, OpenGraph tags
9. Available download links: CSV, Excel, PDF, API endpoints"
)
Step 2.2: Evaluate Fetch Quality
| Signal | Interpretation | Action |
|---|---|---|
| Rich content with data clearly visible | Static page | Strategy A (WebFetch) |
| Empty containers, "loading...", minimal text | JS-rendered | Strategy B (Browser) |
| Login wall, CAPTCHA, 403/401 response | Blocked | Report to user |
| Content present but poorly structured | Needs precision | Strategy B (Browser) |
| JSON or XML response body | API endpoint | Strategy C (Bash/curl) |
| Download links for CSV/Excel available | Direct data file | Strategy C (download) |
Step 2.3: Content Classification
Classify into an extraction mode:
| Mode | Indicators | Examples |
|---|---|---|
| table | HTML table markup with rows and columns | Price comparison, statistics, specs |
| list | Repeated similar elements, card grids | Search results, product listings |
| article | Long-form text with headings/paragraphs | Blog post, news article, docs |
| product | Product name, price, specs, images, rating | E-commerce product page |
| contact | Names, emails, phones, addresses, roles | Team page, staff directory |
| faq | Question-answer pairs, accordions | FAQ page, help center |
| pricing | Plan names, prices, features, tiers | SaaS pricing page |
| events | Dates, locations, titles, descriptions | Event listings, conferences |
| jobs | Titles, companies, locations, salaries | Job boards, career pages |
| custom | User-specified CSS selectors or fields | Anything not matching above |
Record: page type, extraction mode, JS rendering needed (yes/no),
available fields, structured data present (JSON-LD etc.).
If user asked for "everything", present the available fields and let them choose.
Phase 3: Strategy Selection
Choose the extraction approach based on recon results.
Decision Tree
Structured data (JSON-LD, microdata) has what we need?
|
+-- YES --> STRATEGY E: Extract structured data directly
|
+-- NO: Content fully visible in WebFetch?
|
+-- YES: Need precise element targeting?
| |
| +-- NO --> STRATEGY A: WebFetch + AI extraction
| +-- YES --> STRATEGY B: Browser automation
|
+-- NO: JavaScript rendering detected?
|
+-- YES --> STRATEGY B: Browser automation
+-- NO: API/JSON/XML endpoint or download link?
|
+-- YES --> STRATEGY C: Bash (curl + jq)
+-- NO --> Report access issue to user

Strategy A: WebFetch with AI Extraction
Best for: Static pages, articles, simple tables, well-structured HTML.
Use WebFetch with a targeted extraction prompt tailored to the mode:
WebFetch(
url = URL,
prompt = "Extract [DATA_TARGET] from this page.
Return ONLY the extracted data as [FORMAT] with these columns/fields: [FIELDS].
Rules:
- If a value is missing or unclear, use 'N/A'
- Do not include navigation, ads, footers, or unrelated content
- Preserve original values exactly (numbers, currencies, dates)
- Include ALL matching items, not just the first few
- For each item, also extract the URL/link if available"
)
Auto-escalation: If WebFetch returns suspiciously few items (less than
50% of expected from recon), or mostly empty fields, automatically escalate
to Strategy B without asking user. Log the escalation in notes.
Strategy B: Browser Automation
Best for: JS-rendered pages, SPAs, interactive content, lazy-loaded data.
Sequence:
- Get tab context: tabs_context_mcp(createIfEmpty=true) -> get tabId
- Navigate to URL: navigate(url=TARGET_URL, tabId=TAB)
- Wait for content to load: computer(action="wait", duration=3, tabId=TAB)
- Check for cookie/consent banners: find(query="cookie consent or accept button", tabId=TAB); if found, dismiss it (prefer privacy-preserving option)
- Read page structure: read_page(tabId=TAB) or get_page_text(tabId=TAB)
- Locate target elements: find(query="[DESCRIPTION]", tabId=TAB)
- Extract precise data with JavaScript via javascript_tool
```javascript
// Table extraction
const rows = document.querySelectorAll('TABLE_SELECTOR tr');
const data = Array.from(rows).map(row => {
  const cells = row.querySelectorAll('td, th');
  return Array.from(cells).map(c => c.textContent.trim());
});
JSON.stringify(data);
```

```javascript
// List/card extraction
const items = document.querySelectorAll('ITEM_SELECTOR');
const data = Array.from(items).map(item => ({
  field1: item.querySelector('FIELD1_SELECTOR')?.textContent?.trim() || null,
  field2: item.querySelector('FIELD2_SELECTOR')?.textContent?.trim() || null,
  link: item.querySelector('a')?.href || null,
}));
JSON.stringify(data);
```

- For lazy-loaded content, scroll and re-extract: computer(action="scroll", scroll_direction="down", tabId=TAB) then computer(action="wait", duration=2, tabId=TAB)
Strategy C: Bash (curl + jq)
Best for: REST APIs, JSON endpoints, XML feeds, CSV/Excel downloads.
JSON API
```bash
curl -s "API_URL" | jq '[.items[] | {field1: .key1, field2: .key2}]'
```
CSV Download
```bash
curl -s "CSV_URL" -o /tmp/scraped_data.csv
```
XML Parsing
```bash
curl -s "XML_URL" | python3 -c "
import xml.etree.ElementTree as ET, json, sys
tree = ET.parse(sys.stdin)
# ... parse and output JSON
"
```
Strategy D: Hybrid
When a single strategy is insufficient, combine:
- WebSearch to discover relevant URLs
- WebFetch for initial content assessment
- Browser automation for JS-heavy sections
- Bash for post-processing (jq, python for data cleaning)
Strategy E: Structured Data Extraction
When JSON-LD, microdata, or OpenGraph is present:
- Use Browser with javascript_tool to extract structured data:

```javascript
const scripts = document.querySelectorAll('script[type="application/ld+json"]');
const data = Array.from(scripts).map(s => {
  try { return JSON.parse(s.textContent); } catch { return null; }
}).filter(Boolean);
JSON.stringify(data);
```

- This often provides cleaner, more reliable data than DOM scraping
- Fall back to DOM extraction only for fields not in structured data
Pagination Handling
When pagination is detected and user wants multiple pages:
Page-number pagination (any strategy):
- Extract data from current page
- Identify URL pattern (e.g. ?page=N, /page/N, &offset=N)
- Iterate through pages up to user's max (default: 5 pages)
- Show progress: "Extracting page 2/5..."
- Concatenate all results, deduplicate if needed
Infinite scroll (Browser only):
- Extract currently visible data
- Record item count
- Scroll down: computer(action="scroll", scroll_direction="down", tabId=TAB)
- Wait: computer(action="wait", duration=2, tabId=TAB)
- Extract newly loaded data
- Compare count - if no new items after 2 scrolls, stop
- Repeat until no new content or max iterations (default: 5)
"Load More" button (Browser only):
- Extract currently visible data
- Find button: find(query="load more button", tabId=TAB)
- Click it: computer(action="left_click", ref=REF, tabId=TAB)
- Wait and extract new content
- Repeat until button disappears or max iterations reached
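The page-number loop above can be sketched in Python. Here `fetch` is a hypothetical stand-in for one WebFetch or Browser extraction call, and the `?page=N` pattern is assumed to have been identified during recon:

```python
from typing import Callable

def paginate(base_url: str, fetch: Callable[[str], list[dict]],
             max_pages: int = 5) -> list[dict]:
    """Follow ?page=N pagination up to max_pages, stopping early on an empty page."""
    results: list[dict] = []
    for page in range(1, max_pages + 1):
        url = f"{base_url}?page={page}"  # URL pattern identified during recon
        items = fetch(url)               # one extraction call per page
        if not items:                    # empty page: pagination exhausted
            break
        print(f"Extracting page {page}/{max_pages}...")
        results.extend(items)
    return results
```

Deduplication of the concatenated results then happens in Phase 5.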
Phase 4: Extract
Execute the selected strategy using mode-specific patterns.
See references/extraction-patterns.md
for CSS selectors and JavaScript snippets.
Table Mode
WebFetch prompt:
"Extract ALL rows from the table(s) on this page.
Return as a markdown table with exact column headers.
Include every row - do not truncate or summarize.
Preserve numeric precision, currencies, and units."

List Mode
WebFetch prompt:
"Extract each [ITEM_TYPE] from this page.
For each item, extract: [FIELD_LIST].
Return as a JSON array of objects with these keys: [KEY_LIST].
Include ALL items, not just the first few. Include link/URL for each item if available."

Article Mode
WebFetch prompt:
"Extract article metadata:
- title, author, date, tags/categories, word count estimate
- Key factual data points, statistics, and named entities
Return as structured markdown. Summarize the content; do not reproduce full text."

Product Mode
WebFetch prompt:
"Extract product data with these exact fields:
- name, brand, price, currency, originalPrice (if discounted),
availability, description (first 200 chars), rating, reviewCount,
specifications (as key-value pairs), productUrl, imageUrl
Return as JSON. Use null for missing fields."
Also check for JSON-LD Product schema (Strategy E) first.

Contact Mode
WebFetch prompt:
"Extract contact information for each person/entity:
- name, title, role, email, phone, address, organization, website, linkedinUrl
Return as a markdown table. Only extract real contacts visible on the page."

FAQ Mode
WebFetch prompt:
"Extract all question-answer pairs from this page.
For each FAQ item extract:
- question: the exact question text
- answer: the answer text (first 300 chars if long)
- category: the section/category if grouped
Return as a JSON array of objects."

Pricing Mode
WebFetch prompt:
"Extract all pricing plans/tiers from this page.
For each plan extract:
- planName, monthlyPrice, annualPrice, currency
- features (array of included features)
- limitations (array of limits or excluded features)
- ctaText (call-to-action button text)
- highlighted (true if marked as recommended/popular)
Return as JSON. Use null for missing fields."

Events Mode
WebFetch prompt:
"Extract all events/sessions from this page.
For each event extract:
- title, date, time, endTime, location, description (first 200 chars)
- speakers (array of names), category, registrationUrl
Return as JSON. Use null for missing fields."

Jobs Mode
WebFetch prompt:
"Extract all job listings from this page.
For each job extract:
- title, company, location, salary, salaryRange, type (full-time/part-time/contract)
- postedDate, description (first 200 chars), applyUrl, tags
Return as JSON. Use null for missing fields."

Custom Mode
When user provides specific selectors or field descriptions:
- Use Browser automation with javascript_tool and user's CSS selectors
- Or use WebFetch with a prompt built from user's field descriptions
- Always confirm extracted schema with user before proceeding to multi-URL
Multi-URL Extraction
When extracting from multiple URLs:
- Extract from the first URL to establish the data schema
- Show user the first results and confirm the schema is correct
- Extract from remaining URLs using the same schema
- Add a source column/field to every record with the origin URL
- Combine all results into a single output
- Show progress: "Extracting 3/7 URLs..."
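The combine step can be sketched as follows, assuming each URL's extraction has already produced a list of records:

```python
def combine_sources(per_url: dict[str, list[dict]]) -> list[dict]:
    """Merge per-URL extractions into one dataset, tagging each record's origin."""
    combined: list[dict] = []
    for url, records in per_url.items():
        for record in records:
            combined.append({"source": url, **record})  # source field first
    return combined
```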
Phase 5: Transform
Clean, normalize, and enrich extracted data before validation.
See references/data-transforms.md for patterns.
Automatic Transforms (Always Apply)
| Transform | Action |
|---|---|
| Whitespace cleanup | Trim, collapse multiple spaces, remove newlines inside cells |
| HTML entity decode | Convert &amp;, &lt;, &quot; etc. to plain characters |
| Unicode normalization | NFKC normalization for consistent characters |
| Empty string to null | Convert empty or whitespace-only strings to null |
Conditional Transforms (Apply When Relevant)
| Transform | When | Action |
|---|---|---|
| Price normalization | Product/pricing modes | Extract numeric value + currency symbol |
| Date normalization | Any dates found | Normalize to ISO-8601 (YYYY-MM-DD) |
| URL resolution | Relative URLs extracted | Convert to absolute URLs |
| Phone normalization | Contact mode | Standardize to E.164 format if possible |
| Deduplication | Multi-page or multi-URL | Remove exact duplicate rows |
| Sorting | User requested or natural | Sort by user-specified field |
Data Enrichment (Only When Useful)
| Enrichment | When | Action |
|---|---|---|
| Currency conversion | User asks for single currency | Note original + convert (approximate) |
| Domain extraction | URLs in data | Add domain column from full URLs |
| Word count | Article mode | Count words in extracted text |
| Relative dates | Dates present | Add "X days ago" column if useful |
Deduplication Strategy
When combining data from multiple pages or URLs:
- Exact match: rows with identical values in all fields -> keep first
- Near match: rows with same key fields (name+source) but different details -> keep most complete (fewer nulls), flag in notes
- Report: "Removed N duplicate rows" in delivery notes
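The keep-most-complete rule can be sketched like this, where `keys` names the fields that identify a near-duplicate (e.g. name + source):

```python
def deduplicate(rows: list[dict], keys: tuple[str, ...]) -> list[dict]:
    """Collapse exact and near duplicates to one row per key tuple; on a tie
    the first row wins, otherwise the row with the fewest nulls is kept."""
    best: dict[tuple, dict] = {}
    for row in rows:
        k = tuple(row.get(key) for key in keys)
        filled = sum(v is not None for v in row.values())
        if k not in best or filled > sum(v is not None for v in best[k].values()):
            best[k] = row
    return list(best.values())
```

The difference between input and output lengths is the "Removed N duplicate rows" figure for the delivery notes.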
Phase 6: Validate
Verify extraction quality before delivering results.
Validation Checks
| Check | Action |
|---|---|
| Item count | Compare extracted count to expected count from recon |
| Empty fields | Count N/A or null values per field |
| Data type consistency | Numbers should be numeric, dates parseable |
| Duplicates | Flag exact duplicate rows (post-dedup) |
| Encoding | Check for HTML entities, garbled characters |
| Completeness | All user-requested fields present in output |
| Truncation | Verify data wasn't cut off (check last items) |
| Outliers | Flag values that seem anomalous (e.g. $0.00 price) |
Confidence Rating
Assign to every extraction:
| Rating | Criteria |
|---|---|
| HIGH | All fields populated, count matches expected, no anomalies |
| MEDIUM | Minor gaps (<10% empty fields) or count slightly differs |
| LOW | Significant gaps (>10% empty), structural issues, partial data |
Always report confidence with specifics:
Confidence: HIGH - 47 items extracted, all 6 fields populated, matches expected count from page analysis.
Auto-Recovery (Try Before Reporting Issues)
| Issue | Auto-Recovery Action |
|---|---|
| Missing data | Re-attempt with Browser if WebFetch was used |
| Encoding problems | Apply HTML entity decode + unicode normalization |
| Incomplete results | Check for pagination or lazy-loading, fetch more |
| Count mismatch | Scroll/paginate to find remaining items |
| All fields empty | Page likely JS-rendered, switch to Browser strategy |
| Partial fields | Try JSON-LD extraction as supplement |
Log all recovery attempts in delivery notes.
Inform user of any irrecoverable gaps with specific details.
Phase 7: Format and Deliver
Structure results according to user preference.
See references/output-templates.md
for complete formatting templates.
Delivery Envelope
ALWAYS wrap results with this metadata header:
```markdown
Extraction Results
Source: Page Title
Date: YYYY-MM-DD HH:MM UTC
Items: N records (M fields each)
Confidence: HIGH | MEDIUM | LOW
Strategy: A (WebFetch) | B (Browser) | C (API) | E (Structured Data)
Format: Markdown Table | JSON | CSV

[DATA HERE]

Notes:
- [Any gaps, issues, or observations]
- [Transforms applied: deduplication, normalization, etc.]
- [Pages scraped if paginated: "Pages 1-5 of 12"]
- [Auto-escalation if it occurred: "Escalated from WebFetch to Browser"]
```
Markdown Table Rules
- Left-align text columns (:---), right-align numbers (---:)
- Consistent column widths for readability
- Include summary row for numeric data when useful (totals, averages)
- Maximum 10 columns per table; split wider data into multiple tables or suggest JSON format
- Truncate long cell values to 60 chars with ... indicator
- Use N/A for missing values, never leave cells empty
- For multi-page results, show combined table (not per-page)
JSON Rules
- Use camelCase for keys (e.g. productName, unitPrice)
- Wrap in metadata envelope:

```json
{ "metadata": { "source": "URL", "title": "Page Title", "extractedAt": "ISO-8601", "itemCount": 47, "fieldCount": 6, "confidence": "HIGH", "strategy": "A", "transforms": ["deduplication", "priceNormalization"], "notes": [] }, "data": [ ... ] }
```

- Pretty-print with 2-space indentation
- Numbers as numbers (not strings), booleans as booleans
- null for missing values (not empty strings)
CSV Rules
- First row is always headers
- Quote any field containing commas, quotes, or newlines
- UTF-8 encoding with BOM for Excel compatibility
- Use , as delimiter (standard)
- Include metadata as comments: # Source: URL
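These rules combined, as a sketch using Python's standard csv module; the BOM prefix is what lets Excel detect UTF-8:

```python
import csv
import io

def to_csv(rows: list[dict], source_url: str) -> bytes:
    """Render non-empty rows as CSV: metadata comment, header row, minimal
    quoting (commas/quotes/newlines trigger quoting), UTF-8 with BOM."""
    buf = io.StringIO()
    buf.write(f"# Source: {source_url}\r\n")
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]), quoting=csv.QUOTE_MINIMAL)
    writer.writeheader()
    writer.writerows(rows)
    return b"\xef\xbb\xbf" + buf.getvalue().encode("utf-8")
```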
File Output
When user requests file save:
- Markdown: .md extension
- JSON: .json extension
- CSV: .csv extension
- Confirm path before writing
- Report full file path and item count after saving
Multi-URL Comparison Format
When comparing data across multiple sources:
- Add Source as the first column/field
- Use short identifiers for sources (domain name or user label)
- Group by source or interleave based on user preference
- Highlight differences if user asks for comparison
- Include summary: "Best price: $X at store-b.com"
Differential Output
When user requests change detection (diff mode):
- Compare current extraction with previous run
- Mark new items with [NEW]
- Mark removed items with [REMOVED]
- Mark changed values with [WAS: old_value]
- Include summary: "Changes since last run: +5 new, -2 removed, 3 modified"
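A sketch of the diff computation, keyed on a caller-chosen identifier field:

```python
def diff_runs(previous: list[dict], current: list[dict], key: str) -> dict:
    """Compare two runs keyed on `key`; report new, removed, and changed items."""
    prev = {r[key]: r for r in previous}
    curr = {r[key]: r for r in current}
    new = [curr[k] for k in curr.keys() - prev.keys()]
    removed = [prev[k] for k in prev.keys() - curr.keys()]
    changed = [{"key": k, "was": prev[k], "now": curr[k]}
               for k in curr.keys() & prev.keys() if prev[k] != curr[k]]
    return {"new": new, "removed": removed, "changed": changed,
            "summary": f"+{len(new)} new, -{len(removed)} removed, "
                       f"{len(changed)} modified"}
```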
Rate Limiting
- Maximum 1 request per 2 seconds for sequential page fetches
- For multi-URL jobs, process sequentially with pauses
- If a site returns 429 (Too Many Requests), stop and report to user
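The pacing and 429 rules as a Python sketch; `fetch` here is a hypothetical callable returning (status_code, body):

```python
import time

def fetch_politely(urls, fetch, min_interval: float = 2.0) -> list:
    """Fetch sequentially with at least min_interval seconds between requests;
    stop immediately when a site answers 429 Too Many Requests."""
    results, last = [], 0.0
    for url in urls:
        wait = min_interval - (time.monotonic() - last)
        if wait > 0:
            time.sleep(wait)
        last = time.monotonic()
        status, body = fetch(url)
        if status == 429:  # stop and report rather than hammer the site
            raise RuntimeError(f"Rate limited (429) at {url}; stopping.")
        results.append(body)
    return results
```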
Access Respect
- If a page blocks access (403, CAPTCHA, login wall), report to user
- Do NOT attempt to bypass bot detection, CAPTCHAs, or access controls
- Do NOT scrape behind authentication unless user explicitly provides access
- Respect robots.txt directives when known
Copyright
- Do NOT reproduce large blocks of copyrighted article text
- For articles: extract factual data, statistics, and structured info; summarize narrative content
- Always include source attribution (http://example.com) in output
Data Scope
- Extract ONLY what the user explicitly requested
- Warn user before collecting potentially sensitive data at scale (emails, phone numbers, personal information)
- Do not store or transmit extracted data beyond what the user sees
Failure Protocol
When extraction fails or is blocked:
- Explain the specific reason (JS rendering, bot detection, login, etc.)
- Suggest alternatives (different URL, API if available, manual approach)
- Never retry aggressively or escalate access attempts
Quick Reference: Mode Cheat Sheet
| User Says... | Mode | Strategy | Output Default |
|---|---|---|---|
| "extract the table" | table | A or B | Markdown table |
| "get all products/prices" | product | E then A | Markdown table |
| "scrape the listings" | list | A or B | Markdown table |
| "extract contact info / team page" | contact | A | Markdown table |
| "get the article data" | article | A | Markdown text |
| "extract the FAQ" | faq | A or B | JSON |
| "get pricing plans" | pricing | A or B | Markdown table |
| "scrape job listings" | jobs | A or B | Markdown table |
| "get event schedule" | events | A or B | Markdown table |
| "find and extract [topic]" | discovery | WebSearch | Markdown table |
| "compare prices across sites" | multi-URL | A or B | Comparison table |
| "what changed since last time" | diff | any | Diff format |
References
- Extraction patterns: references/extraction-patterns.md - CSS selectors, JavaScript snippets, JSON-LD parsing, domain tips.
- Output templates: references/output-templates.md - Markdown, JSON, CSV templates with complete examples.
- Data transforms: references/data-transforms.md - Cleaning, normalization, deduplication, enrichment patterns.
Best Practices
- Provide clear, specific context about your project and requirements
- Review all suggestions before applying them to production code
- Combine with other complementary skills for comprehensive analysis
Common Pitfalls
- Using this skill for tasks outside its domain expertise
- Applying recommendations without understanding your specific context
- Not providing enough project context for accurate analysis