Web Scraper
Overview
Intelligent multi-strategy web scraping. Extracts structured data from web pages (tables, lists, prices). Supports pagination, monitoring, and CSV/JSON export.
When to Use This Skill
- When the user mentions "scraper" or related topics
- When the user mentions "scraping" or related topics
- When the user mentions "extrair dados web" or related topics
- When the user mentions "web scraping" or related topics
- When the user mentions "raspar dados" or related topics
- When the user mentions "coletar dados site" or related topics
Do Not Use This Skill When
- The task is unrelated to web scraping
- A simpler, more specific tool can handle the request
- The user needs general-purpose assistance without domain expertise
How It Works
Execute phases in strict order. Each phase feeds the next.
1. CLARIFY -> 2. RECON -> 3. STRATEGY -> 4. EXTRACT -> 5. TRANSFORM -> 6. VALIDATE -> 7. FORMAT
Never skip Phase 1 or Phase 2. They prevent wasted effort and failed extractions.
Fast path: If user provides URL + clear data target + the request is simple
(single page, one data type), compress Phases 1-3 into a single action:
fetch, classify, and extract in one WebFetch call. Still validate and format.
Capabilities
- Multi-strategy: WebFetch (static), Browser automation (JS-rendered), Bash/curl (APIs), WebSearch (discovery)
- Extraction modes: table, list, article, product, contact, FAQ, pricing, events, jobs, custom
- Output formats: Markdown tables (default), JSON, CSV
- Pagination: auto-detect and follow (page numbers, infinite scroll, load-more)
- Multi-URL: extract same structure across sources with comparison and diff
- Validation: confidence ratings (HIGH/MEDIUM/LOW) on every extraction
- Auto-escalation: WebFetch fails silently -> automatic Browser fallback
- Data transforms: cleaning, normalization, deduplication, enrichment
- Differential mode: detect changes between scraping runs
Web Scraper
Multi-strategy web data extraction with intelligent approach selection,
automatic fallback escalation, data transformation, and structured output.
Phase 1: Clarify
Establish extraction parameters before touching any URL.
Required Parameters
| Parameter | Resolve | Default |
|---|---|---|
| Target URL(s) | Which page(s) to scrape? | (required) |
| Data Target | What specific data to extract? | (required) |
| Output Format | Markdown table, JSON, CSV, or text? | Markdown table |
| Scope | Single page, paginated, or multi-URL? | Single page |
Optional Parameters
| Parameter | Resolve | Default |
|---|---|---|
| Pagination | Follow pagination? Max pages? | No, 1 page |
| Max Items | Maximum number of items to collect? | Unlimited |
| Filters | Data to exclude or include? | None |
| Sort Order | How to sort results? | Source order |
| Save Path | Save to file? Which path? | Display only |
| Language | Respond in which language? | User's lang |
| Diff Mode | Compare with previous run? | No |
Clarification Rules
- If user provides a URL and clear data target, proceed directly to Phase 2. Do NOT ask unnecessary questions.
- If request is ambiguous (e.g. "scrape this site"), ask ONLY: "What specific data do you want me to extract from this page?"
- Default to Markdown table output. Mention alternatives only if relevant.
- Accept requests in any language. Always respond in the user's language.
- If user says "everything" or "all data", perform recon first, then present what's available and let user choose.
Discovery Mode
When user has a topic but no specific URL:
- Use WebSearch to find the most relevant pages
- Present top 3-5 URLs with descriptions
- Let user choose which to scrape, or scrape all
- Proceed to Phase 2 with selected URL(s)
Example: "find and extract pricing data for CRM tools"
-> WebSearch("CRM tools pricing comparison 2026")
-> Present top results -> User selects -> Extract
Phase 2: Reconnaissance
Analyze the target page before extraction.
Step 2.1: Initial Fetch
Use WebFetch to retrieve and analyze the page structure:
WebFetch(
url = TARGET_URL,
prompt = "Analyze this page structure and report:
1. Page type: article, product listing, search results, data table,
directory, dashboard, API docs, FAQ, pricing page, job board, events, or other
2. Main content structure: tables, ordered/unordered lists, card grid, free-form text,
accordion/collapsible sections, tabs
3. Approximate number of distinct data items visible
4. JavaScript rendering indicators: empty containers, loading spinners,
SPA framework markers (React root, Vue app, Angular), minimal HTML with heavy JS
5. Pagination: next/prev links, page numbers, load-more buttons,
infinite scroll indicators, total results count
6. Data density: how much structured, extractable data exists
7. List the main data fields/columns available for extraction
8. Embedded structured data: JSON-LD, microdata, OpenGraph tags
9. Available download links: CSV, Excel, PDF, API endpoints"
)
Step 2.2: Evaluate Fetch Quality
| Signal | Interpretation | Action |
|---|---|---|
| Rich content with data clearly visible | Static page | Strategy A (WebFetch) |
| Empty containers, "loading...", minimal text | JS-rendered | Strategy B (Browser) |
| Login wall, CAPTCHA, 403/401 response | Blocked | Report to user |
| Content present but poorly structured | Needs precision | Strategy B (Browser) |
| JSON or XML response body | API endpoint | Strategy C (Bash/curl) |
| Download links for CSV/Excel available | Direct data file | Strategy C (download) |
Step 2.3: Content Classification
Classify into an extraction mode:
| Mode | Indicators | Examples |
|---|---|---|
| table | HTML table markup with rows and columns | Price comparison, statistics, specs |
| list | Repeated similar elements, card grids | Search results, product listings |
| article | Long-form text with headings/paragraphs | Blog post, news article, docs |
| product | Product name, price, specs, images, rating | E-commerce product page |
| contact | Names, emails, phones, addresses, roles | Team page, staff directory |
| faq | Question-answer pairs, accordions | FAQ page, help center |
| pricing | Plan names, prices, features, tiers | SaaS pricing page |
| events | Dates, locations, titles, descriptions | Event listings, conferences |
| jobs | Titles, companies, locations, salaries | Job boards, career pages |
| custom | User-specified CSS selectors or fields | Anything not matching above |
Record: page type, extraction mode, JS rendering needed (yes/no),
available fields, structured data present (JSON-LD etc.).
If user asked for "everything", present the available fields and let them choose.
Phase 3: Strategy Selection
Choose the extraction approach based on recon results.
Decision Tree
Structured data (JSON-LD, microdata) has what we need?
|
+-- YES --> STRATEGY E: Extract structured data directly
|
+-- NO: Content fully visible in WebFetch?
|
+-- YES: Need precise element targeting?
| |
| +-- NO --> STRATEGY A: WebFetch + AI extraction
| +-- YES --> STRATEGY B: Browser automation
|
+-- NO: JavaScript rendering detected?
|
+-- YES --> STRATEGY B: Browser automation
+-- NO: API/JSON/XML endpoint or download link?
|
+-- YES --> STRATEGY C: Bash (curl + jq)
+-- NO --> Report access issue to user

Strategy A: WebFetch with AI Extraction
Best for: Static pages, articles, simple tables, well-structured HTML.
Use WebFetch with a targeted extraction prompt tailored to the mode:
WebFetch(
url = URL,
prompt = "Extract [DATA_TARGET] from this page.
Return ONLY the extracted data as [FORMAT] with these columns/fields: [FIELDS].
Rules:
- If a value is missing or unclear, use 'N/A'
- Do not include navigation, ads, footers, or unrelated content
- Preserve original values exactly (numbers, currencies, dates)
- Include ALL matching items, not just the first few
- For each item, also extract the URL/link if available"
)
Auto-escalation: If WebFetch returns suspiciously few items (less than
50% of expected from recon), or mostly empty fields, automatically escalate
to Strategy B without asking user. Log the escalation in notes.
Strategy B: Browser Automation
Best for: JS-rendered pages, SPAs, interactive content, lazy-loaded data.
Sequence:
- Get tab context: tabs_context_mcp(createIfEmpty=true) -> get tabId
- Navigate to URL: navigate(url=TARGET_URL, tabId=TAB)
- Wait for content to load: computer(action="wait", duration=3, tabId=TAB)
- Check for cookie/consent banners: find(query="cookie consent or accept button", tabId=TAB); if found, dismiss it (prefer privacy-preserving option)
- Read page structure: read_page(tabId=TAB) or get_page_text(tabId=TAB)
- Locate target elements: find(query="[DESCRIPTION]", tabId=TAB)
- Extract precise data with JavaScript via javascript_tool
```javascript
// Table extraction
const rows = document.querySelectorAll('TABLE_SELECTOR tr');
const data = Array.from(rows).map(row => {
  const cells = row.querySelectorAll('td, th');
  return Array.from(cells).map(c => c.textContent.trim());
});
JSON.stringify(data);
```

```javascript
// List/card extraction
const items = document.querySelectorAll('ITEM_SELECTOR');
const data = Array.from(items).map(item => ({
  field1: item.querySelector('FIELD1_SELECTOR')?.textContent?.trim() || null,
  field2: item.querySelector('FIELD2_SELECTOR')?.textContent?.trim() || null,
  link: item.querySelector('a')?.href || null,
}));
JSON.stringify(data);
```

- For lazy-loaded content, scroll and re-extract: computer(action="scroll", scroll_direction="down", tabId=TAB) then computer(action="wait", duration=2, tabId=TAB)
Strategy C: Bash (curl + jq)
Best for: REST APIs, JSON endpoints, XML feeds, CSV/Excel downloads.
JSON API
```bash
curl -s "API_URL" | jq '[.items[] | {field1: .key1, field2: .key2}]'
```
CSV Download
```bash
curl -s "CSV_URL" -o /tmp/scraped_data.csv
```
XML Parsing
```bash
curl -s "XML_URL" | python3 -c "
import xml.etree.ElementTree as ET, json, sys
tree = ET.parse(sys.stdin)
# ... parse and output JSON
"
```
Strategy D: Hybrid
When a single strategy is insufficient, combine:
- WebSearch to discover relevant URLs
- WebFetch for initial content assessment
- Browser automation for JS-heavy sections
- Bash for post-processing (jq, python for data cleaning)
Strategy E: Structured Data Extraction
When JSON-LD, microdata, or OpenGraph is present:
- Use Browser with javascript_tool to extract structured data:

```javascript
const scripts = document.querySelectorAll('script[type="application/ld+json"]');
const data = Array.from(scripts).map(s => {
  try { return JSON.parse(s.textContent); } catch { return null; }
}).filter(Boolean);
JSON.stringify(data);
```

- This often provides cleaner, more reliable data than DOM scraping
- Fall back to DOM extraction only for fields not in structured data
Pagination Handling
When pagination is detected and user wants multiple pages:
Page-number pagination (any strategy):
- Extract data from current page
- Identify URL pattern (e.g. ?page=N, /page/N, &offset=N)
- Iterate through pages up to user's max (default: 5 pages)
- Show progress: "Extracting page 2/5..."
- Concatenate all results, deduplicate if needed
Infinite scroll (Browser only):
- Extract currently visible data
- Record item count
- Scroll down: computer(action="scroll", scroll_direction="down", tabId=TAB)
- Wait: computer(action="wait", duration=2, tabId=TAB)
- Extract newly loaded data
- Compare count - if no new items after 2 scrolls, stop
- Repeat until no new content or max iterations (default: 5)
"Load More" button (Browser only):
- Extract currently visible data
- Find button: find(query="load more button", tabId=TAB)
- Click it: computer(action="left_click", ref=REF, tabId=TAB)
- Wait and extract new content
- Repeat until button disappears or max iterations reached
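The page-number loop above can be sketched in Python. Here `fetch` is a hypothetical stand-in for one WebFetch or Browser extraction call, and the `?page=N` pattern is assumed to have been identified during recon:

```python
from typing import Callable

def paginate(base_url: str, fetch: Callable[[str], list[dict]],
             max_pages: int = 5) -> list[dict]:
    """Follow ?page=N pagination up to max_pages, stopping early on an empty page."""
    results: list[dict] = []
    for page in range(1, max_pages + 1):
        url = f"{base_url}?page={page}"  # URL pattern identified during recon
        items = fetch(url)               # one extraction call per page
        if not items:                    # empty page: pagination exhausted
            break
        print(f"Extracting page {page}/{max_pages}...")
        results.extend(items)
    return results
```

Deduplication of the concatenated results then happens in Phase 5.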
Phase 4: Extract
Execute the selected strategy using mode-specific patterns.
See references/extraction-patterns.md
for CSS selectors and JavaScript snippets.
Table Mode
WebFetch prompt:
"Extract ALL rows from the table(s) on this page.
Return as a markdown table with exact column headers.
Include every row - do not truncate or summarize.
Preserve numeric precision, currencies, and units."

List Mode
WebFetch prompt:
"Extract each [ITEM_TYPE] from this page.
For each item, extract: [FIELD_LIST].
Return as a JSON array of objects with these keys: [KEY_LIST].
Include ALL items, not just the first few. Include link/URL for each item if available."

Article Mode
WebFetch prompt:
"Extract article metadata:
- title, author, date, tags/categories, word count estimate
- Key factual data points, statistics, and named entities
Return as structured markdown. Summarize the content; do not reproduce full text."

Product Mode
WebFetch prompt:
"Extract product data with these exact fields:
- name, brand, price, currency, originalPrice (if discounted),
availability, description (first 200 chars), rating, reviewCount,
specifications (as key-value pairs), productUrl, imageUrl
Return as JSON. Use null for missing fields."
Also check for JSON-LD Product schema (Strategy E) first.

Contact Mode
WebFetch prompt:
"Extract contact information for each person/entity:
- name, title, role, email, phone, address, organization, website, linkedinUrl
Return as a markdown table. Only extract real contacts visible on the page."

FAQ Mode
WebFetch prompt:
"Extract all question-answer pairs from this page.
For each FAQ item extract:
- question: the exact question text
- answer: the answer text (first 300 chars if long)
- category: the section/category if grouped
Return as a JSON array of objects."

Pricing Mode
WebFetch prompt:
"Extract all pricing plans/tiers from this page.
For each plan extract:
- planName, monthlyPrice, annualPrice, currency
- features (array of included features)
- limitations (array of limits or excluded features)
- ctaText (call-to-action button text)
- highlighted (true if marked as recommended/popular)
Return as JSON. Use null for missing fields."

Events Mode
WebFetch prompt:
"Extract all events/sessions from this page.
For each event extract:
- title, date, time, endTime, location, description (first 200 chars)
- speakers (array of names), category, registrationUrl
Return as JSON. Use null for missing fields."

Jobs Mode
WebFetch prompt:
"Extract all job listings from this page.
For each job extract:
- title, company, location, salary, salaryRange, type (full-time/part-time/contract)
- postedDate, description (first 200 chars), applyUrl, tags
Return as JSON. Use null for missing fields."

Custom Mode
When user provides specific selectors or field descriptions:
- Use Browser automation with javascript_tool and user's CSS selectors
- Or use WebFetch with a prompt built from user's field descriptions
- Always confirm extracted schema with user before proceeding to multi-URL
Multi-URL Extraction
When extracting from multiple URLs:
- Extract from the first URL to establish the data schema
- Show user the first results and confirm the schema is correct
- Extract from remaining URLs using the same schema
- Add a source column/field to every record with the origin URL
- Combine all results into a single output
- Show progress: "Extracting 3/7 URLs..."
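The combine step can be sketched as follows, assuming each URL's extraction has already produced a list of records:

```python
def combine_sources(per_url: dict[str, list[dict]]) -> list[dict]:
    """Merge per-URL extractions into one dataset, tagging each record's origin."""
    combined: list[dict] = []
    for url, records in per_url.items():
        for record in records:
            combined.append({"source": url, **record})  # source field first
    return combined
```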
Phase 5: Transform
Clean, normalize, and enrich extracted data before validation.
See references/data-transforms.md for patterns.
Automatic Transforms (Always Apply)
| Transform | Action |
|---|---|
| Whitespace cleanup | Trim, collapse multiple spaces, remove newlines inside cells |
| HTML entity decode | Convert &amp;, &lt;, &quot; etc. to plain characters |
| Unicode normalization | NFKC normalization for consistent characters |
| Empty string to null | Convert empty or whitespace-only strings to null |
Conditional Transforms (Apply When Relevant)
| Transform | When | Action |
|---|---|---|
| Price normalization | Product/pricing modes | Extract numeric value + currency symbol |
| Date normalization | Any dates found | Normalize to ISO-8601 (YYYY-MM-DD) |
| URL resolution | Relative URLs extracted | Convert to absolute URLs |
| Phone normalization | Contact mode | Standardize to E.164 format if possible |
| Deduplication | Multi-page or multi-URL | Remove exact duplicate rows |
| Sorting | User requested or natural | Sort by user-specified field |
Data Enrichment (Only When Useful)
| Enrichment | When | Action |
|---|---|---|
| Currency conversion | User asks for single currency | Note original + convert (approximate) |
| Domain extraction | URLs in data | Add domain column from full URLs |
| Word count | Article mode | Count words in extracted text |
| Relative dates | Dates present | Add "X days ago" column if useful |
Deduplication Strategy
When combining data from multiple pages or URLs:
- Exact match: rows with identical values in all fields -> keep first
- Near match: rows with same key fields (name+source) but different details -> keep most complete (fewer nulls), flag in notes
- Report: "Removed N duplicate rows" in delivery notes
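The keep-most-complete rule can be sketched like this, where `keys` names the fields that identify a near-duplicate (e.g. name + source):

```python
def deduplicate(rows: list[dict], keys: tuple[str, ...]) -> list[dict]:
    """Collapse exact and near duplicates to one row per key tuple; on a tie
    the first row wins, otherwise the row with the fewest nulls is kept."""
    best: dict[tuple, dict] = {}
    for row in rows:
        k = tuple(row.get(key) for key in keys)
        filled = sum(v is not None for v in row.values())
        if k not in best or filled > sum(v is not None for v in best[k].values()):
            best[k] = row
    return list(best.values())
```

The difference between input and output lengths is the "Removed N duplicate rows" figure for the delivery notes.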
Phase 6: Validate
Verify extraction quality before delivering results.
Validation Checks
| Check | Action |
|---|---|
| Item count | Compare extracted count to expected count from recon |
| Empty fields | Count N/A or null values per field |
| Data type consistency | Numbers should be numeric, dates parseable |
| Duplicates | Flag exact duplicate rows (post-dedup) |
| Encoding | Check for HTML entities, garbled characters |
| Completeness | All user-requested fields present in output |
| Truncation | Verify data wasn't cut off (check last items) |
| Outliers | Flag values that seem anomalous (e.g. $0.00 price) |
Confidence Rating
Assign to every extraction:
| Rating | Criteria |
|---|---|
| HIGH | All fields populated, count matches expected, no anomalies |
| MEDIUM | Minor gaps (<10% empty fields) or count slightly differs |
| LOW | Significant gaps (>10% empty), structural issues, partial data |
Always report confidence with specifics:
Confidence: HIGH - 47 items extracted, all 6 fields populated, matches expected count from page analysis.
Auto-Recovery (Try Before Reporting Issues)
| Issue | Auto-Recovery Action |
|---|---|
| Missing data | Re-attempt with Browser if WebFetch was used |
| Encoding problems | Apply HTML entity decode + unicode normalization |
| Incomplete results | Check for pagination or lazy-loading, fetch more |
| Count mismatch | Scroll/paginate to find remaining items |
| All fields empty | Page likely JS-rendered, switch to Browser strategy |
| Partial fields | Try JSON-LD extraction as supplement |
Log all recovery attempts in delivery notes.
Inform user of any irrecoverable gaps with specific details.
Phase 7: Format and Deliver
Structure results according to user preference.
See references/output-templates.md
for complete formatting templates.
Delivery Envelope
ALWAYS wrap results with this metadata header:
```markdown
Extraction Results
Source: Page Title
Date: YYYY-MM-DD HH:MM UTC
Items: N records (M fields each)
Confidence: HIGH | MEDIUM | LOW
Strategy: A (WebFetch) | B (Browser) | C (API) | E (Structured Data)
Format: Markdown Table | JSON | CSV

[DATA HERE]

Notes:
- [Any gaps, issues, or observations]
- [Transforms applied: deduplication, normalization, etc.]
- [Pages scraped if paginated: "Pages 1-5 of 12"]
- [Auto-escalation if it occurred: "Escalated from WebFetch to Browser"]
```
Markdown Table Rules
- Left-align text columns (:---), right-align numbers (---:)
- Consistent column widths for readability
- Include summary row for numeric data when useful (totals, averages)
- Maximum 10 columns per table; split wider data into multiple tables or suggest JSON format
- Truncate long cell values to 60 chars with ... indicator
- Use N/A for missing values, never leave cells empty
- For multi-page results, show combined table (not per-page)
JSON Rules
- Use camelCase for keys (e.g. productName, unitPrice)
- Wrap in metadata envelope:

```json
{ "metadata": { "source": "URL", "title": "Page Title", "extractedAt": "ISO-8601", "itemCount": 47, "fieldCount": 6, "confidence": "HIGH", "strategy": "A", "transforms": ["deduplication", "priceNormalization"], "notes": [] }, "data": [ ... ] }
```

- Pretty-print with 2-space indentation
- Numbers as numbers (not strings), booleans as booleans
- null for missing values (not empty strings)
CSV Rules
- First row is always headers
- Quote any field containing commas, quotes, or newlines
- UTF-8 encoding with BOM for Excel compatibility
- Use , as delimiter (standard)
- Include metadata as comments: # Source: URL
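These rules combined, as a sketch using Python's standard csv module; the BOM prefix is what lets Excel detect UTF-8:

```python
import csv
import io

def to_csv(rows: list[dict], source_url: str) -> bytes:
    """Render non-empty rows as CSV: metadata comment, header row, minimal
    quoting (commas/quotes/newlines trigger quoting), UTF-8 with BOM."""
    buf = io.StringIO()
    buf.write(f"# Source: {source_url}\r\n")
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]), quoting=csv.QUOTE_MINIMAL)
    writer.writeheader()
    writer.writerows(rows)
    return b"\xef\xbb\xbf" + buf.getvalue().encode("utf-8")
```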
File Output
When user requests file save:
- Markdown: .md extension
- JSON: .json extension
- CSV: .csv extension
- Confirm path before writing
- Report full file path and item count after saving
Multi-URL Comparison Format
When comparing data across multiple sources:
- Add Source as the first column/field
- Use short identifiers for sources (domain name or user label)
- Group by source or interleave based on user preference
- Highlight differences if user asks for comparison
- Include summary: "Best price: $X at store-b.com"
Differential Output
When user requests change detection (diff mode):
- Compare current extraction with previous run
- Mark new items with [NEW]
- Mark removed items with [REMOVED]
- Mark changed values with [WAS: old_value]
- Include summary: "Changes since last run: +5 new, -2 removed, 3 modified"
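A sketch of the diff computation, keyed on a caller-chosen identifier field:

```python
def diff_runs(previous: list[dict], current: list[dict], key: str) -> dict:
    """Compare two runs keyed on `key`; report new, removed, and changed items."""
    prev = {r[key]: r for r in previous}
    curr = {r[key]: r for r in current}
    new = [curr[k] for k in curr.keys() - prev.keys()]
    removed = [prev[k] for k in prev.keys() - curr.keys()]
    changed = [{"key": k, "was": prev[k], "now": curr[k]}
               for k in curr.keys() & prev.keys() if prev[k] != curr[k]]
    return {"new": new, "removed": removed, "changed": changed,
            "summary": f"+{len(new)} new, -{len(removed)} removed, "
                       f"{len(changed)} modified"}
```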
Rate Limiting
- Maximum 1 request per 2 seconds for sequential page fetches
- For multi-URL jobs, process sequentially with pauses
- If a site returns 429 (Too Many Requests), stop and report to user
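The pacing and 429 rules as a Python sketch; `fetch` here is a hypothetical callable returning (status_code, body):

```python
import time

def fetch_politely(urls, fetch, min_interval: float = 2.0) -> list:
    """Fetch sequentially with at least min_interval seconds between requests;
    stop immediately when a site answers 429 Too Many Requests."""
    results, last = [], 0.0
    for url in urls:
        wait = min_interval - (time.monotonic() - last)
        if wait > 0:
            time.sleep(wait)
        last = time.monotonic()
        status, body = fetch(url)
        if status == 429:  # stop and report rather than hammer the site
            raise RuntimeError(f"Rate limited (429) at {url}; stopping.")
        results.append(body)
    return results
```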
Access Respect
- If a page blocks access (403, CAPTCHA, login wall), report to user
- Do NOT attempt to bypass bot detection, CAPTCHAs, or access controls
- Do NOT scrape behind authentication unless user explicitly provides access
- Respect robots.txt directives when known
Copyright
- Do NOT reproduce large blocks of copyrighted article text
- For articles: extract factual data, statistics, and structured info; summarize narrative content
- Always include source attribution (http://example.com) in output
Data Scope
- Extract ONLY what the user explicitly requested
- Warn user before collecting potentially sensitive data at scale (emails, phone numbers, personal information)
- Do not store or transmit extracted data beyond what the user sees
Failure Protocol
When extraction fails or is blocked:
- Explain the specific reason (JS rendering, bot detection, login, etc.)
- Suggest alternatives (different URL, API if available, manual approach)
- Never retry aggressively or escalate access attempts
Quick Reference: Mode Cheat Sheet
| User Says... | Mode | Strategy | Output Default |
|---|---|---|---|
| "extract the table" | table | A or B | Markdown table |
| "get all products/prices" | product | E then A | Markdown table |
| "scrape the listings" | list | A or B | Markdown table |
| "extract contact info / team page" | contact | A | Markdown table |
| "get the article data" | article | A | Markdown text |
| "extract the FAQ" | faq | A or B | JSON |
| "get pricing plans" | pricing | A or B | Markdown table |
| "scrape job listings" | jobs | A or B | Markdown table |
| "get event schedule" | events | A or B | Markdown table |
| "find and extract [topic]" | discovery | WebSearch | Markdown table |
| "compare prices across sites" | multi-URL | A or B | Comparison table |
| "what changed since last time" | diff | any | Diff format |
References
- Extraction patterns: references/extraction-patterns.md - CSS selectors, JavaScript snippets, JSON-LD parsing, domain tips.
- Output templates: references/output-templates.md - Markdown, JSON, CSV templates with complete examples.
- Data transforms: references/data-transforms.md - Cleaning, normalization, deduplication, enrichment patterns.
Best Practices
- Provide clear, specific context about your project and requirements
- Review all suggestions before applying them to production code
- Combine with other complementary skills for comprehensive analysis
Common Pitfalls
- Using this skill for tasks outside its domain expertise
- Applying recommendations without understanding your specific context
- Not providing enough project context for accurate analysis