web-scraper


Web Scraper

网页抓取工具

Overview

概述

Intelligent multi-strategy web scraping. Extracts structured data from web pages (tables, lists, prices). Supports pagination, monitoring, and CSV/JSON export.
多策略智能网页抓取工具。可从网页中提取结构化数据(表格、列表、价格),支持分页、监控以及导出为CSV/JSON格式。

When to Use This Skill

何时使用该工具

  • When the user mentions "scraper" or related topics
  • When the user mentions "scraping" or related topics
  • When the user mentions "extrair dados web" or related topics
  • When the user mentions "web scraping" or related topics
  • When the user mentions "raspar dados" or related topics
  • When the user mentions "coletar dados site" or related topics
  • 当用户提及“scraper”或相关话题时
  • 当用户提及“scraping”或相关话题时
  • 当用户提及“extrair dados web”(网页数据提取)或相关话题时
  • 当用户提及“web scraping”或相关话题时
  • 当用户提及“raspar dados”(数据抓取)或相关话题时
  • 当用户提及“coletar dados site”(网站数据收集)或相关话题时

Do Not Use This Skill When

何时不使用该工具

  • The task is unrelated to web scraping
  • A simpler, more specific tool can handle the request
  • The user needs general-purpose assistance without domain expertise
  • 任务与网页抓取无关时
  • 更简单、更针对性的工具可处理请求时
  • 用户需要无领域限制的通用协助时

How It Works

工作流程

Execute phases in strict order. Each phase feeds the next.
1. CLARIFY  ->  2. RECON  ->  3. STRATEGY  ->  4. EXTRACT  ->  5. TRANSFORM  ->  6. VALIDATE  ->  7. FORMAT
Never skip Phase 1 or Phase 2. They prevent wasted effort and failed extractions.
Fast path: If user provides URL + clear data target + the request is simple (single page, one data type), compress Phases 1-3 into a single action: fetch, classify, and extract in one WebFetch call. Still validate and format.

严格按顺序执行各个阶段,每个阶段的输出作为下一阶段的输入。
1. 明确需求  ->  2. 站点侦察  ->  3. 策略选择  ->  4. 数据提取  ->  5. 数据转换  ->  6. 验证  ->  7. 格式化输出
绝不能跳过阶段1或阶段2,它们可避免无效工作和提取失败。
快速路径:如果用户提供了URL + 明确的数据目标 + 请求简单(单页面、单一数据类型),可将阶段1-3合并为一个操作:通过一次WebFetch调用完成抓取、分类和提取,但仍需执行验证和格式化步骤。

Capabilities

功能特性

  • Multi-strategy: WebFetch (static), Browser automation (JS-rendered), Bash/curl (APIs), WebSearch (discovery)
  • Extraction modes: table, list, article, product, contact, FAQ, pricing, events, jobs, custom
  • Output formats: Markdown tables (default), JSON, CSV
  • Pagination: auto-detect and follow (page numbers, infinite scroll, load-more)
  • Multi-URL: extract same structure across sources with comparison and diff
  • Validation: confidence ratings (HIGH/MEDIUM/LOW) on every extraction
  • Auto-escalation: WebFetch fails silently -> automatic Browser fallback
  • Data transforms: cleaning, normalization, deduplication, enrichment
  • Differential mode: detect changes between scraping runs
  • 多策略支持:WebFetch(静态页面)、浏览器自动化(JS渲染页面)、Bash/curl(API接口)、WebSearch(数据发现)
  • 提取模式:表格、列表、文章、产品、联系人、FAQ、价格、活动、职位、自定义
  • 输出格式:Markdown表格(默认)、JSON、CSV
  • 分页处理:自动检测并跟进分页(页码、无限滚动、加载更多按钮)
  • 多URL提取:跨源提取相同结构的数据,并支持对比和差异分析
  • 验证机制:为每次提取结果提供置信度评级(高/中/低)
  • 自动降级:WebFetch静默失败后,自动切换为浏览器抓取策略
  • 数据转换:数据清洗、标准化、去重、补充
  • 差异模式:检测不同抓取运行之间的数据变化

Web Scraper

网页抓取工具

Multi-strategy web data extraction with intelligent approach selection, automatic fallback escalation, data transformation, and structured output.
具备智能策略选择、自动降级、数据转换和结构化输出功能的多策略网页数据提取工具。

Phase 1: Clarify

阶段1:明确需求

Establish extraction parameters before touching any URL.
在访问任何URL前,先确定提取参数。

Required Parameters

必填参数

| Parameter | Resolve | Default |
| :--- | :--- | :--- |
| Target URL(s) | Which page(s) to scrape? | (required) |
| Data Target | What specific data to extract? | (required) |
| Output Format | Markdown table, JSON, CSV, or text? | Markdown table |
| Scope | Single page, paginated, or multi-URL? | Single page |

| 参数 | 需确认内容 | 默认值 |
| :--- | :--- | :--- |
| 目标URL | 需要抓取哪些页面? | (必填) |
| 数据目标 | 需要提取哪些具体数据? | (必填) |
| 输出格式 | Markdown表格、JSON、CSV或文本? | Markdown表格 |
| 范围 | 单页面、分页或多URL? | 单页面 |

Optional Parameters

可选参数

| Parameter | Resolve | Default |
| :--- | :--- | :--- |
| Pagination | Follow pagination? Max pages? | No, 1 page |
| Max Items | Maximum number of items to collect? | Unlimited |
| Filters | Data to exclude or include? | None |
| Sort Order | How to sort results? | Source order |
| Save Path | Save to file? Which path? | Display only |
| Language | Respond in which language? | User's lang |
| Diff Mode | Compare with previous run? | No |

| 参数 | 需确认内容 | 默认值 |
| :--- | :--- | :--- |
| 分页设置 | 是否跟进分页?最大页数? | 否,仅1页 |
| 最大条目数 | 最多收集多少条数据? | 无限制 |
| 过滤规则 | 需要排除或包含哪些数据? | 无 |
| 排序方式 | 如何对结果排序? | 按源页面顺序 |
| 保存路径 | 是否保存到文件?路径是哪里? | 仅显示 |
| 响应语言 | 用哪种语言回复? | 用户使用的语言 |
| 差异模式 | 是否与上次抓取结果对比? | 否 |

Clarification Rules

明确需求规则

  • If user provides a URL and clear data target, proceed directly to Phase 2. Do NOT ask unnecessary questions.
  • If request is ambiguous (e.g. "scrape this site"), ask ONLY: "What specific data do you want me to extract from this page?"
  • Default to Markdown table output. Mention alternatives only if relevant.
  • Accept requests in any language. Always respond in the user's language.
  • If user says "everything" or "all data", perform recon first, then present what's available and let user choose.
  • 如果用户提供了URL和明确的数据目标,直接进入阶段2,不要询问不必要的问题。
  • 如果请求模糊(例如“抓取这个网站”),仅需询问:“你想从该页面提取哪些具体数据?”
  • 默认输出为Markdown表格,仅在相关时提及其他可选格式。
  • 接受任何语言的请求,始终用用户使用的语言回复。
  • 如果用户说“所有内容”或“全部数据”,先执行站点侦察,然后展示可提取的内容,让用户选择。

Discovery Mode

发现模式

When user has a topic but no specific URL:
  1. Use WebSearch to find the most relevant pages
  2. Present top 3-5 URLs with descriptions
  3. Let user choose which to scrape, or scrape all
  4. Proceed to Phase 2 with selected URL(s)
Example: "find and extract pricing data for CRM tools" -> WebSearch("CRM tools pricing comparison 2026") -> Present top results -> User selects -> Extract

当用户有主题但无具体URL时:
  1. 使用WebSearch查找最相关的页面
  2. 展示前3-5个带描述的URL
  3. 让用户选择要抓取的页面,或全部抓取
  4. 使用选定的URL进入阶段2
示例:“查找并提取CRM工具的价格数据” -> WebSearch("CRM tools pricing comparison 2026") -> 展示顶部搜索结果 -> 用户选择 -> 提取数据

Phase 2: Reconnaissance

阶段2:站点侦察

Analyze the target page before extraction.
在提取数据前先分析目标页面。

Step 2.1: Initial Fetch

步骤2.1:初始抓取

Use WebFetch to retrieve and analyze the page structure:
WebFetch(
  url = TARGET_URL,
  prompt = "Analyze this page structure and report:
    1. Page type: article, product listing, search results, data table,
       directory, dashboard, API docs, FAQ, pricing page, job board, events, or other
    2. Main content structure: tables, ordered/unordered lists, card grid, free-form text,
       accordion/collapsible sections, tabs
    3. Approximate number of distinct data items visible
    4. JavaScript rendering indicators: empty containers, loading spinners,
       SPA framework markers (React root, Vue app, Angular), minimal HTML with heavy JS
    5. Pagination: next/prev links, page numbers, load-more buttons,
       infinite scroll indicators, total results count
    6. Data density: how much structured, extractable data exists
    7. List the main data fields/columns available for extraction
    8. Embedded structured data: JSON-LD, microdata, OpenGraph tags
    9. Available download links: CSV, Excel, PDF, API endpoints"
)
使用WebFetch获取并分析页面结构:
WebFetch(
  url = 目标URL,
  prompt = "分析此页面结构并报告:
    1. 页面类型:文章、产品列表、搜索结果、数据表、
       目录、仪表盘、API文档、FAQ、价格页、职位板、活动页或其他
    2. 主要内容结构:表格、有序/无序列表、卡片网格、自由文本、
       折叠面板、标签页
    3. 可见的不同数据项的大致数量
    4. JavaScript渲染标识:空容器、加载动画、
       SPA框架标记(React根节点、Vue应用、Angular)、HTML简洁但JS密集
    5. 分页方式:下一页/上一页链接、页码、加载更多按钮、
       无限滚动标识、总结果数
    6. 数据密度:存在多少结构化、可提取的数据
    7. 列出可提取的主要数据字段/列
    8. 嵌入式结构化数据:JSON-LD、微数据、OpenGraph标签
    9. 可用的下载链接:CSV、Excel、PDF、API端点"
)

Step 2.2: Evaluate Fetch Quality

步骤2.2:评估抓取质量

| Signal | Interpretation | Action |
| :--- | :--- | :--- |
| Rich content with data clearly visible | Static page | Strategy A (WebFetch) |
| Empty containers, "loading...", minimal text | JS-rendered | Strategy B (Browser) |
| Login wall, CAPTCHA, 403/401 response | Blocked | Report to user |
| Content present but poorly structured | Needs precision | Strategy B (Browser) |
| JSON or XML response body | API endpoint | Strategy C (Bash/curl) |
| Download links for CSV/Excel available | Direct data file | Strategy C (download) |

| 信号 | 解读 | 行动 |
| :--- | :--- | :--- |
| 内容丰富,数据清晰可见 | 静态页面 | 策略A(WebFetch) |
| 空容器、"加载中..."、文本极少 | JS渲染页面 | 策略B(浏览器自动化) |
| 登录墙、验证码、403/401响应 | 访问被拦截 | 向用户报告 |
| 内容存在但结构混乱 | 需要精准定位 | 策略B(浏览器自动化) |
| 响应体为JSON或XML | API端点 | 策略C(Bash/curl) |
| 存在CSV/Excel下载链接 | 直接数据文件 | 策略C(下载文件) |

Step 2.3: Content Classification

步骤2.3:内容分类

Classify into an extraction mode:
| Mode | Indicators | Examples |
| :--- | :--- | :--- |
| table | HTML `<table>`, grid layout with headers | Price comparison, statistics, specs |
| list | Repeated similar elements, card grids | Search results, product listings |
| article | Long-form text with headings/paragraphs | Blog post, news article, docs |
| product | Product name, price, specs, images, rating | E-commerce product page |
| contact | Names, emails, phones, addresses, roles | Team page, staff directory |
| faq | Question-answer pairs, accordions | FAQ page, help center |
| pricing | Plan names, prices, features, tiers | SaaS pricing page |
| events | Dates, locations, titles, descriptions | Event listings, conferences |
| jobs | Titles, companies, locations, salaries | Job boards, career pages |
| custom | User specified CSS selectors or fields | Anything not matching above |
Record: page type, extraction mode, JS rendering needed (yes/no), available fields, structured data present (JSON-LD etc.).
If user asked for "everything", present the available fields and let them choose.

将页面分类为对应提取模式:
| 模式 | 标识特征 | 示例 |
| :--- | :--- | :--- |
| table | HTML `<table>`、带表头的网格布局 | 价格对比表、统计数据、参数表 |
| list | 重复的相似元素、卡片网格 | 搜索结果、产品列表 |
| article | 带标题/段落的长文本 | 博客文章、新闻报道、文档 |
| product | 产品名称、价格、参数、图片、评分 | 电商产品页 |
| contact | 姓名、邮箱、电话、地址、职位 | 团队页、员工目录 |
| faq | 问答对、折叠面板 | FAQ页、帮助中心 |
| pricing | 套餐名称、价格、功能、档位 | SaaS价格页 |
| events | 日期、地点、标题、描述 | 活动列表、会议信息 |
| jobs | 职位名称、公司、地点、薪资 | 职位板、招聘页 |
| custom | 用户提供的CSS选择器或字段描述 | 不符合上述模式的自定义需求 |
记录:页面类型、提取模式、是否需要JS渲染、可用字段、是否存在结构化数据(如JSON-LD)。
如果用户要求“所有内容”,先展示可用字段,再让用户选择。

Phase 3: Strategy Selection

阶段3:策略选择

Choose the extraction approach based on recon results.
根据站点侦察结果选择提取策略。

Decision Tree

决策树

Structured data (JSON-LD, microdata) has what we need?
 |
 +-- YES --> STRATEGY E: Extract structured data directly
 |
 +-- NO: Content fully visible in WebFetch?
      |
      +-- YES: Need precise element targeting?
      |    |
      |    +-- NO  --> STRATEGY A: WebFetch + AI extraction
      |    +-- YES --> STRATEGY B: Browser automation
      |
      +-- NO: JavaScript rendering detected?
           |
           +-- YES --> STRATEGY B: Browser automation
           +-- NO:  API/JSON/XML endpoint or download link?
                |
                +-- YES --> STRATEGY C: Bash (curl + jq)
                +-- NO  --> Report access issue to user
结构化数据(JSON-LD、微数据)包含所需内容?
 |
 +-- 是 --> 策略E:直接提取结构化数据
 |
 +-- 否:WebFetch可完整获取内容?
      |
      +-- 是:是否需要精准元素定位?
      |    |
      |    +-- 否  --> 策略A:WebFetch + AI提取
      |    +-- 是 --> 策略B:浏览器自动化
      |
      +-- 否:检测到JavaScript渲染?
           |
           +-- 是 --> 策略B:浏览器自动化
           +-- 否:是否存在API/JSON/XML端点或下载链接?
                |
                +-- 是 --> 策略C:Bash(curl + jq)
                +-- 否  --> 向用户报告访问问题

Strategy A: WebFetch With AI Extraction

策略A:WebFetch + AI提取

Best for: Static pages, articles, simple tables, well-structured HTML.
Use WebFetch with a targeted extraction prompt tailored to the mode:
WebFetch(
  url = URL,
  prompt = "Extract [DATA_TARGET] from this page.
    Return ONLY the extracted data as [FORMAT] with these columns/fields: [FIELDS].
    Rules:
    - If a value is missing or unclear, use 'N/A'
    - Do not include navigation, ads, footers, or unrelated content
    - Preserve original values exactly (numbers, currencies, dates)
    - Include ALL matching items, not just the first few
    - For each item, also extract the URL/link if available"
)
Auto-escalation: If WebFetch returns suspiciously few items (less than 50% of expected from recon), or mostly empty fields, automatically escalate to Strategy B without asking user. Log the escalation in notes.
适用场景:静态页面、文章、简单表格、结构清晰的HTML页面。
使用针对提取模式定制的提示词,通过WebFetch进行提取:
WebFetch(
  url = URL,
  prompt = "从该页面提取[数据目标]。
    仅返回提取的数据,格式为[输出格式],包含以下列/字段:[字段列表]。
    规则:
    - 如果值缺失或不明确,使用'N/A'
    - 不要包含导航栏、广告、页脚或无关内容
    - 完全保留原始值(数字、货币、日期)
    - 包含所有匹配项,而非仅前几项
    - 如果可用,为每个条目提取对应的URL/链接"
)
自动降级:如果WebFetch返回的条目数量远低于侦察阶段预期(不足50%),或大部分字段为空,则自动升级为策略B,无需询问用户。在备注中记录此次升级操作。

Strategy B: Browser Automation

策略B:浏览器自动化

Best for: JS-rendered pages, SPAs, interactive content, lazy-loaded data.
Sequence:
  1. Get tab context:
    tabs_context_mcp(createIfEmpty=true)
    -> get tabId
  2. Navigate to URL:
    navigate(url=TARGET_URL, tabId=TAB)
  3. Wait for content to load:
    computer(action="wait", duration=3, tabId=TAB)
  4. Check for cookie/consent banners:
    find(query="cookie consent or accept button", tabId=TAB)
    • If found, dismiss it (prefer privacy-preserving option)
  5. Read page structure:
    read_page(tabId=TAB)
    or
    get_page_text(tabId=TAB)
  6. Locate target elements:
    find(query="[DESCRIPTION]", tabId=TAB)
  7. Extract precise data with JavaScript via the `javascript_tool`
javascript
// Table extraction
const rows = document.querySelectorAll('TABLE_SELECTOR tr');
const data = Array.from(rows).map(row => {
  const cells = row.querySelectorAll('td, th');
  return Array.from(cells).map(c => c.textContent.trim());
});
JSON.stringify(data);
javascript
// List/card extraction
const items = document.querySelectorAll('ITEM_SELECTOR');
const data = Array.from(items).map(item => ({
  field1: item.querySelector('FIELD1_SELECTOR')?.textContent?.trim() || null,
  field2: item.querySelector('FIELD2_SELECTOR')?.textContent?.trim() || null,
  link: item.querySelector('a')?.href || null,
}));
JSON.stringify(data);
  8. For lazy-loaded content, scroll and re-extract:
    computer(action="scroll", scroll_direction="down", tabId=TAB)
    then
    computer(action="wait", duration=2, tabId=TAB)
适用场景:JS渲染页面、SPA、交互式内容、懒加载数据。
步骤:
  1. 获取标签页上下文:
    tabs_context_mcp(createIfEmpty=true)
    -> 获取tabId
  2. 导航至目标URL:
    navigate(url=目标URL, tabId=标签页ID)
  3. 等待内容加载:
    computer(action="wait", duration=3, tabId=标签页ID)
  4. 检查Cookie/授权横幅:
    find(query="cookie consent or accept button", tabId=标签页ID)
    • 如果存在,关闭横幅(优先选择隐私保护选项)
  5. 读取页面结构:
    read_page(tabId=标签页ID)
    get_page_text(tabId=标签页ID)
  6. 定位目标元素:
    find(query="[元素描述]", tabId=标签页ID)
  7. 通过`javascript_tool`使用JavaScript精准提取数据
javascript
// 表格提取
const rows = document.querySelectorAll('TABLE_SELECTOR tr');
const data = Array.from(rows).map(row => {
  const cells = row.querySelectorAll('td, th');
  return Array.from(cells).map(c => c.textContent.trim());
});
JSON.stringify(data);
javascript
// 列表/卡片提取
const items = document.querySelectorAll('ITEM_SELECTOR');
const data = Array.from(items).map(item => ({
  field1: item.querySelector('FIELD1_SELECTOR')?.textContent?.trim() || null,
  field2: item.querySelector('FIELD2_SELECTOR')?.textContent?.trim() || null,
  link: item.querySelector('a')?.href || null,
}));
JSON.stringify(data);
  8. 针对懒加载内容,滚动页面后重新提取:
    computer(action="scroll", scroll_direction="down", tabId=标签页ID)
    然后执行
    computer(action="wait", duration=2, tabId=标签页ID)

Strategy C: Bash (curl + jq)

策略C:Bash(curl + jq)

Best for: REST APIs, JSON endpoints, XML feeds, CSV/Excel downloads.
适用场景:REST API、JSON端点、XML源、CSV/Excel下载。

JSON API

JSON API

curl -s "API_URL" | jq '[.items[] | {field1: .key1, field2: .key2}]'
curl -s "API_URL" | jq '[.items[] | {field1: .key1, field2: .key2}]'

Csv Download

CSV下载

curl -s "CSV_URL" -o /tmp/scraped_data.csv
curl -s "CSV_URL" -o /tmp/scraped_data.csv

Xml Parsing

XML解析

curl -s "XML_URL" | python3 -c "
import xml.etree.ElementTree as ET, json, sys
tree = ET.parse(sys.stdin)
# ... parse and output JSON
"
curl -s "XML_URL" | python3 -c "
import xml.etree.ElementTree as ET, json, sys
tree = ET.parse(sys.stdin)
# ... 解析并输出JSON
"

Strategy D: Hybrid

策略D:混合策略

When a single strategy is insufficient, combine:
  1. WebSearch to discover relevant URLs
  2. WebFetch for initial content assessment
  3. Browser automation for JS-heavy sections
  4. Bash for post-processing (jq, python for data cleaning)
当单一策略不足以完成任务时,可组合使用:
  1. WebSearch发现相关URL
  2. WebFetch进行初始内容评估
  3. 浏览器自动化处理JS密集部分
  4. Bash进行后处理(jq、Python数据清洗)

Strategy E: Structured Data Extraction

策略E:结构化数据提取

When JSON-LD, microdata, or OpenGraph is present:
  1. Use the Browser `javascript_tool` to extract structured data:
javascript
const scripts = document.querySelectorAll('script[type="application/ld+json"]');
const data = Array.from(scripts).map(s => {
  try { return JSON.parse(s.textContent); } catch { return null; }
}).filter(Boolean);
JSON.stringify(data);
  2. This often provides cleaner, more reliable data than DOM scraping
  3. Fall back to DOM extraction only for fields not in structured data
当页面存在JSON-LD、微数据或OpenGraph标签时:
  1. 使用浏览器的`javascript_tool`提取结构化数据:
javascript
const scripts = document.querySelectorAll('script[type="application/ld+json"]');
const data = Array.from(scripts).map(s => {
  try { return JSON.parse(s.textContent); } catch { return null; }
}).filter(Boolean);
JSON.stringify(data);
  2. 这种方式通常比DOM抓取提供更干净、更可靠的数据
  3. 仅当结构化数据中缺少所需字段时,才回退到DOM提取
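When the raw HTML is already in hand (for example from a Strategy C `curl` fetch), the same JSON-LD blocks can be pulled out without a browser. This is a minimal Python sketch; the regex-based scan is an illustrative assumption, and a real HTML parser is more robust for malformed pages:

```python
import json
import re

def extract_json_ld(html_text):
    """Pull JSON-LD blocks out of raw HTML (regex sketch; a real HTML parser is safer)."""
    pattern = r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>'
    blocks = []
    for raw in re.findall(pattern, html_text, flags=re.DOTALL | re.IGNORECASE):
        try:
            blocks.append(json.loads(raw))
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of failing the extraction
    return blocks
```

Invalid JSON-LD blocks are skipped rather than aborting, matching the fallback rule above: use what parses, fall back to DOM extraction for the rest.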

Pagination Handling

分页处理

When pagination is detected and user wants multiple pages:
Page-number pagination (any strategy):
  1. Extract data from current page
  2. Identify the URL pattern (e.g. `?page=N`, `/page/N`, `&offset=N`)
  3. Iterate through pages up to user's max (default: 5 pages)
  4. Show progress: "Extracting page 2/5..."
  5. Concatenate all results, deduplicate if needed
Infinite scroll (Browser only):
  1. Extract currently visible data
  2. Record item count
  3. Scroll down:
    computer(action="scroll", scroll_direction="down", tabId=TAB)
  4. Wait:
    computer(action="wait", duration=2, tabId=TAB)
  5. Extract newly loaded data
  6. Compare count - if no new items after 2 scrolls, stop
  7. Repeat until no new content or max iterations (default: 5)
"Load More" button (Browser only):
  1. Extract currently visible data
  2. Find button:
    find(query="load more button", tabId=TAB)
  3. Click it:
    computer(action="left_click", ref=REF, tabId=TAB)
  4. Wait and extract new content
  5. Repeat until button disappears or max iterations reached

当检测到分页且用户需要多页数据时:
页码式分页(支持所有策略)
  1. 提取当前页面数据
  2. 识别URL模式(例如`?page=N`、`/page/N`、`&offset=N`)
  3. 遍历页面,直至达到用户设置的最大页数(默认:5页)
  4. 展示进度:“正在提取第2/5页...”
  5. 合并所有结果,必要时去重
无限滚动(仅浏览器自动化)
  1. 提取当前可见数据
  2. 记录条目数量
  3. 向下滚动:
    computer(action="scroll", scroll_direction="down", tabId=标签页ID)
  4. 等待:
    computer(action="wait", duration=2, tabId=标签页ID)
  5. 提取新加载的数据
  6. 对比条目数量 - 如果连续2次滚动后无新条目,停止操作
  7. 重复操作直至无新内容或达到最大迭代次数(默认:5次)
“加载更多”按钮(仅浏览器自动化)
  1. 提取当前可见数据
  2. 查找按钮:
    find(query="load more button", tabId=标签页ID)
  3. 点击按钮:
    computer(action="left_click", ref=引用ID, tabId=标签页ID)
  4. 等待并提取新内容
  5. 重复操作直至按钮消失或达到最大迭代次数
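The page-number loop above can be sketched in Python. Here `extract_fn` is a hypothetical stand-in for the per-page extraction call (WebFetch or Browser), and the set of row keys drops items that repeat across page boundaries:

```python
def paginate(base_url, extract_fn, max_pages=5):
    """Follow ?page=N URLs, stopping at an empty page or the page cap."""
    all_items, seen = [], set()
    for page in range(1, max_pages + 1):
        url = f"{base_url}?page={page}"
        items = extract_fn(url)  # stand-in for one WebFetch/Browser extraction
        if not items:            # an empty page usually means we ran past the end
            break
        for item in items:
            key = tuple(sorted(item.items()))  # whole-row identity
            if key not in seen:                # drop rows repeated across pages
                seen.add(key)
                all_items.append(item)
    return all_items
```

The same skeleton works for infinite scroll by replacing the URL loop with scroll-wait-extract iterations and keeping the same `seen` set.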

Phase 4: Extract

阶段4:数据提取

Execute the selected strategy using mode-specific patterns. See references/extraction-patterns.md for CSS selectors and JavaScript snippets.
根据选择的策略,使用对应模式的提取规则执行提取。 参考references/extraction-patterns.md 获取CSS选择器和JavaScript代码片段。

Table Mode

表格模式

WebFetch prompt:
"Extract ALL rows from the table(s) on this page.
Return as a markdown table with exact column headers.
Include every row - do not truncate or summarize.
Preserve numeric precision, currencies, and units."
WebFetch提示词:
"提取该页面中的所有表格行。
以Markdown表格形式返回,保留原表头。
包含所有行,不要截断或汇总。
保留数值精度、货币和单位。"

List Mode

列表模式

WebFetch prompt:
"Extract each [ITEM_TYPE] from this page.
For each item, extract: [FIELD_LIST].
Return as a JSON array of objects with these keys: [KEY_LIST].
Include ALL items, not just the first few. Include link/URL for each item if available."
WebFetch提示词:
"从该页面提取每个[条目类型]。
为每个条目提取以下字段:[字段列表]。
以JSON数组形式返回,键为[键列表]。
包含所有条目,而非仅前几项。如果可用,为每个条目提取对应的链接/URL。"

Article Mode

文章模式

WebFetch prompt:
"Extract article metadata:
- title, author, date, tags/categories, word count estimate
- Key factual data points, statistics, and named entities
Return as structured markdown. Summarize the content; do not reproduce full text."
WebFetch提示词:
"提取文章元数据:
- 标题、作者、日期、标签/分类、字数估算
- 关键事实数据、统计信息和命名实体
以结构化Markdown形式返回。总结内容,不要复制全文。"

Product Mode

产品模式

WebFetch prompt:
"Extract product data with these exact fields:
- name, brand, price, currency, originalPrice (if discounted),
  availability, description (first 200 chars), rating, reviewCount,
  specifications (as key-value pairs), productUrl, imageUrl
Return as JSON. Use null for missing fields."
Also check for JSON-LD `Product` schema (Strategy E) first.
WebFetch提示词:
"提取产品数据,包含以下字段:
- 名称、品牌、价格、货币、原价(如果有折扣)、
  库存状态、描述(前200字符)、评分、评论数、
  参数(键值对形式)、产品URL、图片URL
以JSON形式返回。缺失字段使用null。"
同时优先使用策略E提取JSON-LD中的`Product`模式数据。

Contact Mode

联系人模式

WebFetch prompt:
"Extract contact information for each person/entity:
- name, title, role, email, phone, address, organization, website, linkedinUrl
Return as a markdown table. Only extract real contacts visible on the page."
WebFetch提示词:
"提取每个人/实体的联系信息:
- 姓名、头衔、职位、邮箱、电话、地址、机构、网站、LinkedIn URL
以Markdown表格形式返回。仅提取页面上可见的真实联系信息。"

FAQ Mode

FAQ模式

WebFetch prompt:
"Extract all question-answer pairs from this page.
For each FAQ item extract:
- question: the exact question text
- answer: the answer text (first 300 chars if long)
- category: the section/category if grouped
Return as a JSON array of objects."
WebFetch提示词:
"提取该页面中的所有问答对。
为每个FAQ条目提取:
- 问题:完整的问题文本
- 答案:答案文本(如果过长,取前300字符)
- 分类:所属章节/分类(如果有分组)
以JSON数组形式返回。"

Pricing Mode

价格模式

WebFetch prompt:
"Extract all pricing plans/tiers from this page.
For each plan extract:
- planName, monthlyPrice, annualPrice, currency
- features (array of included features)
- limitations (array of limits or excluded features)
- ctaText (call-to-action button text)
- highlighted (true if marked as recommended/popular)
Return as JSON. Use null for missing fields."
WebFetch提示词:
"提取该页面中的所有价格套餐/档位。
为每个套餐提取:
- 套餐名称、月付价格、年付价格、货币
- 包含的功能(数组形式)
- 限制条件(数组形式,包含限制或排除的功能)
- 号召性用语按钮文本
- 是否突出显示(如果标记为推荐/热门则为true)
以JSON形式返回。缺失字段使用null。"

Events Mode

活动模式

WebFetch prompt:
"Extract all events/sessions from this page.
For each event extract:
- title, date, time, endTime, location, description (first 200 chars)
- speakers (array of names), category, registrationUrl
Return as JSON. Use null for missing fields."
WebFetch提示词:
"提取该页面中的所有活动/场次信息。
为每个活动提取:
- 标题、日期、时间、结束时间、地点、描述(前200字符)
- 演讲者(姓名数组)、分类、注册URL
以JSON形式返回。缺失字段使用null。"

Jobs Mode

职位模式

WebFetch prompt:
"Extract all job listings from this page.
For each job extract:
- title, company, location, salary, salaryRange, type (full-time/part-time/contract)
- postedDate, description (first 200 chars), applyUrl, tags
Return as JSON. Use null for missing fields."
WebFetch提示词:
"提取该页面中的所有职位列表。
为每个职位提取:
- 职位名称、公司、地点、薪资、薪资范围、类型(全职/兼职/合同)
- 发布日期、描述(前200字符)、申请URL、标签
以JSON形式返回。缺失字段使用null。"

Custom Mode

自定义模式

When user provides specific selectors or field descriptions:
  • Use Browser automation with `javascript_tool` and the user's CSS selectors
  • Or use WebFetch with a prompt built from user's field descriptions
  • Always confirm extracted schema with user before proceeding to multi-URL
当用户提供特定选择器或字段描述时:
  • 使用浏览器自动化,结合`javascript_tool`和用户提供的CSS选择器
  • 或使用WebFetch,根据用户的字段描述构建提示词
  • 在进行多URL提取前,务必与用户确认提取的 schema 是否正确

Multi-URL Extraction

多URL提取

When extracting from multiple URLs:
  1. Extract from the first URL to establish the data schema
  2. Show user the first results and confirm the schema is correct
  3. Extract from remaining URLs using the same schema
  4. Add a `source` column/field to every record with the origin URL
  5. Combine all results into a single output
  6. Show progress: "Extracting 3/7 URLs..."

当需要从多个URL提取数据时:
  1. 第一个URL提取数据,确定数据schema
  2. 向用户展示第一批结果,确认schema正确
  3. 使用相同的schema提取剩余URL的数据
  4. 为每条记录添加`source`列/字段,记录来源URL
  5. 将所有结果合并为单一输出
  6. 展示进度:“正在提取第3/7个URL...”

Phase 5: Transform

阶段5:数据转换

Clean, normalize, and enrich extracted data before validation. See references/data-transforms.md for patterns.
在验证前,对提取的数据进行清洗、标准化和补充。 参考references/data-transforms.md 获取转换规则。

Automatic Transforms (Always Apply)

自动转换(始终执行)

| Transform | Action |
| :--- | :--- |
| Whitespace cleanup | Trim, collapse multiple spaces, remove `\n` in cells |
| HTML entity decode | `&amp;` -> `&`, `&lt;` -> `<`, `&#39;` -> `'` |
| Unicode normalization | NFKC normalization for consistent characters |
| Empty string to null | `""` -> `null` (for JSON), `""` -> `N/A` (for tables) |
| 转换操作 | 具体动作 |
| :--- | :--- |
| 空白字符清理 | 去除首尾空白、合并多个空格、移除单元格中的`\n` |
| HTML实体解码 | `&amp;` -> `&`、`&lt;` -> `<`、`&#39;` -> `'` |
| Unicode标准化 | 对字符进行NFKC标准化,确保一致性 |
| 空字符串转null | `""` -> `null`(JSON格式)、`""` -> `N/A`(表格格式) |
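The always-on transforms map directly onto the Python standard library. A minimal per-cell cleaner, assuming string-or-None input:

```python
import html
import re
import unicodedata

def clean_cell(value, fmt="json"):
    """Apply the always-on transforms: whitespace, entities, NFKC, empty handling."""
    if value is None:
        return None if fmt == "json" else "N/A"
    text = html.unescape(str(value))            # &amp; -> &, &lt; -> <, &#39; -> '
    text = unicodedata.normalize("NFKC", text)  # consistent unicode forms
    text = re.sub(r"\s+", " ", text).strip()    # trim + collapse spaces and newlines
    if text == "":
        return None if fmt == "json" else "N/A"
    return text
```

The `fmt` switch implements the last table row: empty values become `null` for JSON output and `N/A` for table output.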

Conditional Transforms (Apply When Relevant)

条件转换(按需执行)

| Transform | When | Action |
| :--- | :--- | :--- |
| Price normalization | Product/pricing modes | Extract numeric value + currency symbol |
| Date normalization | Any dates found | Normalize to ISO-8601 (YYYY-MM-DD) |
| URL resolution | Relative URLs extracted | Convert to absolute URLs |
| Phone normalization | Contact mode | Standardize to E.164 format if possible |
| Deduplication | Multi-page or multi-URL | Remove exact duplicate rows |
| Sorting | User requested or natural | Sort by user-specified field |
| 转换操作 | 适用场景 | 具体动作 |
| :--- | :--- | :--- |
| 价格标准化 | 产品/价格模式 | 提取数值和货币符号 |
| 日期标准化 | 存在日期数据时 | 转换为ISO-8601格式(YYYY-MM-DD) |
| URL解析 | 提取到相对URL时 | 转换为绝对URL |
| 电话号码标准化 | 联系人模式 | 尽可能转换为E.164格式 |
| 去重 | 多页面或多URL提取时 | 移除完全重复的行 |
| 排序 | 用户要求或自然排序需求时 | 按用户指定字段排序 |

Data Enrichment (Only When Useful)

数据补充(仅在有用时执行)

| Enrichment | When | Action |
| :--- | :--- | :--- |
| Currency conversion | User asks for single currency | Note original + convert (approximate) |
| Domain extraction | URLs in data | Add domain column from full URLs |
| Word count | Article mode | Count words in extracted text |
| Relative dates | Dates present | Add "X days ago" column if useful |
| 补充操作 | 适用场景 | 具体动作 |
| :--- | :--- | :--- |
| 货币转换 | 用户要求统一货币时 | 记录原货币并转换为目标货币(近似值) |
| 域名提取 | 数据中包含URL时 | 从完整URL中提取域名,添加为新列 |
| 字数统计 | 文章模式 | 统计提取文本的字数 |
| 相对日期计算 | 存在日期数据时 | 如有需要,添加"X天前"列 |

Deduplication Strategy

去重策略

When combining data from multiple pages or URLs:
  1. Exact match: rows with identical values in all fields -> keep first
  2. Near match: rows with same key fields (name+source) but different details -> keep most complete (fewer nulls), flag in notes
  3. Report: "Removed N duplicate rows" in delivery notes

当合并多页面或多URL的数据时:
  1. 完全匹配:所有字段值均相同的行 -> 保留第一行
  2. 近似匹配:关键字段(名称+来源)相同但细节不同的行 -> 保留最完整的行(空值最少),并在备注中标记
  3. 报告:在交付备注中说明“已移除N条重复行”
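The two-tier strategy above (exact match keeps the first row, near match keeps the most complete) can be sketched as one pass keyed by the user-chosen key fields; completeness is counted as non-empty values:

```python
def dedupe(rows, key_fields):
    """Exact duplicates: keep the first occurrence. Near duplicates (same key
    fields, different details): keep the row with the fewest null/empty values."""
    best, order = {}, []
    for row in rows:
        key = tuple(row.get(f) for f in key_fields)
        completeness = sum(1 for v in row.values() if v not in (None, "", "N/A"))
        if key not in best:
            best[key] = (completeness, row)
            order.append(key)
        elif completeness > best[key][0]:  # strictly more complete wins;
            best[key] = (completeness, row)  # ties keep the earlier row
    return [best[k][1] for k in order]
```

Reporting "Removed N duplicate rows" is then just `len(rows) - len(dedupe(rows, key_fields))`.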

Phase 6: Validate

阶段6:验证

Verify extraction quality before delivering results.
在交付结果前,验证提取质量。

Validation Checks

验证检查项

| Check | Action |
| :--- | :--- |
| Item count | Compare extracted count to expected count from recon |
| Empty fields | Count N/A or null values per field |
| Data type consistency | Numbers should be numeric, dates parseable |
| Duplicates | Flag exact duplicate rows (post-dedup) |
| Encoding | Check for HTML entities, garbled characters |
| Completeness | All user-requested fields present in output |
| Truncation | Verify data wasn't cut off (check last items) |
| Outliers | Flag values that seem anomalous (e.g. $0.00 price) |
| 检查项 | 具体动作 |
| :--- | :--- |
| 条目数量 | 对比提取数量与侦察阶段的预期数量 |
| 空字段 | 统计每个字段的N/A或null值数量 |
| 数据类型一致性 | 数值应为数字类型,日期应可解析 |
| 重复项 | 标记去重后仍存在的完全重复行 |
| 编码问题 | 检查是否存在HTML实体、乱码字符 |
| 完整性 | 输出中包含用户要求的所有字段 |
| 截断问题 | 验证数据未被截断(检查最后几条条目) |
| 异常值 | 标记异常值(例如0.00美元的价格) |
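Two of these checks (empty-field ratio and item count) feed the confidence rating directly. A sketch of that mapping, using the 10% threshold from the rating criteria; the exact cutoffs for HIGH are an interpretation:

```python
def rate_confidence(rows, expected_count=None):
    """Map empty-field ratio and count mismatch to HIGH/MEDIUM/LOW."""
    if not rows:
        return "LOW"
    total_fields = sum(len(r) for r in rows)
    empty = sum(1 for r in rows for v in r.values() if v in (None, "", "N/A"))
    empty_ratio = empty / total_fields if total_fields else 1.0
    count_matches = expected_count is None or len(rows) == expected_count
    if empty_ratio == 0 and count_matches:
        return "HIGH"
    if empty_ratio < 0.10:   # minor gaps per the rating table
        return "MEDIUM"
    return "LOW"
```

The remaining checks (type consistency, truncation, outliers) are qualitative and should lower the rating or be noted in the delivery envelope.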

Confidence Rating

置信度评级

Assign to every extraction:
| Rating | Criteria |
| :--- | :--- |
| HIGH | All fields populated, count matches expected, no anomalies |
| MEDIUM | Minor gaps (<10% empty fields) or count slightly differs |
| LOW | Significant gaps (>10% empty), structural issues, partial data |
Always report confidence with specifics:
Confidence: HIGH - 47 items extracted, all 6 fields populated, matches expected count from page analysis.
为每次提取结果分配置信度:
| 评级 | 标准 |
| :--- | :--- |
| 高 | 所有字段均已填充,数量与预期一致,无异常值 |
| 中 | 存在少量缺失(空字段占比<10%)或数量略有差异 |
| 低 | 大量缺失(空字段占比>10%)、结构问题、数据不完整 |
始终附带具体细节报告置信度:
置信度:高 - 已提取47条数据,6个字段均已填充,与页面分析的预期数量一致。

Auto-Recovery (Try Before Reporting Issues)

自动恢复(报告问题前尝试)

| Issue | Auto-Recovery Action |
| :--- | :--- |
| Missing data | Re-attempt with Browser if WebFetch was used |
| Encoding problems | Apply HTML entity decode + unicode normalization |
| Incomplete results | Check for pagination or lazy-loading, fetch more |
| Count mismatch | Scroll/paginate to find remaining items |
| All fields empty | Page likely JS-rendered, switch to Browser strategy |
| Partial fields | Try JSON-LD extraction as supplement |
Log all recovery attempts in delivery notes. Inform user of any irrecoverable gaps with specific details.

| 问题 | 自动恢复动作 |
| :--- | :--- |
| 数据缺失 | 如果使用了WebFetch,重新尝试使用浏览器自动化策略 |
| 编码问题 | 执行HTML实体解码和Unicode标准化 |
| 结果不完整 | 检查是否存在分页或懒加载,获取更多数据 |
| 数量不匹配 | 滚动/分页查找剩余条目 |
| 所有字段为空 | 页面可能是JS渲染,切换为浏览器自动化策略 |
| 部分字段缺失 | 尝试提取JSON-LD数据进行补充 |
在交付备注中记录所有恢复尝试。 向用户报告无法恢复的缺失,并提供具体细节。

Phase 7: Format and Deliver

阶段7:格式化与交付

Structure results according to user preference. See references/output-templates.md for complete formatting templates.
根据用户偏好组织结果。 参考references/output-templates.md 获取完整的格式化模板。

Delivery Envelope

交付包装

ALWAYS wrap results with this metadata header:
始终使用以下元数据头包裹结果:

Extraction Results

提取结果

Source: Page Title
Date: YYYY-MM-DD HH:MM UTC
Items: N records (M fields each)
Confidence: HIGH | MEDIUM | LOW
Strategy: A (WebFetch) | B (Browser) | C (API) | E (Structured Data)
Format: Markdown Table | JSON | CSV

[DATA HERE]

Notes:
  • [Any gaps, issues, or observations]
  • [Transforms applied: deduplication, normalization, etc.]
  • [Pages scraped if paginated: "Pages 1-5 of 12"]
  • [Auto-escalation if it occurred: "Escalated from WebFetch to Browser"]
来源: 页面标题
日期: YYYY-MM-DD HH:MM UTC
条目数: N条记录(每条M个字段)
置信度: 高 | 中 | 低
策略: A(WebFetch) | B(浏览器) | C(API) | E(结构化数据)
格式: Markdown表格 | JSON | CSV

[数据内容]

备注:
  • [任何缺失、问题或观察结果]
  • [执行的转换操作:去重、标准化等]
  • [如果是分页抓取:“已抓取第1-5页,共12页”]
  • [如果发生自动降级:“已从WebFetch自动切换为浏览器策略”]

Markdown Table Rules

Markdown表格规则

  • Left-align text columns (`:---`), right-align numbers (`---:`)
  • Consistent column widths for readability
  • Include summary row for numeric data when useful (totals, averages)
  • Maximum 10 columns per table; split wider data into multiple tables or suggest JSON format
  • Truncate long cell values to 60 chars with `...` indicator
  • Use `N/A` for missing values, never leave cells empty
  • For multi-page results, show combined table (not per-page)
  • 文本列左对齐(`:---`),数值列右对齐(`---:`)
  • 保持列宽一致,提升可读性
  • 如有需要,为数值数据添加汇总行(总计、平均值)
  • 每张表格最多10列;数据列过多时,拆分为多个表格或建议使用JSON格式
  • 长单元格值截断为60字符,末尾添加`...`标识
  • 缺失值使用`N/A`,不保留空单元格
  • 多页结果合并为单个表格展示(而非按页展示)
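These rules are mechanical enough to sketch as a renderer. The alignment markers, truncation limit, and `N/A` fallback follow the bullets above; the function and its signature are illustrative, not part of the skill:

```python
def to_markdown_table(rows, columns, numeric=(), max_len=60):
    """Render rows as markdown: left-align text, right-align numeric columns,
    truncate long cells, and substitute N/A for missing values."""
    def cell(row, col):
        value = row.get(col)
        text = "N/A" if value in (None, "") else str(value)
        return text if len(text) <= max_len else text[: max_len - 3] + "..."
    header = "| " + " | ".join(columns) + " |"
    align = "| " + " | ".join("---:" if c in numeric else ":---" for c in columns) + " |"
    body = ["| " + " | ".join(cell(r, c) for c in columns) + " |" for r in rows]
    return "\n".join([header, align] + body)
```

A summary row, when useful, can be appended as one more body line before joining.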

JSON Rules

JSON规则

  • Use camelCase for keys (e.g. `productName`, `unitPrice`)
  • Wrap in metadata envelope:
    json
    {
      "metadata": {
        "source": "URL",
        "title": "Page Title",
        "extractedAt": "ISO-8601",
        "itemCount": 47,
        "fieldCount": 6,
        "confidence": "HIGH",
        "strategy": "A",
        "transforms": ["deduplication", "priceNormalization"],
        "notes": []
      },
      "data": [ ... ]
    }
  • Pretty-print with 2-space indentation
  • Numbers as numbers (not strings), booleans as booleans
  • null for missing values (not empty strings)
  • 键使用小驼峰命名(例如`productName`、`unitPrice`)
  • 使用元数据包裹:
    json
    {
      "metadata": {
        "source": "URL",
        "title": "页面标题",
        "extractedAt": "ISO-8601格式时间",
        "itemCount": 47,
        "fieldCount": 6,
        "confidence": "高",
        "strategy": "A",
        "transforms": ["去重", "价格标准化"],
        "notes": []
      },
      "data": [ ... ]
    }
  • 使用2空格缩进格式化输出
  • 数值类型保留为数字,布尔值保留为布尔类型
  • 缺失值使用null(而非空字符串)

CSV Rules

CSV规则

  • First row is always headers
  • Quote any field containing commas, quotes, or newlines
  • UTF-8 encoding with BOM for Excel compatibility
  • Use `,` as delimiter (standard)
  • Include metadata as comments: `# Source: URL`
  • 第一行始终是表头
  • 包含逗号、引号或换行符的字段需用引号包裹
  • 使用带BOM的UTF-8编码,确保与Excel兼容
  • 使用`,`作为分隔符(标准格式)
  • 元数据以注释形式添加:`# 来源:URL`
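The CSV rules above map onto Python's `csv` module, which handles the quoting of commas, quotes, and newlines automatically. A sketch (the function name and metadata-comment placement are assumptions):

```python
import csv
import io

def to_csv(rows, columns, source_url):
    """CSV with UTF-8 BOM (Excel), header row, metadata comment, safe quoting."""
    buf = io.StringIO()
    buf.write(f"# Source: {source_url}\n")  # metadata as a comment line
    writer = csv.DictWriter(buf, fieldnames=columns, quoting=csv.QUOTE_MINIMAL)
    writer.writeheader()
    for row in rows:
        writer.writerow({c: row.get(c, "N/A") for c in columns})
    return "\ufeff" + buf.getvalue()  # BOM prepended for Excel compatibility
```

Note that strict CSV has no comment syntax; the `# Source:` line is a convention some tools tolerate, so drop it if the consumer chokes on it.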

File Output

文件输出

When user requests file save:
  • Markdown: `.md` extension
  • JSON: `.json` extension
  • CSV: `.csv` extension
  • Confirm path before writing
  • Report full file path and item count after saving
当用户要求保存为文件时:
  • Markdown:`.md`扩展名
  • JSON:`.json`扩展名
  • CSV:`.csv`扩展名
  • 写入前确认路径
  • 保存后报告完整文件路径和条目数

Multi-URL Comparison Format

多URL对比格式

When comparing data across multiple sources:
  • Add `Source` as the first column/field
  • Use short identifiers for sources (domain name or user label)
  • Group by source or interleave based on user preference
  • Highlight differences if user asks for comparison
  • Include summary: "Best price: $X at store-b.com"
当需要跨源对比数据时:
  • 添加`来源`作为第一列/字段
  • 使用短标识表示来源(域名或用户自定义标签)
  • 根据用户偏好,按来源分组或交叉展示
  • 如果用户要求对比,高亮显示差异
  • 添加汇总信息:“最低价格:X美元,来自store-b.com”

Differential Output

When user requests change detection (diff mode):
  • Compare current extraction with previous run
  • Mark new items with [NEW]
  • Mark removed items with [REMOVED]
  • Mark changed values with [WAS: old_value]
  • Include summary: "Changes since last run: +5 new, -2 removed, 3 modified"
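The comparison above can be sketched as follows, assuming rows from both runs carry a unique key field (the function name and key are illustrative):

```python
def diff_runs(previous, current, key):
    """Compare two extraction runs keyed on a unique field, tagging
    additions, removals, and changed values with the markers above."""
    prev = {row[key]: row for row in previous}
    curr = {row[key]: row for row in current}
    added = curr.keys() - prev.keys()
    removed = prev.keys() - curr.keys()
    changes = [f"[NEW] {k}" for k in sorted(added)]
    changes += [f"[REMOVED] {k}" for k in sorted(removed)]
    modified = set()
    for k in sorted(curr.keys() & prev.keys()):
        for field, value in curr[k].items():
            old = prev[k].get(field)
            if old != value:
                changes.append(f"{k}.{field}: {value} [WAS: {old}]")
                modified.add(k)
    summary = (f"Changes since last run: +{len(added)} new, "
               f"-{len(removed)} removed, {len(modified)} modified")
    return changes, summary
```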


Rate Limiting

  • Maximum 1 request per 2 seconds for sequential page fetches
  • For multi-URL jobs, process sequentially with pauses
  • If a site returns 429 (Too Many Requests), stop and report to user
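A sketch of the pacing rule with an injectable fetch function, so the policy stays independent of the HTTP client; the function names are illustrative and `fetch(url)` is assumed to return a `(status, body)` pair:

```python
import time

def fetch_sequential(urls, fetch, delay=2.0):
    """Fetch URLs one at a time with a pause between requests;
    stop and report on HTTP 429 rather than retrying."""
    pages = {}
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # at most 1 request per `delay` seconds
        status, body = fetch(url)
        if status == 429:
            raise RuntimeError(f"429 Too Many Requests from {url}; stopping.")
        pages[url] = body
    return pages
```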

Access Respect

  • If a page blocks access (403, CAPTCHA, login wall), report to user
  • Do NOT attempt to bypass bot detection, CAPTCHAs, or access controls
  • Do NOT scrape behind authentication unless user explicitly provides access
  • Respect robots.txt directives when known

Copyright

  • Do NOT reproduce large blocks of copyrighted article text
  • For articles: extract factual data, statistics, and structured info; summarize narrative content
  • Always include source attribution (http://example.com) in output

Data Scope

  • Extract ONLY what the user explicitly requested
  • Warn user before collecting potentially sensitive data at scale (emails, phone numbers, personal information)
  • Do not store or transmit extracted data beyond what the user sees

Failure Protocol

When extraction fails or is blocked:
  1. Explain the specific reason (JS rendering, bot detection, login, etc.)
  2. Suggest alternatives (different URL, API if available, manual approach)
  3. Never retry aggressively or escalate access attempts


Quick Reference: Mode Cheat Sheet

User Says...                          | Mode      | Strategy  | Output Default
"extract the table"                   | table     | A or B    | Markdown table
"get all products/prices"             | product   | E then A  | Markdown table
"scrape the listings"                 | list      | A or B    | Markdown table
"extract contact info / team page"    | contact   | A         | Markdown table
"get the article data"                | article   | A         | Markdown text
"extract the FAQ"                     | faq       | A or B    | JSON
"get pricing plans"                   | pricing   | A or B    | Markdown table
"scrape job listings"                 | jobs      | A or B    | Markdown table
"get event schedule"                  | events    | A or B    | Markdown table
"find and extract [topic]"            | discovery | WebSearch | Markdown table
"compare prices across sites"         | multi-URL | A or B    | Comparison table
"what changed since last time"        | diff      | any       | Diff format


References

  • Extraction patterns: references/extraction-patterns.md (CSS selectors, JavaScript snippets, JSON-LD parsing, domain tips)
  • Output templates: references/output-templates.md (Markdown, JSON, CSV templates with complete examples)
  • Data transforms: references/data-transforms.md (cleaning, normalization, deduplication, enrichment patterns)

Best Practices

  • Provide clear, specific context about your project and requirements
  • Review all suggestions before applying them to production code
  • Combine with other complementary skills for comprehensive analysis

Common Pitfalls

  • Using this skill for tasks outside its domain expertise
  • Applying recommendations without understanding your specific context
  • Not providing enough project context for accurate analysis