1. CLARIFY -> 2. RECON -> 3. STRATEGY -> 4. EXTRACT -> 5. TRANSFORM -> 6. VALIDATE -> 7. FORMAT

| Parameter | Resolve | Default |
|---|---|---|
| Target URL(s) | Which page(s) to scrape? | (required) |
| Data Target | What specific data to extract? | (required) |
| Output Format | Markdown table, JSON, CSV, or text? | Markdown table |
| Scope | Single page, paginated, or multi-URL? | Single page |

| Parameter | Resolve | Default |
|---|---|---|
| Pagination | Follow pagination? Max pages? | No, 1 page |
| Max Items | Maximum number of items to collect? | Unlimited |
| Filters | Data to exclude or include? | None |
| Sort Order | How to sort results? | Source order |
| Save Path | Save to file? Which path? | Display only |
| Language | Respond in which language? | User's lang |
| Diff Mode | Compare with previous run? | No |
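The two parameter tables above amount to a set of defaults that user input overrides. A minimal sketch of that merge follows; the `resolveParams` helper and all key names are hypothetical and chosen only for illustration.

```javascript
// Defaults copied from the parameter tables. Key names are illustrative assumptions.
const DEFAULT_PARAMS = {
  outputFormat: "markdown-table",
  scope: "single-page",
  followPagination: false,
  maxPages: 1,
  maxItems: Infinity, // "Unlimited"
  filters: null,      // "None"
  sortOrder: "source", // keep source order
  savePath: null,     // display only
  language: null,     // fall back to the user's language
  diffMode: false,
};

// Merge user input over the defaults and enforce the two required parameters.
function resolveParams(userParams) {
  const params = { ...DEFAULT_PARAMS, ...userParams };
  // Accept either a single URL string or an array of URLs.
  const targetUrls = [].concat(params.targetUrls ?? []);
  if (targetUrls.length === 0) throw new Error("Target URL(s) is required");
  if (!params.dataTarget) throw new Error("Data Target is required");
  return { ...params, targetUrls };
}
```

Anything the user leaves unspecified keeps its table default, so only the target URL(s) and the data target ever need to be asked for.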

WebFetch(
url = TARGET_URL,
prompt = "Analyze this page structure and report:
1. Page type: article, product listing, search results, data table,
directory, dashboard, API docs, FAQ, pricing page, job board, events, or other
2. Main content structure: tables, ordered/unordered lists, card grid, free-form text,
accordion/collapsible sections, tabs
3. Approximate number of distinct data items visible
4. JavaScript rendering indicators: empty containers, loading spinners,
SPA framework markers (React root, Vue app, Angular), minimal HTML with heavy JS
5. Pagination: next/prev links, page numbers, load-more buttons,
infinite scroll indicators, total results count
6. Data density: how much structured, extractable data exists
7. List the main data fields/columns available for extraction
8. Embedded structured data: JSON-LD, microdata, OpenGraph tags
9. Available download links: CSV, Excel, PDF, API endpoints"
)

| Signal | Interpretation | Action |
|---|---|---|
| Rich content with data clearly visible | Static page | Strategy A (WebFetch) |
| Empty containers, "loading...", minimal text | JS-rendered | Strategy B (Browser) |
| Login wall, CAPTCHA, 403/401 response | Blocked | Report to user |
| Content present but poorly structured | Needs precision | Strategy B (Browser) |
| JSON or XML response body | API endpoint | Strategy C (Bash/curl) |
| Download links for CSV/Excel available | Direct data file | Strategy C (download) |

| Mode | Indicators | Examples |
|---|---|---|
| table | HTML tables | Price comparison, statistics, specs |
| list | Repeated similar elements, card grids | Search results, product listings |
| article | Long-form text with headings/paragraphs | Blog post, news article, docs |
| product | Product name, price, specs, images, rating | E-commerce product page |
| contact | Names, emails, phones, addresses, roles | Team page, staff directory |
| faq | Question-answer pairs, accordions | FAQ page, help center |
| pricing | Plan names, prices, features, tiers | SaaS pricing page |
| events | Dates, locations, titles, descriptions | Event listings, conferences |
| jobs | Titles, companies, locations, salaries | Job boards, career pages |
| custom | User-specified CSS selectors or fields | Anything not matching above |

Structured data (JSON-LD, microdata) has what we need?
|
+-- YES --> STRATEGY E: Extract structured data directly
|
+-- NO: Content fully visible in WebFetch?
    |
    +-- YES: Need precise element targeting?
    |   |
    |   +-- NO  --> STRATEGY A: WebFetch + AI extraction
    |   +-- YES --> STRATEGY B: Browser automation
    |
    +-- NO: JavaScript rendering detected?
        |
        +-- YES --> STRATEGY B: Browser automation
        +-- NO: API/JSON/XML endpoint or download link?
            |
            +-- YES --> STRATEGY C: Bash (curl + jq)
            +-- NO  --> Report access issue to user
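The decision tree above can be read as a short chain of early returns. The sketch below assumes a hypothetical `recon` object whose boolean fields summarize the reconnaissance step; the field names and the `"REPORT"` return value are illustrative, not part of the original workflow.

```javascript
// Map a recon summary to a strategy letter, following the tree top to bottom.
// Returns "E", "A", "B", "C", or "REPORT" when no strategy applies.
function chooseStrategy(recon) {
  // Embedded structured data wins whenever it already holds the target fields.
  if (recon.structuredDataHasTarget) return "E";

  // Content is visible to WebFetch: only escalate when precise targeting is needed.
  if (recon.contentVisibleInWebFetch) {
    return recon.needsPreciseTargeting ? "B" : "A";
  }

  // Content is not visible: a JS-rendered page needs a real browser.
  if (recon.jsRenderingDetected) return "B";

  // Otherwise fall back to a raw endpoint or a downloadable file.
  if (recon.hasApiOrDownload) return "C";

  // Nothing worked, so the access issue goes back to the user.
  return "REPORT";
}
```

Ordering matters here: the structured-data check comes first because the tree prefers Strategy E even when the page would also be readable through WebFetch.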

WebFetch(
url = URL,
prompt = "Extract [DATA_TARGET] from this page.
Return ONLY the extracted data as [FORMAT] with these columns/fields: [FIELDS].
Rules:
- If a value is missing or unclear, use 'N/A'
- Do not include navigation, ads, footers, or unrelated content
- Preserve original values exactly (numbers, currencies, dates)
- Include ALL matching items, not just the first few
- For each item, also extract the URL/link if available"
)

tabs_context_mcp(createIfEmpty=true)
navigate(url=TARGET_URL, tabId=TAB)
computer(action="wait", duration=3, tabId=TAB)
find(query="cookie consent or accept button", tabId=TAB)
read_page(tabId=TAB)
get_page_text(tabId=TAB)
find(query="[DESCRIPTION]", tabId=TAB)
javascript_tool

// Table extraction
const rows = document.querySelectorAll('TABLE_SELECTOR tr');
const data = Array.from(rows).map(row => {
const cells = row.querySelectorAll('td, th');
return Array.from(cells).map(c => c.textContent.trim());
});
JSON.stringify(data);

// List/card extraction
const items = document.querySelectorAll('ITEM_SELECTOR');
const data = Array.from(items).map(item => ({
field1: item.querySelector('FIELD1_SELECTOR')?.textContent?.trim() || null,
field2: item.querySelector('FIELD2_SELECTOR')?.textContent?.trim() || null,
link: item.querySelector('a')?.href || null,
}));
JSON.stringify(data);

computer(action="scroll", scroll_direction="down", tabId=TAB)
computer(action="wait", duration=2, tabId=TAB)

javascript_tool

const scripts = document.querySelectorAll('script[type="application/ld+json"]');
const data = Array.from(scripts).map(s => {
try { return JSON.parse(s.textContent); } catch { return null; }
}).filter(Boolean);
JSON.stringify(data);

`?page=N`, `/page/N`, `&offset=N`

computer(action="scroll", scroll_direction="down", tabId=TAB)
computer(action="wait", duration=2, tabId=TAB)
find(query="load more button", tabId=TAB)
computer(action="left_click", ref=REF, tabId=TAB)

"Extract ALL rows from the table(s) on this page.
Return as a markdown table with exact column headers.
Include every row - do not truncate or summarize.
Preserve numeric precision, currencies, and units."

"Extract each [ITEM_TYPE] from this page.
For each item, extract: [FIELD_LIST].
Return as a JSON array of objects with these keys: [KEY_LIST].
Include ALL items, not just the first few. Include link/URL for each item if available."

"Extract article metadata:
- title, author, date, tags/categories, word count estimate
- Key factual data points, statistics, and named entities
Return as structured markdown. Summarize the content; do not reproduce full text."

"Extract product data with these exact fields:
- name, brand, price, currency, originalPrice (if discounted),
availability, description (first 200 chars), rating, reviewCount,
specifications (as key-value pairs), productUrl, imageUrl
Return as JSON. Use null for missing fields."

`Product`

"Extract contact information for each person/entity:
- name, title, role, email, phone, address, organization, website, linkedinUrl
Return as a markdown table. Only extract real contacts visible on the page."

"Extract all question-answer pairs from this page.
For each FAQ item extract:
- question: the exact question text
- answer: the answer text (first 300 chars if long)
- category: the section/category if grouped
Return as a JSON array of objects."

"Extract all pricing plans/tiers from this page.
For each plan extract:
- planName, monthlyPrice, annualPrice, currency
- features (array of included features)
- limitations (array of limits or excluded features)
- ctaText (call-to-action button text)
- highlighted (true if marked as recommended/popular)
Return as JSON. Use null for missing fields."

"Extract all events/sessions from this page.
For each event extract:
- title, date, time, endTime, location, description (first 200 chars)
- speakers (array of names), category, registrationUrl
Return as JSON. Use null for missing fields."

"Extract all job listings from this page.
For each job extract:
- title, company, location, salary, salaryRange, type (full-time/part-time/contract)
- postedDate, description (first 200 chars), applyUrl, tags
Return as JSON. Use null for missing fields."

`javascript_tool`, `source`

| Transform | Action |
|---|---|
| Whitespace cleanup | Trim, collapse multiple spaces, remove |
| HTML entity decode | |
| Unicode normalization | NFKC normalization for consistent characters |
| Empty string to null | |
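The four cleanup steps in the table above can be chained into one helper. This is a minimal sketch under stated assumptions: `cleanValue` and `decodeEntities` are hypothetical names, the entity table covers only a handful of named entities, and all whitespace runs (including line breaks) are collapsed to a single space.

```javascript
// Tiny entity table for illustration only; a real decoder covers far more names.
const NAMED_ENTITIES = { amp: "&", lt: "<", gt: ">", quot: '"', apos: "'", nbsp: "\u00a0" };

// Decode numeric (&#169; / &#xA9;) and a few named (&amp;) HTML entities.
function decodeEntities(text) {
  return text.replace(/&(#x[0-9a-f]+|#\d+|[a-z]+);/gi, (match, body) => {
    if (body[0] === "#") {
      const isHex = body[1] === "x" || body[1] === "X";
      const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Leave out-of-range references untouched instead of throwing.
      return code <= 0x10ffff ? String.fromCodePoint(code) : match;
    }
    // Unknown names are left as-is.
    return NAMED_ENTITIES[body.toLowerCase()] ?? match;
  });
}

// Entity decode -> NFKC normalization -> whitespace cleanup -> empty string to null.
function cleanValue(raw) {
  if (raw === null || raw === undefined) return null;
  const text = decodeEntities(String(raw))
    .normalize("NFKC")     // e.g. full-width letters and non-breaking spaces
    .replace(/\s+/g, " ")  // collapse runs of whitespace
    .trim();
  return text === "" ? null : text;
}
```

Decoding runs before normalization so that a decoded `&nbsp;` is folded into an ordinary space by NFKC and then collapsed with its neighbors.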

| Transform | When | Action |
|---|---|---|
| Price normalization | Product/pricing modes | Extract numeric value + currency symbol |
| Date normalization | Any dates found | Normalize to ISO-8601 (YYYY-MM-DD) |
| URL resolution | Relative URLs extracted | Convert to absolute URLs |
| Phone normalization | Contact mode | Standardize to E.164 format if possible |
| Deduplication | Multi-page or multi-URL | Remove exact duplicate rows |
| Sorting | User requested or natural | Sort by user-specified field |
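Three of the conditional transforms above are sketched below: price normalization, URL resolution, and deduplication. The function names are hypothetical, the price parser assumes US-style separators (comma for thousands, dot for decimals) and a small set of currency symbols, and deduplication treats rows as equal only when they serialize identically. Date and phone normalization are left out because they depend on locale rules the source does not specify.

```javascript
// Split a price string into a numeric amount and an optional currency symbol.
// Assumes "1,299.99"-style numbers; returns null when no number is found.
function normalizePrice(text) {
  const match = String(text).match(/([$€£¥])?\s*(\d[\d,]*(?:\.\d+)?)/);
  if (!match) return null;
  return {
    amount: Number(match[2].replace(/,/g, "")),
    currency: match[1] ?? null,
  };
}

// Resolve a possibly relative href against the page URL it was found on.
function resolveUrl(href, pageUrl) {
  try {
    return new URL(href, pageUrl).href;
  } catch {
    return null; // unparseable input
  }
}

// Remove exact duplicate rows, keeping the first occurrence in source order.
// Rows with the same values but a different key order are NOT treated as equal.
function dedupeRows(rows) {
  const seen = new Set();
  return rows.filter((row) => {
    const key = JSON.stringify(row);
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}
```

Keeping the original currency symbol next to the parsed amount preserves enough information to audit the normalization later.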

| Enrichment | When | Action |
|---|---|---|
| Currency conversion | User asks for single currency | Note original + convert (approximate) |
| Domain extraction | URLs in data | Add domain column from full URLs |
| Word count | Article mode | Count words in extracted text |
| Relative dates | Dates present | Add "X days ago" column if useful |
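The simpler enrichments above can be sketched with built-in APIs only. The helper names are hypothetical; `daysAgo` assumes the date was already normalized to `YYYY-MM-DD`, which JavaScript parses as UTC midnight. Currency conversion is omitted because it needs an exchange-rate source the document does not name.

```javascript
// Domain column: hostname of a full URL, or null when the URL cannot be parsed.
function extractDomain(url) {
  try {
    return new URL(url).hostname;
  } catch {
    return null;
  }
}

// Word count for article mode: whitespace-separated tokens.
function wordCount(text) {
  const trimmed = String(text).trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// Whole days between an ISO date (YYYY-MM-DD) and a reference time.
function daysAgo(isoDate, now = new Date()) {
  const then = Date.parse(isoDate);
  if (Number.isNaN(then)) return null;
  return Math.floor((now.getTime() - then) / 86_400_000);
}
```

Each helper returns `null` (or `0` for an empty text) rather than throwing, so a single malformed value does not abort enrichment of the whole dataset.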

| Check | Action |
|---|---|
| Item count | Compare extracted count to expected count from recon |
| Empty fields | Count N/A or null values per field |
| Data type consistency | Numbers should be numeric, dates parseable |
| Duplicates | Flag exact duplicate rows (post-dedup) |
| Encoding | Check for HTML entities, garbled characters |
| Completeness | All user-requested fields present in output |
| Truncation | Verify data wasn't cut off (check last items) |
| Outliers | Flag values that seem anomalous (e.g. $0.00 price) |
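Several of the checks above (item count, empty fields, duplicates) reduce to simple counting. A minimal sketch follows, assuming extracted items are plain objects; `validateItems` and its report shape are hypothetical, and the remaining checks (encoding, truncation, outliers) are not covered.

```javascript
// A value counts as empty if it is missing, blank, or the "N/A" placeholder.
function isEmptyValue(value) {
  return value === null || value === undefined || value === "" || value === "N/A";
}

// Build a small validation report for a list of extracted items.
function validateItems(items, fields, expectedCount) {
  // Empty fields: count empty values per requested field.
  const emptyByField = {};
  for (const field of fields) {
    emptyByField[field] = items.filter((item) => isEmptyValue(item[field])).length;
  }

  // Duplicates: rows that serialize identically are exact duplicates.
  const uniqueRows = new Set(items.map((item) => JSON.stringify(item)));

  return {
    itemCount: items.length,
    countMatches: items.length === expectedCount, // compare against recon estimate
    emptyByField,
    duplicateCount: items.length - uniqueRows.size,
  };
}
```

Because a requested field that is absent from every item reads as `undefined`, the completeness check falls out of the same per-field count.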

| Rating | Criteria |
|---|---|
| HIGH | All fields populated, count matches expected, no anomalies |
| MEDIUM | Minor gaps (<10% empty fields) or count slightly differs |
| LOW | Significant gaps (10% or more empty fields), structural issues, partial data |

Confidence: HIGH - 47 items extracted, all 6 fields populated, matches expected count from page analysis.
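The rating criteria can be approximated from a few counts. The sketch below is one possible reading, not a definitive rule: `rateConfidence` and its input fields are hypothetical, an empty-field share of exactly 10% is treated as LOW, and "structural issues" are not modeled.

```javascript
// Rate an extraction as HIGH, MEDIUM, or LOW from simple counts.
function rateConfidence({ itemCount, expectedCount, emptyCells, totalCells, anomalies = 0 }) {
  // With no cells at all there is nothing to trust.
  const emptyRatio = totalCells === 0 ? 1 : emptyCells / totalCells;

  // LOW: nothing extracted, or significant gaps.
  if (itemCount === 0 || emptyRatio >= 0.1) return "LOW";

  // HIGH: fully populated, count matches recon, no anomalies flagged.
  if (emptyRatio === 0 && itemCount === expectedCount && anomalies === 0) return "HIGH";

  // MEDIUM: minor gaps or a count that differs from the estimate.
  return "MEDIUM";
}
```

For the example line above, 47 items with 6 fields give 282 cells; with no empty cells and a matching count the function returns `"HIGH"`.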

| Issue | Auto-Recovery Action |
|---|---|
| Missing data | Re-attempt with Browser if WebFetch was used |
| Encoding problems | Apply HTML entity decode + unicode normalization |
| Incomplete results | Check for pagination or lazy-loading, fetch more |
| Count mismatch | Scroll/paginate to find remaining items |
| All fields empty | Page likely JS-rendered, switch to Browser strategy |
| Partial fields | Try JSON-LD extraction as supplement |

`:------:`, `...`, `N/A`

`productName`, `unitPrice`

{
"metadata": {
"source": "URL",
"title": "Page Title",
"extractedAt": "ISO-8601",
"itemCount": 47,
"fieldCount": 6,
"confidence": "HIGH",
"strategy": "A",
"transforms": ["deduplication", "priceNormalization"],
"notes": []
},
"data": [ ... ]
}
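The JSON envelope above can be assembled mechanically once extraction and validation are done. A minimal sketch, assuming a hypothetical `buildEnvelope` helper; `itemCount` and `fieldCount` are derived from the data rather than passed in, and `fieldCount` counts distinct keys across all items.

```javascript
// Wrap extracted items in the metadata envelope used for JSON output.
function buildEnvelope(data, { source, title, strategy, confidence, transforms = [], notes = [] }) {
  // Distinct field names across all items, in case some items lack a key.
  const fieldNames = new Set(data.flatMap((item) => Object.keys(item)));

  return {
    metadata: {
      source,
      title,
      extractedAt: new Date().toISOString(), // ISO-8601 timestamp
      itemCount: data.length,
      fieldCount: fieldNames.size,
      confidence,
      strategy,
      transforms,
      notes,
    },
    data,
  };
}
```

Deriving the counts from `data` keeps the metadata from drifting out of sync with the payload after filtering or deduplication.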

`,`, `# Source: URL`

`.md`, `.json`, `.csv`

`Source`

`[NEW]`, `[REMOVED]`, `[WAS: old_value]`

| User Says... | Mode | Strategy | Output Default |
|---|---|---|---|
| "extract the table" | table | A or B | Markdown table |
| "get all products/prices" | product | E then A | Markdown table |
| "scrape the listings" | list | A or B | Markdown table |
| "extract contact info / team page" | contact | A | Markdown table |
| "get the article data" | article | A | Markdown text |
| "extract the FAQ" | faq | A or B | JSON |
| "get pricing plans" | pricing | A or B | Markdown table |
| "scrape job listings" | jobs | A or B | Markdown table |
| "get event schedule" | events | A or B | Markdown table |
| "find and extract [topic]" | discovery | WebSearch | Markdown table |
| "compare prices across sites" | multi-URL | A or B | Comparison table |
| "what changed since last time" | diff | any | Diff format |
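The quick-reference table above can be encoded as a lookup plus a keyword matcher. This is only a sketch: the regular expressions are hypothetical heuristics written to match the example phrases in the table, and real requests would need broader patterns or a clarifying question.

```javascript
// Strategy and default output per mode, copied from the table.
const MODE_DEFAULTS = {
  table: { strategy: "A or B", output: "Markdown table" },
  product: { strategy: "E then A", output: "Markdown table" },
  list: { strategy: "A or B", output: "Markdown table" },
  contact: { strategy: "A", output: "Markdown table" },
  article: { strategy: "A", output: "Markdown text" },
  faq: { strategy: "A or B", output: "JSON" },
  pricing: { strategy: "A or B", output: "Markdown table" },
  jobs: { strategy: "A or B", output: "Markdown table" },
  events: { strategy: "A or B", output: "Markdown table" },
  discovery: { strategy: "WebSearch", output: "Markdown table" },
  "multi-URL": { strategy: "A or B", output: "Comparison table" },
  diff: { strategy: "any", output: "Diff format" },
};

// Ordered rules: more specific intents are tested before generic ones
// (e.g. "job listings" must hit jobs before list). Patterns are assumptions.
const MODE_RULES = [
  ["diff", /what changed|since last/],
  ["multi-URL", /compare|across sites/],
  ["discovery", /^find\b/],
  ["faq", /\bfaq\b/],
  ["pricing", /pricing|plans/],
  ["product", /product|price/],
  ["jobs", /\bjobs?\b/],
  ["events", /\bevents?\b/],
  ["contact", /contact|team page/],
  ["article", /article/],
  ["table", /table/],
  ["list", /listing|\blist\b/],
];

// Return { mode, strategy, output } for a request, or null when nothing matches.
function detectMode(request) {
  const text = request.toLowerCase();
  const rule = MODE_RULES.find(([, pattern]) => pattern.test(text));
  return rule ? { mode: rule[0], ...MODE_DEFAULTS[rule[0]] } : null;
}
```

A `null` result is the cue to fall back to the CLARIFY step rather than guess a mode.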