scraper-builder

Scraper Builder


Generate complete, runnable web scraper projects using the PageObject pattern with Playwright and TypeScript. This skill produces site-specific scrapers with typed data extraction, Docker deployment, and optional agent-browser integration for automated site analysis.

When to Use This Skill

Use this skill when:
  • Building a site-specific web scraper for data extraction
  • Generating PageObject classes for a target website
  • Scaffolding a complete scraper project with Docker support
  • Using agent-browser to analyze a site and auto-generate selectors
  • Creating reusable scraping components (pagination, data tables)
Do NOT use this skill when:
  • Building API clients (use HTTP client libraries directly)
  • Writing QA/E2E test suites (use Playwright test runner with test-focused patterns)
  • Mass crawling or spidering entire domains (use Crawlee or Scrapy)
  • Scraping sites that require authentication bypass or CAPTCHA solving

Core Principles

1. PageObject Encapsulation

Each page on the target site maps to one PageObject class. Locators are defined in the constructor, and scraping logic lives in methods. Page objects never contain assertions or business logic — they extract and return data.

2. Selector Resilience

Prefer selectors in this order: `data-testid` > `id` > semantic HTML (`role`, `aria-label`) > structured CSS classes > text content. Avoid positional selectors (`nth-child`) and layout-dependent paths. See `references/playwright-selectors.md` for the full hierarchy.
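
The hierarchy above can be expressed as a small helper. This is an illustrative sketch only, not part of the generated templates: the `pickSelector` name and the candidate-map shape are assumptions.

```typescript
// Illustrative only: given the selector candidates discovered for a field,
// pick the most resilient one according to the hierarchy above.
type SelectorKind = "data-testid" | "id" | "role" | "css-class" | "text";

const RESILIENCE_ORDER: SelectorKind[] = [
  "data-testid",
  "id",
  "role",
  "css-class",
  "text",
];

export function pickSelector(
  candidates: Partial<Record<SelectorKind, string>>,
): string | undefined {
  // Walk the hierarchy from most to least resilient.
  for (const kind of RESILIENCE_ORDER) {
    const selector = candidates[kind];
    if (selector) return selector;
  }
  return undefined;
}
```

A generator following this rule would emit `#add-to-cart` over a text selector whenever both were discovered for the same field.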

3. Composition Over Inheritance

Reusable UI patterns (pagination, data tables, search bars) are modeled as component classes that page objects compose via properties. Only `BasePage` uses inheritance — everything else composes.

4. Typed Data Extraction

All scraped data flows through Zod schemas for validation. This catches selector drift (when a site changes its markup) at extraction time rather than downstream. See `assets/templates/data-schema.ts.md`.

5. Docker-First Deployment

Generated projects include a Dockerfile using Microsoft's official Playwright images and a docker-compose.yml with volume mounts for output data and debug screenshots. This ensures consistent browser environments across machines.

Generation Modes

Mode 1: Agent-Browser Analysis

Use `agent-browser` to navigate the target site, capture accessibility tree snapshots, and automatically discover selectors. This is the preferred mode when the agent has access to the agent-browser CLI.

Prerequisites: If `agent-browser` is not already installed, add it as a skill first:

```bash
npx skills add vercel-labs/agent-browser
```

Workflow:

1. Open the target page

```bash
agent-browser open https://example.com/products
```

2. Capture interactive snapshot with element references

```bash
agent-browser snapshot -i --json > snapshot.json
```

3. Capture scoped sections for focused analysis

```bash
agent-browser snapshot -i --json -s "main" > main-content.json
agent-browser snapshot -i --json -s "nav" > navigation.json
```

4. Test dynamic behavior (pagination, load-more)

```bash
agent-browser click @e3
agent-browser wait --load networkidle
agent-browser snapshot -i --json > after-click.json
```

5. Close when done

```bash
agent-browser close
```

**What the agent does with snapshots:**

1. Parse element references (`@e1`, `@e2`, etc.) and their roles
2. Group elements by semantic purpose (navigation, data display, forms, actions)
3. Map data elements to fields (title, price, image, etc.)
4. Generate PageObject classes with discovered selectors
5. Identify pagination and dynamic loading patterns

See `references/agent-browser-workflow.md` for the complete workflow reference.

Mode 2: Manual Description

The user describes the target site's page structure and the agent maps it to page objects. The agent asks structured questions:
  1. What pages to scrape? — List of URLs or page types
  2. What data to extract? — Field names and expected types per page
  3. How is data paginated? — Numbered pages, load-more, infinite scroll, or single page
  4. What selectors are known? — Any CSS selectors, data-testid values, or XPath the user already knows
The agent then:
  • Matches the description to a site archetype from `data/site-archetypes.json`
  • Proposes a page object map with class names and responsibilities
  • Generates code after the user confirms the plan
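
The matching step can be sketched as keyword scoring. This is hypothetical: the real format of `data/site-archetypes.json` is not shown here, so the `SiteArchetype` shape and scoring rule are assumptions for illustration.

```typescript
// Hypothetical sketch of archetype matching: score each archetype by how
// many of its keywords appear in the user's description, pick the best.
interface SiteArchetype {
  name: string;
  keywords: string[];
}

export function matchArchetype(
  description: string,
  archetypes: SiteArchetype[],
): SiteArchetype | undefined {
  const text = description.toLowerCase();
  let best: SiteArchetype | undefined;
  let bestScore = 0;
  for (const archetype of archetypes) {
    const score = archetype.keywords.filter((k) => text.includes(k)).length;
    if (score > bestScore) {
      bestScore = score;
      best = archetype;
    }
  }
  // No keyword overlap means no confident match; fall back to asking the user.
  return best;
}
```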

Mode 3: Full Project Scaffold

Generate a complete runnable project in one operation using the scaffolder script:

```bash
deno run --allow-read --allow-write scripts/scaffold-scraper-project.ts \
  --name "my-scraper" \
  --url "https://example.com" \
  --pages "ProductListing,ProductDetail" \
  --fields "title,price,image_url,description"
```

This produces a project with all source files, configuration, Docker setup, and an entry point ready to run. See the Scripts Reference section for full options.

Quick Reference

| Category | Approach | Details |
| --- | --- | --- |
| Framework | Playwright | `playwright` package, not `@playwright/test` |
| Language | TypeScript | Strict mode, ES2022 target |
| Pattern | PageObject | One class per page, compose components |
| Selectors | Resilient | data-testid > id > role > CSS class > text |
| Wait strategy | Auto-wait | Playwright built-in, plus `networkidle` for navigation |
| Validation | Zod | Schema per page object's output type |
| Output | JSON + CSV | Configurable via storage utility |
| Docker | Official image | `mcr.microsoft.com/playwright:v1.48.0-jammy` |
| Retry | Exponential backoff | 3 attempts default, configurable |
| Screenshots | On error | Saved to `screenshots/` for debugging |
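
The retry row above can be sketched as a generic backoff wrapper. This is a minimal illustration, not the template's actual implementation: the `withRetry` name and the 500 ms base delay are assumptions.

```typescript
// Minimal exponential-backoff sketch: 3 attempts by default, with the
// delay doubling on each retry (base, 2x base, 4x base, ...).
export async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Back off before the next attempt; no delay after the final failure.
      if (attempt < attempts - 1) {
        await new Promise((resolve) =>
          setTimeout(resolve, baseDelayMs * 2 ** attempt)
        );
      }
    }
  }
  throw lastError;
}
```

A runner would wrap each page navigation or extraction call in `withRetry(() => page.scrapeProducts())` so transient network failures do not abort the whole job.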

Generation Process

Follow this sequence when generating a scraper:

Step 1: Gather Requirements

Ask the user for:
  • Target site URL(s)
  • Data fields to extract
  • Number of pages/items expected
  • Output format preference (JSON, CSV, both)
  • Whether Docker deployment is needed

Step 2: Analyze the Site

Use Mode 1 (agent-browser) or Mode 2 (manual description) to understand:
  • Page structure and navigation flow
  • Data element locations and selector strategies
  • Pagination or infinite scroll patterns
  • Dynamic content loading behavior

Step 3: Design the Page Object Map

Create a plan listing:
  • Each PageObject class and its URL pattern
  • Component classes needed (Pagination, DataTable, etc.)
  • Data schema fields and types per page
  • The scraper's navigation flow between pages

Step 4: Present the Plan

Show the user the page object map before generating code. Include class names, field names, and the execution flow. Wait for confirmation.

Step 5: Generate Code

Use the templates in `assets/templates/` as the foundation:
  • base-page.ts.md — BasePage abstract class
  • page-object.ts.md — Site-specific page object
  • component.ts.md — Reusable components
  • scraper-runner.ts.md — Orchestrator
  • data-schema.ts.md — Zod validation schemas

Step 6: Deliver

Provide the complete project with:
  • All source files
  • Configuration files from `assets/configs/`
  • A README explaining how to run it
  • Docker setup (unless explicitly excluded)

Code Patterns

BasePage

Abstract class providing `navigate()`, `waitForPageLoad()`, `screenshot()`, and `getText()` helpers. All page objects extend this.

```typescript
export abstract class BasePage {
  constructor(protected readonly page: Page) {}
  async navigate(url: string): Promise<void> { /* ... */ }
  async screenshot(name: string): Promise<void> { /* ... */ }
}
```

See: `assets/templates/base-page.ts.md`

PageObject

Site-specific class with locators as readonly properties, scrape methods returning typed data, and navigation methods for multi-page flows.

```typescript
export class ProductListingPage extends BasePage {
  readonly productCards: Locator;
  readonly nextButton: Locator;
  async scrapeProducts(): Promise<Product[]> { /* ... */ }
  async goToNextPage(): Promise<boolean> { /* ... */ }
}
```

See: `assets/templates/page-object.ts.md`

Component

Reusable UI pattern (Pagination, DataTable) that receives a parent locator scope and provides extraction methods.

```typescript
export class Pagination {
  constructor(private page: Page, private scope: Locator) {}
  async hasNextPage(): Promise<boolean> { /* ... */ }
  async goToNext(): Promise<void> { /* ... */ }
}
```

See: `assets/templates/component.ts.md`

ScraperRunner

Orchestrator that launches the browser, creates page objects, iterates through pages, collects data, validates with schemas, and writes output.

```typescript
export class SiteScraper {
  async run(): Promise<void> {
    const browser = await chromium.launch();
    const page = await browser.newPage();
    // navigate, scrape, validate, write
  }
}
```

See: `assets/templates/scraper-runner.ts.md`
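
The runner's iteration step can be sketched independently of Playwright. The `ListingPage` interface below mirrors the `scrapeProducts()`/`goToNextPage()` shape of the page-object snippet, but the generic names and the `maxPages` guard are illustrative assumptions, not the template's API.

```typescript
// Sketch of the runner's page-iteration loop: scrape the current page,
// then advance until pagination is exhausted or a page cap is hit.
interface ListingPage<T> {
  scrapeItems(): Promise<T[]>;
  goToNextPage(): Promise<boolean>; // resolves false when there is no next page
}

export async function scrapeAllPages<T>(
  listing: ListingPage<T>,
  maxPages = 100, // guard against runaway pagination
): Promise<T[]> {
  const items: T[] = [];
  let pagesVisited = 0;
  do {
    items.push(...(await listing.scrapeItems()));
    pagesVisited++;
  } while (pagesVisited < maxPages && (await listing.goToNextPage()));
  return items;
}
```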

DataSchema

Zod schemas that validate scraped records, catching selector drift and malformed data at extraction time.

```typescript
export const ProductSchema = z.object({
  title: z.string().min(1),
  price: z.number().positive(),
});
```

See: `assets/templates/data-schema.ts.md`

Anti-Patterns

| Anti-Pattern | Problem | Solution |
| --- | --- | --- |
| Monolith Scraper | All scraping logic in one file | Split into PageObject classes per page |
| Sleep Waiter | Using `setTimeout`/fixed delays | Use Playwright auto-wait and `networkidle` |
| Unvalidated Pipeline | No schema validation on output | Add Zod schemas for every data type |
| Selector Lottery | Fragile positional selectors | Use resilient selector hierarchy |
| Silent Failure | Swallowing errors without logging | Log failures and save debug screenshots |
| Unthrottled Crawler | No delay between requests | Add configurable request delays |
| Hardcoded Config | URLs and selectors in code | Use environment variables and config files |
| No Retry Logic | Single attempt per request | Implement exponential backoff |

See `references/anti-patterns.md` for the extended catalog with examples and fixes.
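
The "Unthrottled Crawler" fix can be sketched as a throttle that enforces a minimum gap between requests. The `createThrottle` name and the idea of sharing one throttle across all page fetches are assumptions for illustration, not the template's implementation.

```typescript
// Minimal throttle sketch: each call waits until at least minDelayMs has
// elapsed since the previous call completed, then proceeds.
export function createThrottle(minDelayMs: number): () => Promise<void> {
  let lastRequestAt = 0;
  return async function throttle(): Promise<void> {
    const wait = lastRequestAt + minDelayMs - Date.now();
    if (wait > 0) {
      await new Promise((resolve) => setTimeout(resolve, wait));
    }
    lastRequestAt = Date.now();
  };
}
```

A runner would call `await throttle()` before each navigation so request spacing stays configurable in one place.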

Scripts Reference

scaffold-scraper-project.ts

Generate a complete scraper project:

```bash
deno run --allow-read --allow-write scripts/scaffold-scraper-project.ts [options]

Options:
  --name <name>       Project name (required)
  --path <path>       Target directory (default: ./)
  --url <url>         Target site base URL
  --pages <pages>     Comma-separated page names (e.g., ProductListing,ProductDetail)
  --fields <fields>   Comma-separated data fields (e.g., title,price,rating)
  --no-docker         Skip Docker setup
  --no-validation     Skip Zod validation setup
  --json              Output as JSON
  -h, --help          Show help

Examples:
  # Scaffold a product scraper
  deno run --allow-read --allow-write scripts/scaffold-scraper-project.ts \
    --name "shop-scraper" --url "https://shop.example.com" \
    --pages "ProductListing,ProductDetail" --fields "title,price,image_url"

  # Minimal scraper without Docker
  deno run --allow-read --allow-write scripts/scaffold-scraper-project.ts \
    --name "blog-scraper" --no-docker
```

generate-page-object.ts

Generate a single PageObject class for an existing project:

```bash
deno run --allow-read --allow-write scripts/generate-page-object.ts [options]

Options:
  --name <name>           Class name (required)
  --url <url>             Page URL (for documentation comment)
  --fields <fields>       Comma-separated data fields
  --selectors <json>      JSON map of field to selector
  --with-pagination       Include pagination methods
  --output <path>         Output file path (default: stdout)
  --json                  Output as JSON
  -h, --help              Show help

Examples:
  # Generate a page object with known selectors
  deno run --allow-read --allow-write scripts/generate-page-object.ts \
    --name "ProductListing" --url "https://shop.example.com/products" \
    --fields "title,price,rating" \
    --selectors '{"title":".product-title","price":".product-price","rating":".star-rating"}' \
    --with-pagination --output src/pages/ProductListingPage.ts

  # Quick generation to stdout
  deno run --allow-read scripts/generate-page-object.ts \
    --name "SearchResults" --fields "title,url,snippet"
```

Templates & References

Templates (assets/templates/)

| Template | Purpose |
| --- | --- |
| base-page.ts.md | Abstract BasePage with navigation, screenshots, text helpers |
| page-object.ts.md | Site-specific page object with locators and scrape methods |
| component.ts.md | Reusable components: Pagination, DataTable |
| scraper-runner.ts.md | Orchestrator: browser launch, iteration, collection, output |
| data-schema.ts.md | Zod schemas for scraped data validation |

Configs (assets/configs/)

| Config | Purpose |
| --- | --- |
| dockerfile.md | Multi-stage Dockerfile using official Playwright image |
| docker-compose.yml.md | Service with data/screenshots volume mounts |
| tsconfig.json.md | Strict TypeScript with ES2022 target |
| package.json.md | playwright, zod, tsx dependencies |
| playwright.config.ts.md | Scraper-focused Playwright configuration |

References (references/)

| Reference | Purpose |
| --- | --- |
| pageobject-pattern.md | PageObject pattern adapted for scraping |
| playwright-selectors.md | Selector strategies and resilience hierarchy |
| docker-setup.md | Docker configuration and deployment |
| agent-browser-workflow.md | Agent-browser analysis workflow |
| anti-patterns.md | Extended anti-pattern catalog |

Examples (assets/examples/)

| Example | Purpose |
| --- | --- |
| ecommerce-scraper.md | Complete multi-page product scraper walkthrough |
| multi-page-pagination.md | Pagination handling strategies |

Data Files (data/)

| File | Purpose |
| --- | --- |
| selector-patterns.json | Common selectors organized by UI element type |
| site-archetypes.json | Website structure archetypes with typical pages and fields |

Example Interaction

User: "I need a scraper for an online bookstore. I want to get book titles, authors, prices, and ratings from the catalog pages."

Agent workflow:
  1. Checks `site-archetypes.json` — matches the `ecommerce` archetype
  2. Proposes page object map:
    • BookListingPage — catalog with pagination
    • BookDetailPage — individual book page (if detail scraping needed)
    • Pagination component — shared pagination handler
  3. Presents the plan with field mapping:
    • title — `[itemprop="name"]` or `.book-title`
    • author — `[itemprop="author"]` or `.book-author`
    • price — `[itemprop="price"]` or `.price`
    • rating — `.star-rating` or `[data-rating]`
  4. After confirmation, generates using the scaffold script or manual code generation
  5. Delivers project with Docker setup and Zod schemas for the `Book` type

Integration

This skill connects to:
  • typescript-best-practices — TypeScript coding patterns used in generated code
  • devcontainer — Development container setup for the generated project
  • agent-browser — Site analysis and selector discovery (external tool)

What You Do NOT Do

This skill does NOT:
  • Bypass authentication or login walls
  • Solve CAPTCHAs or bot detection
  • Generate JavaScript-only output (always TypeScript)
  • Produce crawlers that spider entire domains
  • Create scrapers that violate robots.txt
  • Handle rate-limited APIs (use HTTP clients for API work)
  • Generate test suites (use Playwright test patterns for QA)