# Playwright Web Scraper

Extract structured data from multiple web pages with respectful, ethical crawling practices.

## When to Use This Skill

Use this skill when extracting structured data from websites, for requests like "scrape data from", "extract information from pages", "collect data from site", or "crawl multiple pages".

Do NOT use it for testing workflows (use `playwright-e2e-testing`), monitoring errors (use `playwright-console-monitor`), or analyzing network traffic (use `playwright-network-analyzer`). Always respect robots.txt and rate limits.

## Quick Start

Scrape product listings from an e-commerce site:

```bash
# 1. Validate URLs
python scripts/validate_urls.py urls.txt
```

```javascript
// 2. Scrape pages with rate limiting
const results = [];
for (const url of urls) {
  await browser_navigate({ url });
  await browser_wait_for({ time: Math.random() * 2 + 1 }); // 1-3s delay

  const data = await browser_evaluate({
    function: `
      Array.from(document.querySelectorAll('.product')).map(el => ({
        title: el.querySelector('.title')?.textContent?.trim(),
        price: el.querySelector('.price')?.textContent?.trim(),
        url: el.querySelector('a')?.getAttribute('href')
      }))
    `
  });

  results.push(...data);
}
```

```bash
# 3. Process results
python scripts/process_results.py scraped.json -o products.csv
```

## Table of Contents

1. Core Workflow
2. Rate Limiting Strategy
3. URL Validation
4. Data Extraction
5. Error Handling
6. Processing Results
7. Supporting Files
8. Expected Outcomes

## Core Workflow

### Step 1: Prepare URL List

Create a text file with URLs to scrape (one per line):

```
https://example.com/products?page=1
https://example.com/products?page=2
https://example.com/products?page=3
```

Validate URLs and check robots.txt compliance:

```bash
python scripts/validate_urls.py urls.txt --user-agent "MyBot/1.0"
```

### Step 2: Initialize Scraping Session

Navigate to the site and take a snapshot to understand its structure:

```javascript
await browser_navigate({ url: firstUrl });
await browser_snapshot();
```

Identify CSS selectors for data extraction using the snapshot.

### Step 3: Implement Rate-Limited Crawling

Use random delays between requests (1-3 seconds minimum):

```javascript
const results = [];

for (const url of urlList) {
  // Navigate to page
  await browser_navigate({ url });

  // Wait for content to load
  await browser_wait_for({ text: 'Expected content marker' });

  // Add respectful delay (1-3 seconds)
  const delay = Math.random() * 2 + 1;
  await browser_wait_for({ time: delay });

  // Extract data
  const pageData = await browser_evaluate({
    function: `/* extraction code */`
  });

  results.push(...pageData);

  // Check console messages for rate limit warnings
  // (named `messages`, not `console`, to avoid shadowing the global)
  const messages = await browser_console_messages();
}
```

### Step 4: Extract Structured Data

Use `browser_evaluate` to extract data with JavaScript:

```javascript
const data = await browser_evaluate({
  function: `
    try {
      return Array.from(document.querySelectorAll('.item')).map(el => ({
        title: el.querySelector('.title')?.textContent?.trim(),
        price: el.querySelector('.price')?.textContent?.trim(),
        rating: el.querySelector('.rating')?.textContent?.trim(),
        url: el.querySelector('a')?.getAttribute('href')
      })).filter(item => item.title && item.price); // Filter incomplete records
    } catch (e) {
      console.error('Extraction failed:', e);
      return [];
    }
  `
});
```

See `references/extraction-patterns.md` for comprehensive extraction patterns.

### Step 5: Handle Errors and Rate Limits

Monitor for rate limiting indicators:

```javascript
// Check HTTP responses via browser_network_requests
const requests = await browser_network_requests();
const rateLimited = requests.some(r => r.status === 429 || r.status === 503);

if (rateLimited) {
  // Back off exponentially
  await browser_wait_for({ time: 10 }); // Wait 10 seconds
  // Retry or skip
}

// Check console messages for blocking indicators
const messages = await browser_console_messages({ pattern: 'rate limit|blocked|captcha' });
if (messages.length > 0) {
  // Handle blocking
}
```

### Step 6: Aggregate and Store Results

Save results to a JSON file:

```javascript
// In your scraping script (Node)
const fs = require('fs');
fs.writeFileSync('scraped.json', JSON.stringify({ results }, null, 2));
```

Process and convert to the desired format:

```bash
# View statistics
python scripts/process_results.py scraped.json --stats

# Convert to CSV
python scripts/process_results.py scraped.json -o output.csv

# Convert to Markdown table
python scripts/process_results.py scraped.json -o output.md
```

## Rate Limiting Strategy

### Minimum Delays

Always add delays between requests:

- Standard sites: 1-3 seconds (random)
- High-traffic sites: 3-5 seconds
- Small sites: 5-10 seconds
- After errors: exponential backoff (5s, 10s, 20s, 40s)

### Implementation

```javascript
// Random delay between 1-3 seconds
const randomDelay = () => Math.random() * 2 + 1;
await browser_wait_for({ time: randomDelay() });

// Exponential backoff after rate limit
let backoffSeconds = 5;
for (let retry = 0; retry < 3; retry++) {
  try {
    await browser_navigate({ url });
    break; // Success
  } catch (e) {
    await browser_wait_for({ time: backoffSeconds });
    backoffSeconds *= 2; // Double the delay on each retry
  }
}
```

### Adaptive Rate Limiting

Adjust delays based on the response:

| Response Code | Action |
| --- | --- |
| 200 OK | Continue with normal delay (1-3s) |
| 429 Too Many Requests | Increase delay to 10s, retry |
| 503 Service Unavailable | Wait 60s, then retry |
| 403 Forbidden | Stop scraping this domain |

See `references/ethical-scraping.md` for detailed rate limiting strategies.
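The adaptive policy above can be sketched as a small helper (a minimal sketch; the function name and return shape are illustrative, not part of this skill's scripts):

```javascript
// Map an HTTP status code to a crawl action and delay, mirroring the
// adaptive rate limiting table: 429 -> slow retry, 503 -> long wait,
// 403 -> stop the domain, otherwise continue with a polite random delay.
function ratePolicy(status) {
  if (status === 429) return { action: 'retry', delaySeconds: 10 };
  if (status === 503) return { action: 'retry', delaySeconds: 60 };
  if (status === 403) return { action: 'stop', delaySeconds: 0 };
  // 200 and other success codes: normal 1-3s random delay
  return { action: 'continue', delaySeconds: Math.random() * 2 + 1 };
}
```

In the crawl loop, call this with the status from `browser_network_requests` and wait for `delaySeconds` before the next navigation.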

## URL Validation

Use `validate_urls.py` before scraping to ensure compliance:

```bash
# Basic validation
python scripts/validate_urls.py urls.txt

# Check robots.txt with a specific user agent
python scripts/validate_urls.py urls.txt --user-agent "MyBot/1.0"

# Strict mode (exit on any invalid/disallowed URL)
python scripts/validate_urls.py urls.txt --strict
```

**Output includes**:

- URL format validation
- Domain grouping
- robots.txt compliance check
- Summary statistics
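To illustrate what the robots.txt check boils down to, here is a minimal allow/deny helper. It is a simplified sketch: it honors only literal `Disallow` path prefixes, ignores `Allow` and wildcards, and does not handle multi-agent groups; `validate_urls.py` remains the supported path.

```javascript
// Decide whether `path` is allowed for `userAgent`, given raw robots.txt text.
// Simplified: a "User-agent" line activates its following "Disallow" lines
// when it is "*" or a prefix of the given user agent (case-insensitive).
function isAllowed(robotsTxt, userAgent, path) {
  const disallows = [];
  let applies = false; // inside a group that applies to this user agent?
  for (const raw of robotsTxt.split('\n')) {
    const m = raw.trim().match(/^(user-agent|disallow)\s*:\s*(.*)$/i);
    if (!m) continue;
    const [, field, value] = m;
    if (field.toLowerCase() === 'user-agent') {
      applies = value === '*' ||
        userAgent.toLowerCase().startsWith(value.toLowerCase());
    } else if (applies && value) {
      disallows.push(value); // treat value as a literal path prefix
    }
  }
  return !disallows.some(prefix => path.startsWith(prefix));
}
```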

## Data Extraction

### Basic Pattern

```javascript
// Single page extraction
const data = await browser_evaluate({
  function: `
    Array.from(document.querySelectorAll('.item')).map(el => ({
      field1: el.querySelector('.selector1')?.textContent?.trim(),
      field2: el.querySelector('.selector2')?.getAttribute('href')
    }))
  `
});
```

### Pagination Pattern

```javascript
let hasMore = true;
let page = 1;

while (hasMore) {
  await browser_navigate({ url: `${baseUrl}?page=${page}` });
  await browser_wait_for({ time: randomDelay() });

  const pageData = await browser_evaluate({ function: extractionCode });
  results.push(...pageData);

  // Check for a next page
  hasMore = await browser_evaluate({
    function: `document.querySelector('.next:not(.disabled)') !== null`
  });

  page++;
}
```

See `references/extraction-patterns.md` for:

- Advanced selectors
- Data cleaning patterns
- Table extraction
- JSON-LD extraction
- Shadow DOM access
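As one taste of the JSON-LD pattern listed above, a helper that filters `Product` entries out of raw `<script type="application/ld+json">` contents might look like this (a minimal sketch; gather the script texts in the page first, e.g. via `browser_evaluate`):

```javascript
// Extract Product entries from JSON-LD script contents.
// `scriptTexts` is an array of strings, e.g. collected in the page with:
//   Array.from(document.querySelectorAll('script[type="application/ld+json"]'))
//        .map(s => s.textContent)
function extractProducts(scriptTexts) {
  const products = [];
  for (const text of scriptTexts) {
    try {
      const parsed = JSON.parse(text);
      // A script may hold one object or an array of objects
      const items = Array.isArray(parsed) ? parsed : [parsed];
      for (const item of items) {
        if (item && item['@type'] === 'Product') {
          products.push({ name: item.name, price: item.offers?.price });
        }
      }
    } catch (e) {
      // Skip malformed JSON-LD blocks rather than failing the page
    }
  }
  return products;
}
```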

## Error Handling

### Network Errors

```javascript
try {
  await browser_navigate({ url });
} catch (e) {
  console.error(`Failed to load ${url}:`, e);
  failedUrls.push(url);
  continue; // Skip to the next URL (inside the crawl loop)
}
```

### Content Validation

```javascript
const data = await browser_evaluate({ function: extractionCode });

if (!data || data.length === 0) {
  console.warn(`No data extracted from ${url}`);
  // Log for manual review
}

// Validate data structure (guard against a null result)
const validData = (data || []).filter(item =>
  item.title && item.price // Ensure required fields exist
);
```

### Monitoring Indicators

Check for blocking/errors:

```javascript
// Monitor console messages (named `messages` to avoid shadowing `console`)
const messages = await browser_console_messages({
  pattern: 'error|rate|limit|captcha',
  onlyErrors: true
});

if (messages.length > 0) {
  console.log('Warnings detected:', messages);
}

// Monitor network
const requests = await browser_network_requests();
const errors = requests.filter(r => r.status >= 400);
```

## Processing Results

### View Statistics

```bash
python scripts/process_results.py scraped.json --stats
```

Output:

```
📊 Statistics:
  Total records: 150
  Fields (5): title, price, rating, url, image
  Sample record: {...}
```

### Convert Formats

```bash
# To CSV
python scripts/process_results.py scraped.json -o products.csv

# To JSON (compact)
python scripts/process_results.py scraped.json -o products.json --compact

# To Markdown table
python scripts/process_results.py scraped.json -o products.md
```

### Combine Statistics with Conversion

```bash
python scripts/process_results.py scraped.json -o products.csv --stats
```

## Supporting Files

### Scripts

- `scripts/validate_urls.py` - Validate URL lists, check robots.txt compliance, group by domain
- `scripts/process_results.py` - Convert scraped JSON to CSV/JSON/Markdown, view statistics

### References

- `references/ethical-scraping.md` - Comprehensive guide to rate limiting, robots.txt, error handling, and monitoring
- `references/extraction-patterns.md` - JavaScript patterns for data extraction, selectors, pagination, tables

## Expected Outcomes

### Successful Scraping

```
✅ Validated 50 URLs
✅ Scraped 50 pages in 5 minutes (6 req/min)
✅ Extracted 1,250 products
✅ Zero rate limit errors
✅ Exported to products.csv (1,250 rows)
```

### With Error Handling

```
⚠️  Validated 50 URLs (2 disallowed by robots.txt)
✅ Scraped 48 pages
⚠️  3 pages returned no data (logged for review)
✅ Extracted 1,100 products
⚠️  1 rate limit warning (backed off successfully)
✅ Exported to products.csv (1,100 rows)
```

### Rate Limit Detection

```
❌ Rate limited after 20 pages (429 responses)
✅ Backed off exponentially (5s → 10s → 20s)
✅ Resumed scraping successfully
✅ Extracted 450 products from 25 pages
```

### Expected Benefits

| Metric | Before | After |
| --- | --- | --- |
| Setup time | 30-45 min | 5-10 min |
| Rate limit errors | Common | Rare |
| robots.txt violations | Possible | Prevented |
| Data format conversion | Manual | Automated |
| Error detection | Manual review | Automated monitoring |

### Success Metrics

- Success rate > 95% (pages successfully scraped)
- Rate limit errors < 5% of requests
- Valid data rate > 90% (complete records)
- Scraping speed 6-12 requests/minute (polite crawling)
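These metrics fall out of the bookkeeping the crawl loop already does (a minimal sketch; the `run` field names are illustrative, and "complete record" here means having both `title` and `price`, as in the extraction examples above):

```javascript
// Summarize a crawl run. `run` is assumed to track attempted/scraped page
// counts, request and rate-limit counts, elapsed time, and extracted records.
function crawlMetrics(run) {
  return {
    successRate: run.pagesScraped / run.pagesAttempted,
    rateLimitRate: run.rateLimitHits / run.requests,
    validDataRate:
      run.records.filter(r => r.title && r.price).length / run.records.length,
    requestsPerMinute: run.requests / (run.elapsedSeconds / 60)
  };
}
```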

## Requirements

### Tools

- Playwright MCP browser tools
- Python 3.8+ (for scripts)
- Standard library only (no external dependencies for scripts)

### Knowledge

- Basic CSS selectors
- JavaScript for data extraction
- Understanding of HTTP status codes
- Awareness of web scraping ethics

## Red Flags to Avoid

- ❌ Scraping without checking robots.txt
- ❌ No delays between requests (hammering servers)
- ❌ Ignoring 429/503 response codes
- ❌ Scraping personal/private information
- ❌ Not monitoring the console for blocking messages
- ❌ Scraping sites that explicitly prohibit it (check ToS)
- ❌ Using scraped data in violation of copyright
- ❌ Not handling pagination correctly (missing data)
- ❌ Hardcoding selectors without fallbacks
- ❌ Not validating extracted data structure
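For the selector-fallback point above, a small helper that tries alternatives in order is one way to avoid hardcoding a single selector (a sketch; the selector names are illustrative, and in practice this would run inside `browser_evaluate`):

```javascript
// Return the trimmed text of the first matching selector, or null.
// `root` is a DOM element or document; selectors are tried in order,
// so a site redesign that breaks the first selector degrades gracefully.
function firstText(root, selectors) {
  for (const sel of selectors) {
    const el = root.querySelector(sel);
    if (el && el.textContent) return el.textContent.trim();
  }
  return null;
}

// Usage inside an extraction function:
//   title: firstText(el, ['.title', 'h2', '[itemprop="name"]'])
```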

## Notes

- Default to polite crawling: 1-3 second delays minimum; adjust based on site response
- Always check robots.txt first: use `validate_urls.py` before scraping
- Monitor console and network: watch for rate limit warnings and adjust delays
- Start small: test with 5-10 URLs before scaling to hundreds
- Save progress: write results incrementally in case of interruption
- Respect ToS: some sites prohibit scraping in their terms of service
- Use descriptive user agents: identify your bot clearly
- Handle errors gracefully: log failures for manual review; don't crash