playwright-web-scraper
Playwright Web Scraper
Extract structured data from multiple web pages with respectful, ethical crawling practices.
When to Use This Skill
Use when extracting structured data from websites, triggered by requests such as "scrape data from", "extract information from pages", "collect data from site", or "crawl multiple pages".
Do NOT use for testing workflows (use `playwright-e2e-testing`), monitoring errors (use `playwright-console-monitor`), or analyzing network traffic (use `playwright-network-analyzer`). Always respect robots.txt and rate limits.
Quick Start
Scrape product listings from an e-commerce site:
```bash
# 1. Validate URLs
python scripts/validate_urls.py urls.txt
```

```javascript
// 2. Scrape pages with rate limiting
const results = [];
for (const url of urls) {
  await browser_navigate({ url });
  await browser_wait_for({ time: Math.random() * 2 + 1 }); // 1-3s delay
  const data = await browser_evaluate({
    function: `
      Array.from(document.querySelectorAll('.product')).map(el => ({
        title: el.querySelector('.title')?.textContent?.trim(),
        price: el.querySelector('.price')?.textContent?.trim(),
        url: el.querySelector('a')?.getAttribute('href')
      }))
    `
  });
  results.push(...data);
}
```

```bash
# 3. Process results
python scripts/process_results.py scraped.json -o products.csv
```

Table of Contents
- Core Workflow
- Rate Limiting Strategy
- URL Validation
- Data Extraction
- Error Handling
- Processing Results
- Supporting Files
- Expected Outcomes
Core Workflow
Step 1: Prepare URL List
Create a text file with URLs to scrape (one per line):
```
https://example.com/products?page=1
https://example.com/products?page=2
https://example.com/products?page=3
```

Validate URLs and check robots.txt compliance:

```bash
python scripts/validate_urls.py urls.txt --user-agent "MyBot/1.0"
```

Step 2: Initialize Scraping Session
Navigate to the site and take a snapshot to understand structure:
```javascript
await browser_navigate({ url: firstUrl });
await browser_snapshot();
```

Identify CSS selectors for data extraction using the snapshot.
Step 3: Implement Rate-Limited Crawling
Use random delays between requests (1-3 seconds minimum):
```javascript
const results = [];
for (const url of urlList) {
  // Navigate to page
  await browser_navigate({ url });

  // Wait for content to load
  await browser_wait_for({ text: 'Expected content marker' });

  // Add respectful delay (1-3 seconds)
  const delay = Math.random() * 2 + 1;
  await browser_wait_for({ time: delay });

  // Extract data
  const pageData = await browser_evaluate({
    function: `/* extraction code */`
  });
  results.push(...pageData);

  // Check console output for errors and rate limit warnings
  const consoleMessages = await browser_console_messages();
}
```

Step 4: Extract Structured Data
Use `browser_evaluate` to extract data with JavaScript:

```javascript
const data = await browser_evaluate({
  function: `
    try {
      return Array.from(document.querySelectorAll('.item')).map(el => ({
        title: el.querySelector('.title')?.textContent?.trim(),
        price: el.querySelector('.price')?.textContent?.trim(),
        rating: el.querySelector('.rating')?.textContent?.trim(),
        url: el.querySelector('a')?.getAttribute('href')
      })).filter(item => item.title && item.price); // Filter incomplete records
    } catch (e) {
      console.error('Extraction failed:', e);
      return [];
    }
  `
});
```

See `references/extraction-patterns.md` for comprehensive extraction patterns.

Step 5: Handle Errors and Rate Limits
Monitor for rate limiting indicators:
```javascript
// Check HTTP responses via browser_network_requests
const requests = await browser_network_requests();
const rateLimited = requests.some(r => r.status === 429 || r.status === 503);
if (rateLimited) {
  // Back off before retrying or skipping
  await browser_wait_for({ time: 10 }); // Wait 10 seconds
}

// Check console for blocking messages
const blockingMessages = await browser_console_messages({ pattern: 'rate limit|blocked|captcha' });
if (blockingMessages.length > 0) {
  // Handle blocking (back off, rotate, or stop)
}
```

Step 6: Aggregate and Store Results
Save results to JSON file:
```javascript
// In your scraping script
const fs = require('fs');
fs.writeFileSync('scraped.json', JSON.stringify({ results }, null, 2));
```

Process and convert to desired format:

```bash
# View statistics
python scripts/process_results.py scraped.json --stats

# Convert to CSV
python scripts/process_results.py scraped.json -o output.csv

# Convert to Markdown table
python scripts/process_results.py scraped.json -o output.md
```

Rate Limiting Strategy
Minimum Delays
Always add delays between requests:
- Standard sites: 1-3 seconds (random)
- High-traffic sites: 3-5 seconds
- Small sites: 5-10 seconds
- After errors: Exponential backoff (5s, 10s, 20s, 40s)
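The delay and backoff schedule above can be expressed as a small helper. This is a sketch; the function names (`polite_delay`, `backoff_schedule`) are illustrative and not part of this skill's scripts:

```python
import random

def polite_delay(base_min=1.0, base_max=3.0):
    """Random per-request delay for standard sites (1-3 seconds)."""
    return random.uniform(base_min, base_max)

def backoff_schedule(initial=5, retries=4):
    """Exponential backoff delays after errors: 5s, 10s, 20s, 40s."""
    return [initial * (2 ** i) for i in range(retries)]
```

Widen `base_min`/`base_max` for high-traffic or small sites per the table above.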
Implementation
```javascript
// Random delay between 1-3 seconds
const randomDelay = () => Math.random() * 2 + 1;
await browser_wait_for({ time: randomDelay() });

// Exponential backoff after rate limit
let backoffSeconds = 5;
for (let retry = 0; retry < 3; retry++) {
  try {
    await browser_navigate({ url });
    break; // Success
  } catch (e) {
    await browser_wait_for({ time: backoffSeconds });
    backoffSeconds *= 2; // Double delay each retry
  }
}
```

Adaptive Rate Limiting
Adjust delays based on response:
| Response Code | Action |
|---|---|
| 200 OK | Continue with normal delay (1-3s) |
| 429 Too Many Requests | Increase delay to 10s, retry |
| 503 Service Unavailable | Wait 60s, then retry |
| 403 Forbidden | Stop scraping this domain |
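A minimal sketch of the decision table above (the `next_action` helper is hypothetical, not part of the skill):

```python
def next_action(status):
    """Map an HTTP status code to (action, wait_seconds) per the table above."""
    if status == 200:
        return ("continue", 0)   # keep the normal 1-3 s delay
    if status == 429:
        return ("retry", 10)     # too many requests: increase delay
    if status == 503:
        return ("retry", 60)     # service unavailable: long wait
    if status == 403:
        return ("stop", 0)       # forbidden: stop scraping this domain
    return ("skip", 0)           # anything else: log and move on
```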
See `references/ethical-scraping.md` for detailed rate limiting strategies.

URL Validation
Use `validate_urls.py` before scraping to ensure compliance:

```bash
# Basic validation
python scripts/validate_urls.py urls.txt

# Check robots.txt with specific user agent
python scripts/validate_urls.py urls.txt --user-agent "MyBot/1.0"

# Strict mode (exit on any invalid/disallowed URL)
python scripts/validate_urls.py urls.txt --strict
```

**Output includes**:
- URL format validation
- Domain grouping
- robots.txt compliance check
- Summary statistics
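The robots.txt portion of this check can be approximated with Python's standard library. This is a sketch of the idea, not the actual `validate_urls.py` implementation:

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def is_allowed(url, user_agent="MyBot/1.0", robots_txt=None):
    """Check whether user_agent may fetch url under the site's robots.txt.

    Pass robots_txt as a string for offline checks; otherwise the parser
    fetches /robots.txt from the site (network access required).
    """
    rp = RobotFileParser()
    if robots_txt is not None:
        rp.parse(robots_txt.splitlines())
    else:
        parts = urlparse(url)
        rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
        rp.read()  # fetches robots.txt over the network
    return rp.can_fetch(user_agent, url)
```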
Data Extraction
Basic Pattern
```javascript
// Single page extraction
const data = await browser_evaluate({
  function: `
    Array.from(document.querySelectorAll('.item')).map(el => ({
      field1: el.querySelector('.selector1')?.textContent?.trim(),
      field2: el.querySelector('.selector2')?.getAttribute('href')
    }))
  `
});
```

Pagination Pattern
```javascript
let hasMore = true;
let page = 1;
while (hasMore) {
  await browser_navigate({ url: `${baseUrl}?page=${page}` });
  await browser_wait_for({ time: randomDelay() });

  const pageData = await browser_evaluate({ function: extractionCode });
  results.push(...pageData);

  // Check for next page
  hasMore = await browser_evaluate({
    function: `document.querySelector('.next:not(.disabled)') !== null`
  });
  page++;
}
```

See `references/extraction-patterns.md` for:
- Advanced selectors
- Data cleaning patterns
- Table extraction
- JSON-LD extraction
- Shadow DOM access
Error Handling
Network Errors
```javascript
for (const url of urlList) {
  try {
    await browser_navigate({ url });
  } catch (e) {
    console.error(`Failed to load ${url}:`, e);
    failedUrls.push(url);
    continue; // Skip to next URL
  }
  // ... extract data ...
}
```

Content Validation
```javascript
const data = await browser_evaluate({ function: extractionCode });
if (!data || data.length === 0) {
  console.warn(`No data extracted from ${url}`);
  // Log for manual review
}

// Validate data structure (guard against a null result)
const validData = (data || []).filter(item =>
  item.title && item.price // Ensure required fields exist
);
```

Monitoring Indicators
Check for blocking/errors:
```javascript
// Monitor console (avoid shadowing the global console object)
const consoleMessages = await browser_console_messages({
  pattern: 'error|rate|limit|captcha',
  onlyErrors: true
});
if (consoleMessages.length > 0) {
  console.log('Warnings detected:', consoleMessages);
}

// Monitor network
const requests = await browser_network_requests();
const errors = requests.filter(r => r.status >= 400);
```

Processing Results
View Statistics
```bash
python scripts/process_results.py scraped.json --stats
```

Output:

```
📊 Statistics:
Total records: 150
Fields (5): title, price, rating, url, image
Sample record: {...}
```

Convert Formats
```bash
# To CSV
python scripts/process_results.py scraped.json -o products.csv

# To JSON (compact)
python scripts/process_results.py scraped.json -o products.json --compact

# To Markdown table
python scripts/process_results.py scraped.json -o products.md
```
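The CSV conversion these commands perform is roughly equivalent to this standard-library sketch (illustrative only; `process_results.py` is the supported path and may behave differently):

```python
import csv
import json

def json_to_csv(in_path, out_path):
    """Write scraped records ({"results": [...]}) to CSV; returns row count."""
    with open(in_path) as f:
        records = json.load(f)["results"]
    if not records:
        return 0
    # Union of keys across all records, in first-seen order
    fields = list(dict.fromkeys(k for r in records for k in r))
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields, restval="")
        writer.writeheader()
        writer.writerows(records)  # missing fields become empty cells
    return len(records)
```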
Combine Statistics with Conversion
```bash
python scripts/process_results.py scraped.json -o products.csv --stats
```

Supporting Files
Scripts
- `scripts/validate_urls.py` - Validate URL lists, check robots.txt compliance, group by domain
- `scripts/process_results.py` - Convert scraped JSON to CSV/JSON/Markdown, view statistics
References
- `references/ethical-scraping.md` - Comprehensive guide to rate limiting, robots.txt, error handling, and monitoring
- `references/extraction-patterns.md` - JavaScript patterns for data extraction, selectors, pagination, tables
Expected Outcomes
Successful Scraping
```
✅ Validated 50 URLs
✅ Scraped 50 pages in 5 minutes (6 req/min)
✅ Extracted 1,250 products
✅ Zero rate limit errors
✅ Exported to products.csv (1,250 rows)
```

With Error Handling
```
⚠️ Validated 50 URLs (2 disallowed by robots.txt)
✅ Scraped 48 pages
⚠️ 3 pages returned no data (logged for review)
✅ Extracted 1,100 products
⚠️ 1 rate limit warning (backed off successfully)
✅ Exported to products.csv (1,100 rows)
```

Rate Limit Detection
```
❌ Rate limited after 20 pages (429 responses)
✅ Backed off exponentially (5s → 10s → 20s)
✅ Resumed scraping successfully
✅ Extracted 450 products from 25 pages
```

Expected Benefits
| Metric | Before | After |
|---|---|---|
| Setup time | 30-45 min | 5-10 min |
| Rate limit errors | Common | Rare |
| robots.txt violations | Possible | Prevented |
| Data format conversion | Manual | Automated |
| Error detection | Manual review | Automated monitoring |
Success Metrics
- Success rate > 95% (pages successfully scraped)
- Rate limit errors < 5% of requests
- Valid data rate > 90% (complete records)
- Scraping speed 6-12 requests/minute (polite crawling)
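These thresholds can be checked mechanically after a run. A sketch (the helper name and argument set are assumptions, not part of the skill's scripts):

```python
def scrape_metrics(pages_attempted, pages_ok, requests_total,
                   rate_limited, records_total, records_valid):
    """Compute the success metrics above as ratios, plus a pass/fail verdict."""
    m = {
        "success_rate": pages_ok / pages_attempted,
        "rate_limit_rate": rate_limited / requests_total,
        "valid_data_rate": records_valid / records_total,
    }
    # Thresholds mirror the list above: >95%, <5%, >90%
    m["ok"] = (m["success_rate"] > 0.95
               and m["rate_limit_rate"] < 0.05
               and m["valid_data_rate"] > 0.90)
    return m
```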
Requirements
Tools
- Playwright MCP browser tools
- Python 3.8+ (for scripts)
- Standard library only (no external dependencies for scripts)
Knowledge
- Basic CSS selectors
- JavaScript for data extraction
- Understanding of HTTP status codes
- Awareness of web scraping ethics
Red Flags to Avoid
- ❌ Scraping without checking robots.txt
- ❌ No delays between requests (hammering servers)
- ❌ Ignoring 429/503 response codes
- ❌ Scraping personal/private information
- ❌ Not monitoring console for blocking messages
- ❌ Scraping sites that explicitly prohibit it (check ToS)
- ❌ Using scraped data in violation of copyright
- ❌ Not handling pagination correctly (missing data)
- ❌ Hardcoding selectors without fallbacks
- ❌ Not validating extracted data structure
Notes
- Default to polite crawling: 1-3 second delays minimum, adjust based on site response
- Always check robots.txt first: Use `validate_urls.py` before scraping
- Monitor console and network: Watch for rate limit warnings and adjust delays
- Start small: Test with 5-10 URLs before scaling to hundreds
- Save progress: Write results incrementally in case of interruption
- Respect ToS: Some sites prohibit scraping in their terms of service
- Use descriptive user agents: Identify your bot clearly
- Handle errors gracefully: Log failures for manual review, don't crash
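For the "save progress" note above, appending one JSON object per line (JSON Lines) makes interrupted runs resumable. A sketch with hypothetical helper names:

```python
import json

def append_results(path, records):
    """Append each record as one JSON line so a crash loses at most one page."""
    with open(path, "a") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")

def load_results(path):
    """Reload everything written so far."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]
```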