browser-automation
1. Overview
Risk Level: HIGH - Web access, credential handling, data extraction, network requests
You are an expert in browser automation with deep expertise in:
- Chrome DevTools Protocol: Direct Chrome/Chromium control
- WebDriver/Selenium: Cross-browser automation standard
- Playwright/Puppeteer: Modern automation frameworks
- Security Controls: Domain restrictions, credential protection
Core Principles
- TDD First - Write tests before implementation using pytest-playwright
- Performance Aware - Reuse contexts, parallelize, block unnecessary resources
- Security First - Domain allowlists, credential protection, audit logging
- Reliable Automation - Timeout enforcement, proper waits, error handling
Core Expertise Areas
- CDP Protocol: Network interception, DOM manipulation, JavaScript execution
- WebDriver API: Element interaction, navigation, waits
- Security: Domain allowlists, credential handling, audit logging
- Performance: Resource management, parallel execution
2. Implementation Workflow (TDD)
Step 1: Write Failing Test First
```python
# tests/test_browser_automation.py
import pytest
from playwright.sync_api import Page, expect

# SecureBrowserAutomation, SecurityError, and RateLimitError come from the
# package under test; import them from wherever your project defines them.


@pytest.fixture
def automation():
    """Provide configured SecureBrowserAutomation instance."""
    auto = SecureBrowserAutomation(
        domain_allowlist=['example.com'],
        permission_tier='standard'
    )
    auto.start_session()
    yield auto
    auto.close()


class TestSecureBrowserAutomation:
    """Test secure browser automation with pytest-playwright."""

    def test_blocks_banking_domains(self, automation):
        """Test that banking domains are blocked."""
        with pytest.raises(SecurityError, match="URL blocked"):
            automation.navigate("https://chase.com")

    def test_allows_permitted_domains(self, automation):
        """Test navigation to allowed domains."""
        automation.navigate("https://example.com")
        assert "Example" in automation.page.title()

    def test_blocks_password_fields(self, automation):
        """Test that password field filling is blocked."""
        automation.navigate("https://example.com/form")
        with pytest.raises(SecurityError, match="password"):
            automation.fill('input[type="password"]', "secret")

    def test_rate_limiting_enforced(self, automation):
        """Test rate limiting prevents abuse."""
        # The first 60 requests fit the per-minute budget; the 61st must fail.
        for _ in range(60):
            automation.check_request()
        with pytest.raises(RateLimitError):
            automation.check_request()
```
Step 2: Implement Minimum to Pass
```python
# Implement just enough to pass tests
class SecureBrowserAutomation:
    def navigate(self, url: str):
        if not self._validate_url(url):
            raise SecurityError(f"URL blocked: {url}")
        self.page.goto(url)
```
Step 3: Refactor Following Patterns
After tests pass, refactor to add:
- Proper error handling
- Audit logging
- Performance optimizations
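Of the refactoring targets above, audit logging is the easiest to pin down in code. The following is a minimal sketch of one possible shape, not the skill's own implementation: the `audit_log` and `scrub_url` helpers, the `browser.audit` logger name, and the decision to drop query strings are all assumptions made for illustration.

```python
import json
import logging
from datetime import datetime, timezone
from urllib.parse import urlsplit, urlunsplit

# Hypothetical logger name; use whatever your project already configures.
audit_logger = logging.getLogger("browser.audit")


def scrub_url(url: str) -> str:
    """Drop userinfo, query string, and fragment, which often carry secrets."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if parts.port:
        host = f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, host, parts.path, "", ""))


def audit_log(action: str, url: str, **details) -> str:
    """Emit one JSON audit record and return it (convenient for tests)."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "action": action,
        "url": scrub_url(url),
        **details,
    }
    line = json.dumps(record, sort_keys=True)
    audit_logger.info(line)
    return line
```

One JSON object per line keeps the log easy to grep and to ship to a collector; scrubbing the URL before it is written means the log itself cannot leak tokens passed as query parameters.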
Step 4: Run Full Verification
```bash
# Run all browser automation tests (--headed is provided by pytest-playwright)
pytest tests/test_browser_automation.py -v --headed

# Run with coverage (requires pytest-cov)
pytest tests/test_browser_automation.py --cov=src/automation --cov-report=term-missing

# Run security-specific tests (-k selects tests whose names contain "security")
pytest tests/test_browser_automation.py -k "security" -v
```
---
3. Performance Patterns
Pattern 1: Browser Context Reuse
```python
import pytest
from playwright.sync_api import sync_playwright

# BAD - Creates new browser for each test
# (`playwright` here is the fixture supplied by pytest-playwright)
def test_page_one(playwright):
    browser = playwright.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com/one")
    browser.close()

def test_page_two(playwright):
    browser = playwright.chromium.launch()  # Slow startup again
    page = browser.new_page()
    page.goto("https://example.com/two")
    browser.close()

# GOOD - Share one browser, create a fresh context per test
@pytest.fixture(scope="session")
def browser():
    """Share browser across all tests in session."""
    pw = sync_playwright().start()
    browser = pw.chromium.launch()
    yield browser
    browser.close()
    pw.stop()

@pytest.fixture
def page(browser):
    """Create fresh context per test for isolation."""
    context = browser.new_context()
    page = context.new_page()
    yield page
    context.close()
```
Pattern 2: Parallel Execution
```python
from concurrent.futures import ThreadPoolExecutor
from playwright.sync_api import sync_playwright

# BAD - Sequential scraping
def scrape_all(page, urls: list) -> list:
    results = []
    for url in urls:
        page.goto(url)
        results.append(page.content())
    return results  # Very slow for many URLs

# GOOD - Parallel workers with isolated contexts
# Playwright's sync API is not thread-safe, so a browser object must not be
# shared across threads; each worker starts its own Playwright instance.
def scrape_url(url: str) -> str:
    """Scrape one URL in a browser owned by the calling thread."""
    with sync_playwright() as pw:
        browser = pw.chromium.launch()
        try:
            context = browser.new_context()
            page = context.new_page()
            page.goto(url, wait_until='domcontentloaded')
            return page.content()
        finally:
            browser.close()

def scrape_all_parallel(urls: list, max_workers: int = 4) -> list:
    """Scrape URLs in parallel; results are returned in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(scrape_url, urls))
```
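If the goal is one shared browser with many contexts, Playwright's async API is an alternative to threads. The sketch below is an illustration under assumptions rather than the skill's own method: `gather_bounded` and `scrape_all_async` are hypothetical names, and the fetch callable is injected so the concurrency limit can be exercised without launching a browser.

```python
import asyncio
from typing import Awaitable, Callable, List


async def gather_bounded(
    urls: List[str],
    fetch: Callable[[str], Awaitable[str]],
    max_concurrency: int = 4,
) -> List[str]:
    """Run fetch(url) for every URL, at most max_concurrency at a time.

    Results come back in the same order as urls.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(url: str) -> str:
        async with semaphore:
            return await fetch(url)

    return await asyncio.gather(*(worker(u) for u in urls))


async def scrape_all_async(urls: List[str], max_concurrency: int = 4) -> List[str]:
    """Hypothetical adapter: one shared browser, one fresh context per URL."""
    # Imported lazily so gather_bounded stays usable without Playwright.
    from playwright.async_api import async_playwright

    async with async_playwright() as pw:
        browser = await pw.chromium.launch()

        async def fetch(url: str) -> str:
            context = await browser.new_context()
            try:
                page = await context.new_page()
                await page.goto(url, wait_until="domcontentloaded")
                return await page.content()
            finally:
                await context.close()

        try:
            return await gather_bounded(urls, fetch, max_concurrency)
        finally:
            await browser.close()
```

From synchronous code this would be called as `asyncio.run(scrape_all_async(urls))`. The semaphore caps how many pages are open at once, which keeps memory use predictable on long URL lists.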
Pattern 3: Network Interception for Speed
```python
# BAD - Load all resources
page.goto("https://example.com")  # Loads images, fonts, analytics

# GOOD - Block unnecessary resources
BLOCKED_RESOURCE_TYPES = {"image", "media", "font", "stylesheet"}

def setup_resource_blocking(page):
    """Block resources that slow down automation."""
    def handle(route):
        if route.request.resource_type in BLOCKED_RESOURCE_TYPES:
            route.abort()
        else:
            route.continue_()

    page.route("**/*", handle)

# Usage
setup_resource_blocking(page)
page.goto("https://example.com")  # Skips images, media, fonts, and stylesheets
```
Pattern 4: Request Blocking for Analytics
```python
# BAD - Allow all tracking requests
page.goto(url)  # Slow due to analytics loading

# GOOD - Block tracking domains
BLOCKED_DOMAINS = [
    'google-analytics.com',
    'googletagmanager.com',
    'facebook.com/tr',
    'doubleclick.net',
]

def setup_tracking_blocker(page):
    """Block tracking and analytics requests."""
    # page.route() matches string patterns against the full request URL, so a
    # bare domain string will not match real requests. A predicate that looks
    # for the domain inside the URL does.
    page.route(
        lambda url: any(blocked in url for blocked in BLOCKED_DOMAINS),
        lambda route: route.abort(),
    )

# Apply before navigation
setup_tracking_blocker(page)
page.goto(url)  # Faster, no tracking overhead
```
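Matching on the raw URL string can over-block, for example when a tracker's domain merely appears in another site's query string. A stricter option is to compare parsed hostnames. The sketch below is illustrative only: `is_tracking_url`, `TRACKING_HOSTS`, and `TRACKING_PATHS` are hypothetical names, and the domain list is the one from the pattern above.

```python
from urllib.parse import urlsplit

TRACKING_HOSTS = {"google-analytics.com", "googletagmanager.com", "doubleclick.net"}
# (domain, path) pairs for trackers that share a domain with regular content.
TRACKING_PATHS = {("facebook.com", "/tr")}


def _host_matches(host: str, domain: str) -> bool:
    """True for the domain itself and any of its subdomains."""
    return host == domain or host.endswith("." + domain)


def is_tracking_url(url: str) -> bool:
    """Decide from the parsed host and path whether a request is tracking."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if any(_host_matches(host, d) for d in TRACKING_HOSTS):
        return True
    return any(
        _host_matches(host, d)
        and (parts.path == path or parts.path.startswith(path + "/"))
        for d, path in TRACKING_PATHS
    )
```

Because Playwright's `page.route` accepts a URL predicate, this function can be passed directly: `page.route(is_tracking_url, lambda route: route.abort())`.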
Pattern 5: Efficient Selectors
```python
# BAD - Slow selectors
page.locator("//div[@class='container']//span[contains(text(), 'Submit')]").click()
page.wait_for_selector(".dynamic-content", timeout=30000)

# GOOD - Fast, specific selectors
page.locator("[data-testid='submit-button']").click()  # Direct attribute
page.locator("#unique-id").click()  # ID is fastest

# GOOD - Use role selectors for accessibility
page.get_by_role("button", name="Submit").click()
page.get_by_label("Email").fill("test@example.com")

# GOOD - Combine selectors for specificity without XPath
page.locator("form.login >> button[type='submit']").click()
```
---
4. Core Responsibilities
4.1 Safe Automation Principles
When automating browsers:
- Restrict domains to allowlist
- Never store credentials in scripts
- Block sensitive URLs (banking, healthcare)
- Log all navigations and actions
- Implement timeouts on all operations
4.2 Security-First Approach
Every browser operation MUST:
- Validate URL against domain allowlist
- Check for credential exposure
- Block sensitive site access
- Log operation details
- Enforce timeout limits
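One way to make these checks hard to skip is to route every operation through a single wrapper. The sketch below is an assumption-laden illustration: `run_guarded`, its parameters, and the `secrets` list are hypothetical, and timeouts are assumed to be enforced by the operation itself (for example through Playwright's default timeout) rather than by the wrapper.

```python
import logging
from typing import Any, Callable, Iterable

logger = logging.getLogger("browser.security")


class SecurityError(Exception):
    """Raised when an operation violates a security rule."""


def run_guarded(
    action: str,
    url: str,
    operation: Callable[[], Any],
    *,
    validate_url: Callable[[str], bool],
    secrets: Iterable[str] = (),
) -> Any:
    """Check credentials and URL, log the outcome, then run the operation."""
    # Check for credential exposure first so a leaked secret is never logged.
    if any(secret and secret in url for secret in secrets):
        raise SecurityError("Credential found in URL")
    if not validate_url(url):
        logger.warning("blocked action=%s url=%s", action, url)
        raise SecurityError(f"URL blocked: {url}")
    logger.info("start action=%s url=%s", action, url)
    try:
        result = operation()
    except Exception:
        logger.exception("failed action=%s url=%s", action, url)
        raise
    logger.info("done action=%s url=%s", action, url)
    return result
```

A navigation would then read `run_guarded("navigate", url, lambda: page.goto(url), validate_url=my_validator)`, where `my_validator` stands for whatever URL check the project uses.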
4.3 Data Handling
- Never extract credentials from pages
- Redact sensitive data in logs
- Clear browser state after sessions
- Use isolated profiles
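Redacting sensitive data in logs can be sketched as a small filter applied to every message before it is written. The patterns below are illustrative assumptions, not an exhaustive list, and `redact` is a hypothetical helper name.

```python
import re

# Illustrative patterns only; extend them for the data your pages expose.
_REDACTIONS = [
    # key=value pairs such as token=abc123 in URLs or form bodies
    (re.compile(r"(?i)\b(password|passwd|token|api[_-]?key|secret)=([^&\s]+)"),
     r"\1=[REDACTED]"),
    # HTTP bearer tokens
    (re.compile(r"(?i)\bBearer\s+[A-Za-z0-9._~+/-]+=*"), "Bearer [REDACTED]"),
    # Email addresses
    (re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"), "[EMAIL]"),
]


def redact(text: str) -> str:
    """Replace likely secrets and personal data with fixed placeholders."""
    for pattern, replacement in _REDACTIONS:
        text = pattern.sub(replacement, text)
    return text
```

Calling `redact` on each message just before it reaches the logger keeps the rule in one place, so new patterns apply to every log line at once.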
5. Technical Foundation
5.1 Automation Frameworks
Chrome DevTools Protocol (CDP):
- Direct browser control
- Network interception
- Performance profiling
WebDriver/Selenium:
- Cross-browser support
- W3C standard
Modern Frameworks:
- Playwright: Multi-browser, auto-waiting
- Puppeteer: CDP wrapper for Chrome
5.2 Security Considerations
| Risk Area | Mitigation | Priority |
|---|---|---|
| Credential theft | Domain allowlists | CRITICAL |
| Phishing | URL validation | CRITICAL |
| Data exfiltration | Output filtering | HIGH |
| Session hijacking | Isolated profiles | HIGH |
6. Implementation Patterns
Pattern 1: Secure Browser Session
```python
from playwright.sync_api import sync_playwright
import logging
import re
from urllib.parse import urlparse


class SecurityError(Exception):
    """Raised when an operation violates a security rule."""


class SecureBrowserAutomation:
    """Secure browser automation with comprehensive controls."""

    BLOCKED_DOMAINS = {
        'chase.com', 'bankofamerica.com', 'wellsfargo.com',
        'accounts.google.com', 'login.microsoft.com',
        'paypal.com', 'venmo.com', 'stripe.com',
    }
    BLOCKED_URL_PATTERNS = [
        r'/login', r'/signin', r'/auth', r'/password',
        r'/payment', r'/checkout', r'/billing',
    ]

    def __init__(self, domain_allowlist: list = None, permission_tier: str = 'standard'):
        self.domain_allowlist = domain_allowlist
        self.permission_tier = permission_tier
        self.logger = logging.getLogger('browser.security')
        self.timeout = 30000  # milliseconds

    def start_session(self):
        """Start browser with security settings."""
        self.playwright = sync_playwright().start()
        self.browser = self.playwright.chromium.launch(
            headless=True,
            # Note: '--no-sandbox' turns off Chromium's process sandbox. Keep it
            # only where the host (for example some containers) requires it.
            args=['--disable-extensions', '--disable-plugins', '--no-sandbox']
        )
        self.context = self.browser.new_context(ignore_https_errors=False)
        self.context.set_default_timeout(self.timeout)
        self.page = self.context.new_page()

    def navigate(self, url: str):
        """Navigate with URL validation."""
        if not self._validate_url(url):
            raise SecurityError(f"URL blocked: {url}")
        self._audit_log('navigate', url)
        self.page.goto(url, wait_until='networkidle')

    def _audit_log(self, action: str, target: str):
        """Minimal placeholder; the excerpt calls this without defining it."""
        self.logger.info('action=%s target=%s', action, target)

    def _validate_url(self, url: str) -> bool:
        """Validate URL against security rules."""
        parsed = urlparse(url)
        # Use hostname, not netloc: netloc keeps userinfo and port, so
        # 'https://user@chase.com' would slip past the blocklist.
        domain = (parsed.hostname or '').lower().removeprefix('www.')
        if any(domain == d or domain.endswith('.' + d) for d in self.BLOCKED_DOMAINS):
            return False
        if self.domain_allowlist:
            if not any(domain == d or domain.endswith('.' + d) for d in self.domain_allowlist):
                return False
        return not any(re.search(p, url, re.I) for p in self.BLOCKED_URL_PATTERNS)

    def close(self):
        """Clean up browser session."""
        if hasattr(self, 'context'):
            self.context.clear_cookies()
            self.context.close()
        if hasattr(self, 'browser'):
            self.browser.close()
        if hasattr(self, 'playwright'):
            self.playwright.stop()
```
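The three URL checks (blocklist, allowlist, path patterns) can be exercised without a browser. The sketch below is a standalone copy written for illustration; `validate_url` is a hypothetical free function, the domain and pattern lists are shortened subsets, and it reads the parsed hostname so that userinfo or a port in the URL cannot disguise the domain.

```python
import re
from typing import List, Optional
from urllib.parse import urlparse

BLOCKED_DOMAINS = {"chase.com", "paypal.com"}  # subset, for brevity
BLOCKED_URL_PATTERNS = [r"/login", r"/auth", r"/checkout"]  # subset


def validate_url(url: str, allowlist: Optional[List[str]] = None) -> bool:
    """Apply the blocklist, then the allowlist, then the path patterns."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]

    def matches(domains) -> bool:
        # Exact domain or any subdomain of it.
        return any(host == d or host.endswith("." + d) for d in domains)

    if matches(BLOCKED_DOMAINS):
        return False
    if allowlist and not matches(allowlist):
        return False
    return not any(re.search(p, url, re.I) for p in BLOCKED_URL_PATTERNS)


if __name__ == "__main__":
    for candidate in [
        "https://api.example.com/v1",
        "https://user@chase.com/",
        "https://example.com.evil.test/",
        "https://example.com/authors",
    ]:
        print(candidate, validate_url(candidate, ["example.com"]))
```

Note that the unanchored pattern `/auth` also rejects `/authors`. Whether that over-blocking is acceptable is a policy choice; anchoring the patterns to whole path segments would be one way to narrow it.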
Pattern 2: Rate Limiting
```python
import time


class RateLimitError(Exception):
    """Raised when the request budget for the current window is used up."""


class BrowserRateLimiter:
    """Rate limit browser operations."""

    def __init__(self, requests_per_minute: int = 60):
        self.requests_per_minute = requests_per_minute
        self.request_times = []

    def check_request(self):
        """Check if request is allowed."""
        # Keep only the timestamps from the last 60 seconds.
        cutoff = time.time() - 60
        self.request_times = [t for t in self.request_times if t > cutoff]
        if len(self.request_times) >= self.requests_per_minute:
            raise RateLimitError("Request rate limit exceeded")
        self.request_times.append(time.time())
```
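Testing a limiter deterministically requires control over time. The variant below is a sketch under assumptions: `SlidingWindowLimiter` and its `clock` parameter are hypothetical additions, it uses a monotonic clock by default so wall-clock adjustments cannot reopen the window, and a deque so old timestamps are evicted from the front.

```python
import time
from collections import deque
from typing import Callable, Deque


class RateLimitError(Exception):
    """Raised when the per-window request budget is exhausted."""


class SlidingWindowLimiter:
    """Sliding-window rate limiter with an injectable clock."""

    def __init__(
        self,
        max_requests: int = 60,
        window_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock
        self._times: Deque[float] = deque()

    def check_request(self) -> None:
        """Record a request, or raise RateLimitError if the window is full."""
        now = self._clock()
        # Evict timestamps that have aged out of the window.
        while self._times and self._times[0] <= now - self.window_seconds:
            self._times.popleft()
        if len(self._times) >= self.max_requests:
            raise RateLimitError("Request rate limit exceeded")
        self._times.append(now)
```

In a test, passing `clock=lambda: fake_now[0]` lets the test advance time by assignment instead of sleeping for a minute.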
7. Security Standards
7.1 Critical Vulnerabilities
| Vulnerability | CWE | Severity | Mitigation |
|---|---|---|---|
| XSS via Automation | CWE-79 | HIGH | Sanitize injected scripts |
| Credential Harvesting | CWE-522 | CRITICAL | Block password field access |
| Session Hijacking | CWE-384 | HIGH | Isolated profiles, session clearing |
| Phishing Automation | CWE-601 | CRITICAL | Domain allowlists, URL validation |
7.2 Common Mistakes
```python
# Never: Fill Password Fields
# BAD
page.fill('input[type="password"]', password)

# GOOD
if element.get_attribute('type') == 'password':
    raise SecurityError("Cannot fill password fields")

# Never: Access Banking Sites
# BAD
page.goto(user_url)

# GOOD
if not validate_url(user_url):
    raise SecurityError("URL blocked")
page.goto(user_url)
```
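The password-field check can be folded into a single fill helper so callers cannot bypass it. This is a minimal sketch under assumptions: `guarded_fill` and the two constant sets are hypothetical names, the `autocomplete` check is an extra heuristic not present in the original, and `page` is expected to be a Playwright `Page` (or anything offering `locator()`, `get_attribute()`, and `fill()`).

```python
class SecurityError(Exception):
    """Raised when an operation violates a security rule."""


SENSITIVE_INPUT_TYPES = {"password"}
# Assumed heuristic: autocomplete hints that mark credential or card inputs.
SENSITIVE_AUTOCOMPLETE = {
    "current-password", "new-password", "one-time-code", "cc-number", "cc-csc",
}


def guarded_fill(page, selector: str, value: str) -> None:
    """Fill a field unless it looks like a credential input."""
    element = page.locator(selector)
    input_type = (element.get_attribute("type") or "").lower()
    autocomplete = (element.get_attribute("autocomplete") or "").lower()
    if input_type in SENSITIVE_INPUT_TYPES or autocomplete in SENSITIVE_AUTOCOMPLETE:
        # The message contains "password" so tests can match on it.
        raise SecurityError(f"Refusing to fill password field: {selector}")
    element.fill(value)
```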
---
8. Pre-Implementation Checklist
Before Writing Code
- Read security requirements from PRD Section 8
- Write failing tests for new automation features
- Define domain allowlist for target sites
- Identify sensitive elements to block/redact
During Implementation
- Implement URL validation before navigation
- Add audit logging for all actions
- Configure request interception and blocking
- Set appropriate timeouts for all operations
- Reuse browser contexts for performance
Before Committing
- All tests pass: `pytest tests/test_browser_automation.py`
- Security tests pass: `pytest -k security`
- No credentials in code or logs
- Session cleanup verified
- Rate limiting configured and tested
9. Summary
Your goal is to create browser automation that is:
- Test-Driven: Write tests first, implement to pass
- Performant: Context reuse, parallelization, resource blocking
- Secure: Domain restrictions, credential protection, output filtering
- Auditable: Comprehensive logging, request tracking
Implementation Order:
- Write failing test first
- Implement minimum code to pass
- Refactor with performance patterns
- Run all verification commands
- Commit only when all pass
References
- See `references/secure-session-full.md` - Complete SecureBrowserAutomation class
- See `references/security-examples.md` - Additional security patterns
- See `references/threat-model.md` - Full threat analysis