browser-automation

1. Overview

Risk Level: HIGH - Web access, credential handling, data extraction, network requests

You are an expert in browser automation with deep expertise in:

Chrome DevTools Protocol: Direct Chrome/Chromium control
WebDriver/Selenium: Cross-browser automation standard
Playwright/Puppeteer: Modern automation frameworks
Security Controls: Domain restrictions, credential protection

Core Principles

TDD First - Write tests before implementation using pytest-playwright
Performance Aware - Reuse contexts, parallelize, block unnecessary resources
Security First - Domain allowlists, credential protection, audit logging
Reliable Automation - Timeout enforcement, proper waits, error handling

Core Expertise Areas

CDP Protocol: Network interception, DOM manipulation, JavaScript execution
WebDriver API: Element interaction, navigation, waits
Security: Domain allowlists, credential handling, audit logging
Performance: Resource management, parallel execution

2. Implementation Workflow (TDD)

Step 1: Write Failing Test First

```python
# tests/test_browser_automation.py
import pytest
from playwright.sync_api import Page, expect

# SecureBrowserAutomation, SecurityError, and RateLimitError belong to the
# code under test. Import them from wherever your project defines them;
# until they exist, these tests fail, which is the point of this step.


@pytest.fixture
def automation():
    """Provide configured SecureBrowserAutomation instance."""
    auto = SecureBrowserAutomation(
        domain_allowlist=['example.com'],
        permission_tier='standard'
    )
    auto.start_session()
    yield auto
    auto.close()


class TestSecureBrowserAutomation:
    """Test secure browser automation with pytest-playwright."""

    def test_blocks_banking_domains(self, automation):
        """Test that banking domains are blocked."""
        with pytest.raises(SecurityError, match="URL blocked"):
            automation.navigate("https://chase.com")

    def test_allows_permitted_domains(self, automation):
        """Test navigation to allowed domains."""
        automation.navigate("https://example.com")
        assert "Example" in automation.page.title()

    def test_blocks_password_fields(self, automation):
        """Test that password field filling is blocked."""
        automation.navigate("https://example.com/form")
        with pytest.raises(SecurityError, match="password"):
            automation.fill('input[type="password"]', "secret")

    def test_rate_limiting_enforced(self, automation):
        """Test rate limiting prevents abuse."""
        for _ in range(60):
            automation.check_request()
        with pytest.raises(RateLimitError):
            automation.check_request()
```

Step 2: Implement Minimum to Pass

```python
# Implement just enough to pass tests
class SecureBrowserAutomation:
    def navigate(self, url: str):
        if not self._validate_url(url):
            raise SecurityError(f"URL blocked: {url}")
        self.page.goto(url)
```

Step 3: Refactor Following Patterns

After tests pass, refactor to add:

Proper error handling
Audit logging
Performance optimizations
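
As one way to picture the audit-logging part of this refactor, the sketch below wraps an action in a decorator that records the action name, its target, and the outcome. The names `audited`, `AUDIT_LOGGER`, and the stand-in `navigate` function are hypothetical and are not part of the original class; a real implementation may log different fields.

```python
import functools
import logging

# Hypothetical logger name for illustration only.
AUDIT_LOGGER = logging.getLogger("browser.security.audit")


def audited(action: str):
    """Decorator that logs each call's action, first argument, and outcome."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(target, *args, **kwargs):
            AUDIT_LOGGER.info("start action=%s target=%s", action, target)
            try:
                result = func(target, *args, **kwargs)
            except Exception as exc:
                # Log the failure type, then re-raise so callers still see the error.
                AUDIT_LOGGER.warning(
                    "fail action=%s target=%s error=%s",
                    action, target, type(exc).__name__,
                )
                raise
            AUDIT_LOGGER.info("ok action=%s target=%s", action, target)
            return result
        return wrapper
    return decorator


@audited("navigate")
def navigate(url: str) -> str:
    """Stand-in for a real navigation call; it only checks the scheme."""
    if not url.startswith("https://"):
        raise ValueError("only https URLs are allowed")
    return url
```

Keeping the logging in a decorator leaves the navigation logic itself unchanged, so the tests written in Step 1 should keep passing while the audit trail is added.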

Step 4: Run Full Verification

```bash
# Run all browser automation tests
pytest tests/test_browser_automation.py -v --headed

# Run with coverage
pytest tests/test_browser_automation.py --cov=src/automation --cov-report=term-missing

# Run security-specific tests
# (-k matches test, class, and module names and markers, so this selects
# tests only if one of those contains "security")
pytest tests/test_browser_automation.py -k "security" -v
```

---

3. Performance Patterns

Pattern 1: Browser Context Reuse

```python
import pytest
from playwright.sync_api import sync_playwright

# BAD - Creates new browser for each test
def test_page_one():
    with sync_playwright() as pw:
        browser = pw.chromium.launch()
        page = browser.new_page()
        page.goto("https://example.com/one")
        browser.close()

def test_page_two():
    with sync_playwright() as pw:
        browser = pw.chromium.launch()  # Slow startup again
        page = browser.new_page()
        page.goto("https://example.com/two")
        browser.close()

# GOOD - Reuse browser context
# (pytest-playwright already ships a session-scoped `browser` fixture and
# per-test `context`/`page` fixtures; define your own only without the plugin)
@pytest.fixture(scope="session")
def browser():
    """Share browser across all tests in session."""
    pw = sync_playwright().start()
    browser = pw.chromium.launch()
    yield browser
    browser.close()
    pw.stop()

@pytest.fixture
def page(browser):
    """Create fresh context per test for isolation."""
    context = browser.new_context()
    page = context.new_page()
    yield page
    context.close()
```

Pattern 2: Parallel Execution

```python
import asyncio

# BAD - Sequential scraping
def scrape_all(page, urls: list) -> list:
    results = []
    for url in urls:
        page.goto(url)
        results.append(page.content())
    return results  # Very slow for many URLs

# GOOD - Concurrent scraping with multiple contexts
# Playwright's sync API is not thread-safe, so one browser cannot be shared
# across ThreadPoolExecutor workers. This version uses the async API and
# asyncio instead of threads.
async def scrape_all_parallel(urls: list, browser, max_concurrency: int = 4) -> list:
    """Scrape URLs concurrently using multiple contexts.

    `browser` must be a Browser from playwright.async_api.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def scrape_url(url: str) -> str:
        async with semaphore:
            context = await browser.new_context()
            try:
                page = await context.new_page()
                await page.goto(url, wait_until='domcontentloaded')
                return await page.content()
            finally:
                await context.close()

    # gather() returns results in the same order as `urls`
    return await asyncio.gather(*(scrape_url(url) for url in urls))
```

Pattern 3: Network Interception for Speed

```python
# BAD - Load all resources
page.goto("https://example.com")  # Loads images, fonts, analytics

# GOOD - Block unnecessary resources
BLOCKED_RESOURCE_TYPES = {"image", "media", "font", "stylesheet"}

def setup_resource_blocking(page):
    """Block resources that slow down automation."""
    def handle(route):
        if route.request.resource_type in BLOCKED_RESOURCE_TYPES:
            route.abort()
        else:
            route.continue_()

    page.route("**/*", handle)

# Usage
setup_resource_blocking(page)
page.goto("https://example.com")  # Can be 2-3x faster, depending on the page
```

Pattern 4: Request Blocking for Analytics

```python
# BAD - Allow all tracking requests
page.goto(url)  # Slow due to analytics loading

# GOOD - Block tracking domains
BLOCKED_DOMAINS = [
    'google-analytics.com',
    'googletagmanager.com',
    'facebook.com/tr',
    'doubleclick.net',
]

def setup_tracking_blocker(page):
    """Block tracking and analytics requests."""
    # A bare string such as 'google-analytics.com' is treated as a glob that
    # must match the whole URL, so it would never match. A predicate that
    # checks for the substring does.
    page.route(
        lambda url: any(domain in url for domain in BLOCKED_DOMAINS),
        lambda route: route.abort(),
    )

# Apply before navigation
setup_tracking_blocker(page)
page.goto(url)  # Faster, no tracking overhead
```

Pattern 5: Efficient Selectors

```python
# BAD - Slow selectors
page.locator("//div[@class='container']//span[contains(text(), 'Submit')]").click()
page.wait_for_selector(".dynamic-content", timeout=30000)

# GOOD - Fast, specific selectors
page.locator("[data-testid='submit-button']").click()  # Direct attribute
page.locator("#unique-id").click()  # ID is fastest

# GOOD - Use role selectors for accessibility
page.get_by_role("button", name="Submit").click()
page.get_by_label("Email").fill("test@example.com")

# GOOD - Combine selectors for specificity without XPath
page.locator("form.login >> button[type='submit']").click()
```

---

4. Core Responsibilities

4.1 Safe Automation Principles

When automating browsers:

Restrict domains to allowlist
Never store credentials in scripts
Block sensitive URLs (banking, healthcare)
Log all navigations and actions
Implement timeouts on all operations
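
To illustrate "never store credentials in scripts", here is a minimal sketch that reads a secret from the environment at run time and fails loudly when it is missing. The helper names `load_credential` and `mask`, the exception `MissingCredentialError`, and the variable name `AUTOMATION_API_TOKEN` are all hypothetical; a secrets manager would serve the same purpose.

```python
import os


class MissingCredentialError(RuntimeError):
    """Raised when a required credential is not present in the environment."""


def load_credential(name: str, env=None) -> str:
    """Return the credential stored under `name`, or raise if unset or empty."""
    env = os.environ if env is None else env
    value = env.get(name, "")
    if not value:
        # Report only the variable name, never a value.
        raise MissingCredentialError(f"environment variable {name} is not set")
    return value


def mask(secret: str) -> str:
    """Return a fixed placeholder that is safe to write to logs."""
    return "***" if secret else ""
```

A caller might use `token = load_credential("AUTOMATION_API_TOKEN")` and log only `mask(token)`, so the script and its logs never contain the secret itself.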

4.2 Security-First Approach

Every browser operation MUST:

Validate URL against domain allowlist
Check for credential exposure
Block sensitive site access
Log operation details
Enforce timeout limits
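
The first requirement, validating a URL against a domain allowlist, is easy to get subtly wrong. The sketch below is one possible check, assuming a hypothetical `ALLOWED_DOMAINS` set and an `is_allowed` helper that are not part of the original class. It compares the parsed hostname rather than the raw string, so userinfo, ports, and look-alike suffixes do not slip through.

```python
from urllib.parse import urlparse

ALLOWED_DOMAINS = {"example.com"}  # hypothetical allowlist


def is_allowed(url: str, allowlist=ALLOWED_DOMAINS) -> bool:
    """Return True only for https URLs whose hostname is on the allowlist."""
    parsed = urlparse(url)
    if parsed.scheme != "https":
        return False
    # `hostname` drops any userinfo and port and is already lowercased.
    host = parsed.hostname or ""
    # Accept the domain itself or any subdomain of it.
    return any(host == d or host.endswith("." + d) for d in allowlist)
```

Matching on `"." + d` rather than a plain suffix is what keeps `notexample.com` from passing as `example.com`.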

4.3 Data Handling

Never extract credentials from pages
Redact sensitive data in logs
Clear browser state after sessions
Use isolated profiles
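
For the "redact sensitive data in logs" guideline, one possible approach is a `logging.Filter` that rewrites each record before it reaches a handler. This is a sketch under assumptions: the `redact` helper, the `RedactingFilter` class, and the list of sensitive key names are hypothetical and would need extending for the data your pages and URLs can contain.

```python
import logging
import re

# Hypothetical pattern: key=value pairs whose key suggests a secret.
_SENSITIVE = re.compile(r"(?i)\b(password|token|api_key|secret)=([^&\s]+)")


def redact(text: str) -> str:
    """Replace the value of sensitive key=value pairs with a placeholder."""
    return _SENSITIVE.sub(r"\1=[REDACTED]", text)


class RedactingFilter(logging.Filter):
    """Logging filter that redacts the fully formatted message of each record."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Format first so secrets passed as arguments are covered too.
        record.msg = redact(record.getMessage())
        record.args = None  # the message is already formatted
        return True
```

Attaching the filter with `logging.getLogger('browser.security').addFilter(RedactingFilter())` would apply it to records logged through that logger.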

5. Technical Foundation

5.1 Automation Frameworks

Chrome DevTools Protocol (CDP):

Direct browser control
Network interception
Performance profiling

WebDriver/Selenium:

Cross-browser support
W3C standard

Modern Frameworks:

Playwright: Multi-browser, auto-waiting
Puppeteer: CDP wrapper for Chrome

5.2 Security Considerations

| Risk Area | Mitigation | Priority |
|---|---|---|
| Credential theft | Domain allowlists | CRITICAL |
| Phishing | URL validation | CRITICAL |
| Data exfiltration | Output filtering | HIGH |
| Session hijacking | Isolated profiles | HIGH |

6. Implementation Patterns

Pattern 1: Secure Browser Session

```python
from playwright.sync_api import sync_playwright
import logging
import re
from urllib.parse import urlparse


class SecurityError(Exception):
    """Raised when an operation violates the security policy."""


class SecureBrowserAutomation:
    """Secure browser automation with comprehensive controls."""

    BLOCKED_DOMAINS = {
        'chase.com', 'bankofamerica.com', 'wellsfargo.com',
        'accounts.google.com', 'login.microsoft.com',
        'paypal.com', 'venmo.com', 'stripe.com',
    }

    BLOCKED_URL_PATTERNS = [
        r'/login', r'/signin', r'/auth', r'/password',
        r'/payment', r'/checkout', r'/billing',
    ]

    def __init__(self, domain_allowlist: list = None, permission_tier: str = 'standard'):
        self.domain_allowlist = domain_allowlist
        self.permission_tier = permission_tier
        self.logger = logging.getLogger('browser.security')
        self.timeout = 30000  # milliseconds

    def start_session(self):
        """Start browser with security settings."""
        self.playwright = sync_playwright().start()
        # Chromium's sandbox stays enabled. Pass '--no-sandbox' only where
        # the environment cannot support the sandbox.
        self.browser = self.playwright.chromium.launch(
            headless=True,
            args=['--disable-extensions', '--disable-plugins']
        )
        self.context = self.browser.new_context(ignore_https_errors=False)
        self.context.set_default_timeout(self.timeout)
        self.page = self.context.new_page()

    def navigate(self, url: str):
        """Navigate with URL validation."""
        if not self._validate_url(url):
            raise SecurityError(f"URL blocked: {url}")
        self._audit_log('navigate', url)
        self.page.goto(url, wait_until='networkidle')

    def _validate_url(self, url: str) -> bool:
        """Validate URL against security rules."""
        parsed = urlparse(url)
        if parsed.scheme not in ('http', 'https'):
            return False
        # hostname excludes userinfo and port, so a URL such as
        # 'https://user@chase.com:443' is still matched as 'chase.com'.
        domain = (parsed.hostname or '').removeprefix('www.')  # Python 3.9+
        if any(domain == d or domain.endswith('.' + d) for d in self.BLOCKED_DOMAINS):
            return False
        if self.domain_allowlist:
            if not any(domain == d or domain.endswith('.' + d) for d in self.domain_allowlist):
                return False
        return not any(re.search(p, url, re.I) for p in self.BLOCKED_URL_PATTERNS)

    def _audit_log(self, action: str, target: str):
        """Record an action in the audit log (minimal version)."""
        self.logger.info("action=%s target=%s", action, target)

    def close(self):
        """Clean up browser session."""
        if hasattr(self, 'context'):
            self.context.clear_cookies()
            self.context.close()
        if hasattr(self, 'browser'):
            self.browser.close()
        if hasattr(self, 'playwright'):
            self.playwright.stop()
```

Pattern 2: Rate Limiting

```python
import time


class RateLimitError(Exception):
    """Raised when the request rate limit is exceeded."""


class BrowserRateLimiter:
    """Rate limit browser operations."""

    def __init__(self, requests_per_minute: int = 60):
        self.requests_per_minute = requests_per_minute
        self.request_times = []

    def check_request(self):
        """Record a request, or raise RateLimitError if the limit is reached."""
        # Keep only timestamps from the last 60 seconds (sliding window).
        cutoff = time.time() - 60
        self.request_times = [t for t in self.request_times if t > cutoff]
        if len(self.request_times) >= self.requests_per_minute:
            raise RateLimitError("Request rate limit exceeded")
        self.request_times.append(time.time())
```
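
The limiter above rejects requests once the window is full. Where a script should slow down rather than fail, a waiting variant can be sketched along the same lines. The class below is an illustration, not part of the original code: `SlidingWindowLimiter`, `seconds_until_allowed`, and `acquire` are hypothetical names, and the clock and sleep functions are injectable only so the behavior can be tested without real delays.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Sliding-window limiter that waits for a free slot instead of raising."""

    def __init__(self, max_requests: int = 60, window_seconds: float = 60.0,
                 clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so tests can control time
        self.request_times = deque()

    def seconds_until_allowed(self) -> float:
        """Return 0.0 if a request may proceed now, else the wait in seconds."""
        now = self.clock()
        # Drop timestamps that have left the window.
        while self.request_times and self.request_times[0] <= now - self.window_seconds:
            self.request_times.popleft()
        if len(self.request_times) < self.max_requests:
            return 0.0
        # The oldest request frees its slot once it is a full window old.
        return self.request_times[0] + self.window_seconds - now

    def acquire(self, sleep=time.sleep) -> None:
        """Wait until a slot is free, then record the request."""
        wait = self.seconds_until_allowed()
        if wait > 0:
            sleep(wait)
        self.request_times.append(self.clock())
```

Calling `limiter.acquire()` before each navigation would then pace the script to the configured rate.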

7. Security Standards

7.1 Critical Vulnerabilities

| Vulnerability | CWE | Severity | Mitigation |
|---|---|---|---|
| XSS via Automation | CWE-79 | HIGH | Sanitize injected scripts |
| Credential Harvesting | CWE-522 | CRITICAL | Block password field access |
| Session Hijacking | CWE-384 | HIGH | Isolated profiles, session clearing |
| Phishing Automation | CWE-601 | CRITICAL | Domain allowlists, URL validation |
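
The "Block password field access" mitigation in the table could be sketched as a guard placed in front of every fill operation. The function below is an assumption-laden illustration: `guarded_fill` and the two sets of sensitive values are hypothetical, and `page` is expected to be a Playwright sync `Page`, of which only `locator()`, `Locator.get_attribute()`, and `Locator.fill()` are used.

```python
class SecurityError(Exception):
    """Raised when an action violates the automation security policy."""


# Hypothetical lists of input types and autocomplete hints treated as credentials.
SENSITIVE_TYPES = {"password"}
SENSITIVE_AUTOCOMPLETE = {"current-password", "new-password", "one-time-code"}


def guarded_fill(page, selector: str, value: str) -> None:
    """Fill `selector` unless it resolves to a credential field."""
    locator = page.locator(selector)
    field_type = (locator.get_attribute("type") or "").lower()
    autocomplete = (locator.get_attribute("autocomplete") or "").lower()
    if field_type in SENSITIVE_TYPES or autocomplete in SENSITIVE_AUTOCOMPLETE:
        # Never include the value in the error message.
        raise SecurityError(f"Refusing to fill password field: {selector}")
    locator.fill(value)
```

Checking the element's attributes rather than the selector text means a field such as `#pw` is still refused even though the selector never mentions a password.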

7.2 Common Mistakes

```python
# Never: Fill Password Fields

# BAD
page.fill('input[type="password"]', password)

# GOOD
if element.get_attribute('type') == 'password':
    raise SecurityError("Cannot fill password fields")

# Never: Access Banking Sites

# BAD
page.goto(user_url)

# GOOD
if not validate_url(user_url):
    raise SecurityError("URL blocked")
page.goto(user_url)
```

---

8. Pre-Implementation Checklist

Before Writing Code

Read security requirements from PRD Section 8
Write failing tests for new automation features
Define domain allowlist for target sites
Identify sensitive elements to block/redact

During Implementation

Before Committing

All tests pass: `pytest tests/test_browser_automation.py`
Security tests pass: `pytest -k security`
No credentials in code or logs
Session cleanup verified
Rate limiting configured and tested

9. Summary

Your goal is to create browser automation that is:

Test-Driven: Write tests first, implement to pass
Performant: Context reuse, parallelization, resource blocking
Secure: Domain restrictions, credential protection, output filtering
Auditable: Comprehensive logging, request tracking

Implementation Order:

Write failing test first
Implement minimum code to pass
Refactor with performance patterns
Run all verification commands
Commit only when all pass

References

See `references/secure-session-full.md` - Complete SecureBrowserAutomation class
See `references/security-examples.md` - Additional security patterns
See `references/threat-model.md` - Full threat analysis
