web-scraping-automation

Website Scraping and API Automation

Feature Description

This skill automates website data crawling and API calls, including:
  • Analyzing and crawling website structures
  • Calling and testing REST/GraphQL APIs
  • Creating automated crawler scripts
  • Parsing and cleaning data
  • Handling anti-crawling mechanisms
  • Scheduling tasks and storing data

Usage Scenarios

  • "爬取这个网站的产品信息"
  • "帮我调用这个 API 并解析返回数据"
  • "创建一个脚本定时抓取新闻"
  • "分析这个网站的 API 接口文档"
  • "绕过这个网站的反爬虫限制"
  • "Scrape product information from this website"
  • "Help me call this API and parse the returned data"
  • "Create a script to regularly scrape news"
  • "Analyze the API interface documentation of this website"
  • "Bypass the anti-crawling restrictions of this website"

Technology Stack

Python Crawlers

  • requests: HTTP request library
  • BeautifulSoup4: HTML parsing
  • Scrapy: full-featured crawling framework
  • Selenium: browser automation
  • Playwright: modern browser automation (see the sketch below)
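
Playwright drives a real browser engine, which helps with pages that render their content in JavaScript. A minimal sketch, assuming playwright is installed (pip install playwright, then playwright install chromium); the url and selector arguments are placeholders to adapt to the target site:

```python
from playwright.sync_api import sync_playwright

def scrape_with_playwright(url, selector):
    """Render a JavaScript-heavy page and collect text from matching nodes."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        # Wait until the (placeholder) selector appears in the rendered DOM
        page.wait_for_selector(selector)
        texts = [el.inner_text() for el in page.query_selector_all(selector)]
        browser.close()
    return texts
```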

JavaScript Crawlers

  • axios: HTTP client
  • cheerio: server-side jQuery-style HTML parsing
  • puppeteer: Chrome automation
  • node-fetch: Fetch API for Node.js

Workflow

  1. Target Analysis:
    • Inspect the website structure and locate the target data
    • Analyze API endpoints and authentication methods
    • Evaluate anti-crawling mechanisms
  2. Solution Design:
    • Select an appropriate technology stack
    • Design the data extraction strategy
    • Plan error handling and retry mechanisms (a retry sketch follows this list)
  3. Script Development:
    • Write the crawler code
    • Implement the data parsing logic
    • Add logging and monitoring
  4. Testing and Optimization:
    • Verify data accuracy
    • Optimize performance and stability
    • Handle edge cases
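
To make the retry and logging steps concrete, here is a minimal sketch built on requests and the standard library; the retry count and backoff factor are illustrative assumptions, not fixed recommendations:

```python
import logging
import time

import requests

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger('crawler')

def fetch_with_retries(url, max_retries=3, backoff=2.0):
    """GET a URL, retrying on network/HTTP errors with exponential backoff."""
    for attempt in range(1, max_retries + 1):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response
        except requests.RequestException as exc:
            logger.warning('Attempt %d/%d for %s failed: %s',
                           attempt, max_retries, url, exc)
            if attempt == max_retries:
                raise  # out of retries; let the caller decide what to do
            time.sleep(backoff ** attempt)  # waits 2 s, 4 s, 8 s, ...
```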

Best Practices

  • Comply with robots.txt rules (see the sketch after this list)
  • Set reasonable request intervals
  • Send a realistic User-Agent and request headers
  • Implement error retry mechanisms
  • Deduplicate and validate data
  • Use proxy pools (if needed)
  • Save raw data and logs
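
A minimal sketch of the first two practices, checking robots.txt and spacing out requests, using the standard library's urllib.robotparser; the user agent string and delay value are placeholder assumptions:

```python
import time
from urllib import robotparser
from urllib.parse import urljoin, urlparse

import requests

USER_AGENT = 'MyCrawler/1.0'  # placeholder; identify your crawler honestly
REQUEST_DELAY = 1.0           # seconds between requests (illustrative)

_robots_cache = {}

def polite_get(url):
    """GET a URL only if robots.txt allows it, pausing between requests."""
    parts = urlparse(url)
    root = f'{parts.scheme}://{parts.netloc}'
    if root not in _robots_cache:
        parser = robotparser.RobotFileParser(urljoin(root, '/robots.txt'))
        parser.read()
        _robots_cache[root] = parser
    if not _robots_cache[root].can_fetch(USER_AGENT, url):
        raise PermissionError(f'robots.txt disallows fetching {url}')
    time.sleep(REQUEST_DELAY)  # fixed interval; adapt per site if needed
    return requests.get(url, headers={'User-Agent': USER_AGENT}, timeout=10)
```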

Common Scenario Examples

1. Simple Web Scraping

```python
import requests
from bs4 import BeautifulSoup

def scrape_website(url):
    headers = {'User-Agent': 'Mozilla/5.0'}
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # stop early on HTTP errors
    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract data ('.product', '.title', '.price' are placeholder selectors)
    data = []
    for item in soup.select('.product'):
        title = item.select_one('.title')
        price = item.select_one('.price')
        if title and price:  # skip malformed items instead of crashing
            data.append({'title': title.text.strip(),
                         'price': price.text.strip()})
    return data
```

2. API Calling

```python
import requests

def call_api(endpoint, params=None):
    headers = {
        'Authorization': 'Bearer YOUR_TOKEN',  # replace with a real token
        'Accept': 'application/json'  # a GET has no body, so request JSON via Accept
    }
    response = requests.get(endpoint, headers=headers,
                            params=params, timeout=10)
    response.raise_for_status()  # surface 4xx/5xx instead of parsing an error page
    return response.json()
```
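
A hedged usage sketch on top of call_api, following cursor-based pagination; the items and next_cursor fields are hypothetical and depend entirely on the target API's response format:

```python
def fetch_all_pages(endpoint):
    """Follow a hypothetical cursor-based pagination scheme until exhausted."""
    results, cursor = [], None
    while True:
        params = {'cursor': cursor} if cursor else None
        page = call_api(endpoint, params=params)
        results.extend(page.get('items', []))  # 'items' is an assumed field
        cursor = page.get('next_cursor')       # assumed pagination field
        if not cursor:
            break
    return results
```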

3. Dynamic Web Scraping

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

def scrape_dynamic_page(url):
    driver = webdriver.Chrome()
    try:
        driver.get(url)

        # Wait up to 10 s for elements to appear while the page renders
        driver.implicitly_wait(10)

        # Extract data ('item' is a placeholder class name)
        elements = driver.find_elements(By.CLASS_NAME, 'item')
        data = [elem.text for elem in elements]
    finally:
        driver.quit()  # always release the browser, even on errors
    return data
```
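
Implicit waits apply one global timeout to every lookup; when a specific element signals that the page is ready, Selenium's explicit waits tend to be more reliable. A sketch, with the item class name again a placeholder:

```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

def wait_for_items(driver, timeout=10):
    """Block until at least one 'item' element exists, then collect them all."""
    WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.CLASS_NAME, 'item'))
    )
    return driver.find_elements(By.CLASS_NAME, 'item')
```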

Anti-Crawling Countermeasures

  • Request Header Spoofing: Simulate a real browser (see the sketch after this list)
  • Proxy Rotation: Use proxy pools
  • Captcha Handling: OCR or third-party services
  • Cookie Management: Maintain session state
  • Request Frequency Control: Avoid triggering rate limits
  • JavaScript Rendering: Use Selenium/Playwright
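
A minimal sketch combining the first, second, and fourth techniques: browser-like headers, a cookie-persisting session, and round-robin proxy rotation. The header values and proxy URLs are placeholder assumptions:

```python
import itertools

import requests

# Placeholder proxy pool; a real pool comes from a provider or your own hosts
PROXIES = itertools.cycle([
    'http://proxy1.example.com:8080',
    'http://proxy2.example.com:8080',
])

BROWSER_HEADERS = {  # illustrative values mimicking a desktop Chrome request
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/120.0.0.0 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml',
    'Accept-Language': 'en-US,en;q=0.9',
}

session = requests.Session()  # a Session persists cookies between requests
session.headers.update(BROWSER_HEADERS)

def get_via_next_proxy(url):
    """Send each request through the next proxy in the rotation."""
    proxy = next(PROXIES)
    return session.get(url, proxies={'http': proxy, 'https': proxy},
                       timeout=10)
```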

Data Storage Solutions

  • CSV/Excel: Simple data export (see the sketch after this list)
  • JSON: Structured data storage
  • Databases: MySQL, PostgreSQL, MongoDB
  • Cloud Storage: S3, OSS
  • Data Warehouses: For large-scale data analysis
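
A sketch of the two simplest options, writing scraped records (dicts like those returned by scrape_website above) to CSV and JSON using only the standard library; the file paths are placeholders:

```python
import csv
import json

def save_records(records, csv_path='output.csv', json_path='output.json'):
    """Persist a list of dicts to both CSV and JSON (paths are placeholders)."""
    if not records:
        return
    # CSV: header row comes from the keys of the first record
    with open(csv_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=list(records[0]))
        writer.writeheader()
        writer.writerows(records)
    # JSON: keeps the nested structure intact for later re-processing
    with open(json_path, 'w', encoding='utf-8') as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
```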