web-scraping-automation
Website Scraping and API Automation
Feature Description
This skill is designed for automated website data scraping and API calls, including:
- Analyzing and crawling website structures
- Calling and testing REST/GraphQL APIs
- Creating automated crawler scripts
- Parsing and cleaning data
- Handling anti-crawling mechanisms
- Scheduling tasks and storing data (a scheduling sketch follows this list)
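As a quick illustration of the scheduling capability, here is a minimal sketch of a recurring scrape. It assumes the third-party `schedule` package (`pip install schedule`); `scrape_news` is a hypothetical placeholder for any scraping routine, such as the examples later in this document.

```python
# Minimal scheduled-scraping sketch; assumes the third-party `schedule`
# package is installed. scrape_news is a hypothetical placeholder.
import time

import schedule

def scrape_news():
    # Replace with a real scraping routine (see the examples below).
    print("Scraping news...")

schedule.every(1).hours.do(scrape_news)  # run once per hour

while True:
    schedule.run_pending()
    time.sleep(60)  # poll for due jobs once a minute
```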
Usage Scenarios
- "爬取这个网站的产品信息"
- "帮我调用这个 API 并解析返回数据"
- "创建一个脚本定时抓取新闻"
- "分析这个网站的 API 接口文档"
- "绕过这个网站的反爬虫限制"
- "Scrape product information from this website"
- "Help me call this API and parse the returned data"
- "Create a script to regularly scrape news"
- "Analyze the API interface documentation of this website"
- "Bypass the anti-crawling restrictions of this website"
Technology Stack
Python Crawlers
- requests: HTTP request library
- BeautifulSoup4: HTML parsing
- Scrapy: Professional crawler framework (see the minimal spider sketch after this list)
- Selenium: Browser automation
- Playwright: Modern browser automation
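To show how Scrapy structures a crawler, here is a minimal spider sketch; the start URL and the `.product`, `.title`, `.price`, and `a.next` selectors are illustrative assumptions, not a real site.

```python
# Minimal Scrapy spider sketch; the URL and CSS selectors are hypothetical.
import scrapy

class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/products"]

    def parse(self, response):
        # Yield one record per product card on the page.
        for item in response.css(".product"):
            yield {
                "title": item.css(".title::text").get(),
                "price": item.css(".price::text").get(),
            }
        # Follow pagination, if the site exposes a "next" link.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

Saved as `spider.py`, this can be run without a full project via `scrapy runspider spider.py -o products.json`.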
JavaScript Crawlers
- axios: HTTP client
- cheerio: Server-side jQuery
- puppeteer: Chrome automation
- node-fetch: Fetch API for Node.js
Workflow
1. Target Analysis:
   - Check website structure and data locations
   - Analyze API interfaces and authentication methods
   - Evaluate anti-crawling mechanisms
2. Solution Design:
   - Select an appropriate technology stack
   - Design data extraction strategies
   - Plan error handling and retry mechanisms (see the retry sketch after this list)
3. Script Development:
   - Write crawler code
   - Implement data parsing logic
   - Add logging and monitoring
4. Testing and Optimization:
   - Verify data accuracy
   - Optimize performance and stability
   - Handle edge cases
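Step 2's retry planning and step 3's logging tend to share one helper; here is a sketch of that pattern using `requests`, with a placeholder URL and illustrative retry defaults.

```python
# Retry-with-backoff and logging sketch; the retry count and
# backoff base are illustrative defaults, not recommendations.
import logging
import time

import requests

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("crawler")

def fetch_with_retry(url, max_retries=3, backoff=2.0):
    """GET a URL, retrying with exponential backoff on failure."""
    for attempt in range(1, max_retries + 1):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response
        except requests.RequestException as exc:
            logger.warning("Attempt %d/%d for %s failed: %s",
                           attempt, max_retries, url, exc)
            if attempt == max_retries:
                raise
            time.sleep(backoff ** attempt)  # sleep 2s, 4s, 8s, ...
```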
Best Practices
- Comply with robots.txt rules (see the sketch after this list)
- Set reasonable request intervals
- Send a realistic User-Agent and request headers
- Implement error retry mechanisms
- Deduplicate and validate data
- Use proxy pools (if needed)
- Save raw data and logs
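The first two practices can be enforced in code; here is a sketch using the standard library's `urllib.robotparser` together with `requests`. The user agent string, robots.txt location, and delay value are illustrative.

```python
# robots.txt compliance plus polite pacing; the user agent,
# robots.txt location, and delay value are illustrative.
import time
from urllib import robotparser

import requests

USER_AGENT = "MyCrawler/1.0"

def polite_fetch(url, robots_url, delay=1.0):
    """Fetch url only if robots.txt allows it, then pause briefly."""
    rp = robotparser.RobotFileParser()
    rp.set_url(robots_url)  # e.g. "https://example.com/robots.txt"
    rp.read()
    if not rp.can_fetch(USER_AGENT, url):
        raise PermissionError(f"robots.txt disallows fetching {url}")
    response = requests.get(url, headers={"User-Agent": USER_AGENT})
    time.sleep(delay)  # keep a reasonable interval between requests
    return response
```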
Common Scenario Examples
1. Simple Web Scraping
```python
import requests
from bs4 import BeautifulSoup

def scrape_website(url):
    headers = {'User-Agent': 'Mozilla/5.0'}
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    # Extract data from each product card
    data = []
    for item in soup.select('.product'):
        title = item.select_one('.title')
        price = item.select_one('.price')
        data.append({
            'title': title.text if title else None,
            'price': price.text if price else None,
        })
    return data
```
2. API Calling
```python
import requests

def call_api(endpoint, params=None):
    headers = {
        'Authorization': 'Bearer YOUR_TOKEN',  # replace with a real token
        'Content-Type': 'application/json'
    }
    response = requests.get(endpoint, headers=headers, params=params)
    response.raise_for_status()  # surface HTTP errors early
    return response.json()
```
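Since the stack above covers GraphQL as well as REST, note that a GraphQL call is simply an HTTP POST with a JSON body containing the query. A minimal sketch, with a hypothetical endpoint and schema:

```python
# GraphQL calls are HTTP POSTs with a JSON body; the endpoint URL
# and the query's field names are hypothetical.
import requests

def call_graphql(endpoint, query, variables=None):
    headers = {'Authorization': 'Bearer YOUR_TOKEN'}
    response = requests.post(
        endpoint,
        json={'query': query, 'variables': variables or {}},
        headers=headers,
    )
    response.raise_for_status()
    return response.json()['data']

# Hypothetical usage:
# query = "query ($n: Int!) { products(first: $n) { title price } }"
# data = call_graphql('https://example.com/graphql', query, {'n': 10})
```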
3. Dynamic Web Scraping
```python
from selenium import webdriver
from selenium.webdriver.common.by import By

def scrape_dynamic_page(url):
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Wait up to 10 seconds for elements to appear
        driver.implicitly_wait(10)
        # Extract data
        elements = driver.find_elements(By.CLASS_NAME, 'item')
        data = [elem.text for elem in elements]
    finally:
        driver.quit()  # always release the browser, even on errors
    return data
```
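Playwright, also listed in the stack, handles the same job with waiting built in; here is a sketch using its synchronous API, with the same hypothetical `.item` selector:

```python
# Playwright alternative to the Selenium example above; the '.item'
# selector is a hypothetical placeholder.
from playwright.sync_api import sync_playwright

def scrape_with_playwright(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)  # waits for the page's load event by default
        data = [el.inner_text() for el in page.query_selector_all('.item')]
        browser.close()
    return data
```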
Anti-Crawling Countermeasures
- Request Header Spoofing: Simulate a real browser (see the session sketch after this list)
- Proxy Rotation: Use proxy pools
- Captcha Handling: OCR or third-party services
- Cookie Management: Maintain session state
- Request Frequency Control: Avoid triggering restrictions
- JavaScript Rendering: Use Selenium/Playwright
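Two of these measures, header spoofing and cookie management, combine naturally in a `requests.Session`; a sketch with illustrative header values and a placeholder proxy address:

```python
# Header spoofing plus session-based cookie management; the header
# values and proxy address are illustrative placeholders.
import requests

session = requests.Session()  # persists cookies across requests
session.headers.update({
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/120.0.0.0 Safari/537.36'),
    'Accept-Language': 'en-US,en;q=0.9',
    'Referer': 'https://example.com/',
})

# Optional proxy rotation: pick one proxy per request from a pool.
proxies = {'http': 'http://127.0.0.1:8080',
           'https': 'http://127.0.0.1:8080'}
# response = session.get('https://example.com/page', proxies=proxies)
```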
Data Storage Solutions
- CSV/Excel: Simple data export (see the storage sketch after this list)
- JSON: Structured data storage
- Databases: MySQL, PostgreSQL, MongoDB
- Cloud Storage: S3, OSS
- Data Warehouses: For large-scale data analysis
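For the two simplest options, CSV and JSON, here is a standard-library sketch; `records` stands in for whatever a scraper returned:

```python
# CSV and JSON export using only the standard library;
# `records` is placeholder scraped data.
import csv
import json

records = [{'title': 'Example product', 'price': '9.99'}]

# JSON: preserves nested structure and round-trips cleanly
with open('products.json', 'w', encoding='utf-8') as f:
    json.dump(records, f, ensure_ascii=False, indent=2)

# CSV: flat export that opens directly in Excel
with open('products.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['title', 'price'])
    writer.writeheader()
    writer.writerows(records)
```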