web-scraper
Web Scraper
Extract structured data from websites using BeautifulSoup and requests - turn any webpage into usable data.
When to Use This Skill
- Competitor research - Scrape pricing, features, positioning
- Lead generation - Extract contact info from directories
- Content audit - Pull headings, links, metadata
- Price monitoring - Track competitor pricing changes
- Data collection - Gather research data from multiple sources
What Claude Does vs What You Decide
| Claude Does | You Decide |
|---|---|
| Structures analysis frameworks | Strategic priorities |
| Synthesizes market data | Competitive positioning |
| Identifies opportunities | Resource allocation |
| Creates strategic options | Final strategy selection |
| Suggests implementation approaches | Execution decisions |
Dependencies
bash
pip install beautifulsoup4 requests pandas click lxml
Commands
Scrape Elements
bash
python scripts/main.py scrape https://example.com --selector "h1,h2,p"
python scripts/main.py scrape https://example.com --selector ".product-price"
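As a rough illustration of what a `scrape` call like the ones above might do internally, the sketch below runs a CSS selector over an HTML string with BeautifulSoup. The function name `scrape_elements` and the sample HTML are assumptions made for this example, not the actual contents of `scripts/main.py`.

```python
from bs4 import BeautifulSoup


def scrape_elements(html: str, selector: str) -> list[str]:
    """Return the stripped text of every element matching a CSS selector."""
    soup = BeautifulSoup(html, "html.parser")
    # select() accepts comma-separated selector groups such as "h1,h2,p".
    matches = soup.select(selector)
    return [el.get_text(strip=True) for el in matches]


# Hypothetical sample page. In practice the HTML would come from
# something like requests.get(url, timeout=10).text.
SAMPLE_HTML = """
<html><body>
  <h1>Pricing</h1>
  <p>Pick a plan.</p>
  <span class="product-price">$29</span>
  <span class="product-price">$99</span>
</body></html>
"""

headings_and_text = scrape_elements(SAMPLE_HTML, "h1,p")
prices = scrape_elements(SAMPLE_HTML, ".product-price")
print(headings_and_text)
print(prices)
```

Working on an HTML string rather than a URL keeps the parsing step separate from the network request, which makes it easier to test.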
Extract Links
bash
python scripts/main.py links https://example.com
python scripts/main.py links https://example.com --internal-only
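One plausible reading of the `links` command and its `--internal-only` flag is: collect every `<a href>`, resolve it against the page URL, and optionally keep only links on the same host. The sketch below follows that reading; `extract_links`, its parameters, and the sample HTML are hypothetical and may differ from the real script.

```python
from urllib.parse import urljoin, urlparse

from bs4 import BeautifulSoup


def extract_links(html: str, base_url: str, internal_only: bool = False) -> list[str]:
    """Return absolute http(s) links found in the page, without duplicates."""
    soup = BeautifulSoup(html, "html.parser")
    base_host = urlparse(base_url).netloc
    links: list[str] = []
    for anchor in soup.find_all("a", href=True):
        # Resolve relative paths such as "/about" against the page URL.
        url = urljoin(base_url, anchor["href"])
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar schemes
        if internal_only and parsed.netloc != base_host:
            continue  # assumed meaning of --internal-only: same host only
        if url not in links:
            links.append(url)
    return links


# Hypothetical sample markup for demonstration.
SAMPLE_HTML = """
<a href="/about">About</a>
<a href="https://other.example.org/x">Partner</a>
<a href="mailto:hi@example.com">Mail</a>
<a href="/about">About again</a>
"""

all_links = extract_links(SAMPLE_HTML, "https://example.com")
internal_links = extract_links(SAMPLE_HTML, "https://example.com", internal_only=True)
print(all_links)
print(internal_links)
```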
Extract Emails
bash
python scripts/main.py emails https://example.com
python scripts/main.py emails https://example.com --depth 2
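Email extraction of this kind is commonly done with a regular expression over the page source. The sketch below shows that approach on a single page; the pattern is a deliberately simple approximation, and `extract_emails` is a hypothetical name. The `--depth` flag presumably controls how many levels of links are followed, which this sketch does not attempt.

```python
import re

# Simplified pattern: good enough for typical addresses, not a full
# implementation of the email address grammar.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(text: str) -> list[str]:
    """Return unique, lowercased email addresses in order of first appearance."""
    seen: set[str] = set()
    emails: list[str] = []
    for match in EMAIL_RE.findall(text):
        email = match.lower()
        if email not in seen:
            seen.add(email)
            emails.append(email)
    return emails


# Hypothetical page source for demonstration.
SAMPLE_PAGE = (
    'Contact <a href="mailto:Sales@example.com">sales@example.com</a> '
    "or support@example.org."
)

found = extract_emails(SAMPLE_PAGE)
print(found)
```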
Extract Structured Data
bash
python scripts/main.py structured https://example.com/article --schema article
python scripts/main.py structured https://example.com/product --schema product
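To make the idea of a schema concrete, here is a minimal sketch of how an `article` schema could map page elements to fields. The field names mirror the example output in this document (`title`, `author`, `date`, `content`, `word_count`); which tags the real script reads is not documented, so the choices below (`<h1>`, `<meta name="author">`, `<time datetime>`, `<article>`) are assumptions.

```python
from bs4 import BeautifulSoup


def extract_article(html: str) -> dict:
    """Map common article markup to a flat dictionary (assumed tag choices)."""
    soup = BeautifulSoup(html, "html.parser")
    title_tag = soup.find("h1") or soup.find("title")
    author_meta = soup.find("meta", attrs={"name": "author"})
    time_tag = soup.find("time")
    # Prefer the <article> element; fall back to the whole body.
    body = soup.find("article") or soup.body
    content = body.get_text(" ", strip=True) if body else ""
    return {
        "title": title_tag.get_text(strip=True) if title_tag else None,
        "author": author_meta.get("content") if author_meta else None,
        "date": time_tag.get("datetime") if time_tag else None,
        "content": content,
        "word_count": len(content.split()),
    }


# Hypothetical article page for demonstration.
SAMPLE_HTML = """
<html><head><title>Blog</title><meta name="author" content="Jane Doe"></head>
<body><article>
  <h1>How to Scale Your Startup</h1>
  <time datetime="2024-01-15">Jan 15, 2024</time>
  <p>Start small. Grow steadily.</p>
</article></body></html>
"""

article = extract_article(SAMPLE_HTML)
print(article["title"], article["author"], article["date"], article["word_count"])
```

Returning `None` for missing fields, rather than raising, lets the caller decide how strict to be about incomplete pages.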
Examples
Example 1: Scrape Competitor Pricing
bash
python scripts/main.py scrape https://competitor.com/pricing --selector ".price,.plan-name"
Output:
Extracted 6 elements
1. Starter - $29/mo
2. Pro - $99/mo
3. Enterprise - Contact us
Example 2: Extract Article Content
bash
python scripts/main.py structured https://blog.example.com/post --schema article
Output: article_data.json
{
  "title": "How to Scale Your Startup",
  "author": "Jane Doe",
  "date": "2024-01-15",
  "content": "...",
  "word_count": 1523
}
CSS Selector Reference
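| Selector | Description | Example |
|---|---|---|
| tag | Element type | h1 |
| .class | Class name | .price |
| #id | Element ID | #main |
| tag.class | Tag with class | div.plan-name |
| [attr] | Has attribute | a[href] |
| parent > child | Direct child | ul > li |
| a, b | Multiple | h1,h2,p |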
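The selector kinds described in the table (element type, class name, element ID, tag with class, attribute, direct child, multiple) can be tried out directly with BeautifulSoup's `select()`. The markup and the specific selectors below are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for demonstration.
HTML = """
<div id="main">
  <h1>Plans</h1>
  <ul class="plans">
    <li class="plan"><a href="/starter">Starter</a></li>
    <li class="plan featured"><a href="/pro">Pro</a></li>
  </ul>
  <p class="plan">Contact us for Enterprise.</p>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

selectors = [
    "h1",        # element type
    ".plan",     # class name (matches any tag carrying the class)
    "#main",     # element ID
    "li.plan",   # tag with class
    "a[href]",   # has attribute
    "ul > li",   # direct child
    "h1, p",     # multiple selectors in one query
]

# Count how many elements each selector matches.
counts = {sel: len(soup.select(sel)) for sel in selectors}
for sel, n in counts.items():
    print(f"{sel!r}: {n} match(es)")
```

Note that `.plan` matches three elements here while `li.plan` matches two, which is the practical difference between a bare class selector and a tag-qualified one.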
Ethical Scraping Guidelines
- Check robots.txt - Respect the site's scraping policy
- Rate limit - Don't overload servers (1-2 req/sec)
- Identify yourself - Use descriptive User-Agent
- Cache requests - Don't re-scrape unchanged pages
- Terms of Service - Check if scraping is allowed
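Three of the guidelines above (robots.txt, rate limiting, and a descriptive User-Agent) can be expressed in a few lines using only the standard library. This is a sketch under stated assumptions: the User-Agent string, the `RateLimiter` class, and the sample robots.txt are hypothetical, and the actual HTTP request is left as a comment so the example runs offline.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical identifying User-Agent; replace with real contact details.
USER_AGENT = "example-research-bot/0.1 (contact: you@example.com)"
HEADERS = {"User-Agent": USER_AGENT}


def build_robots_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class RateLimiter:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval
        self._last = None  # monotonic timestamp of the previous request

    def wait(self) -> float:
        """Sleep if needed and return the number of seconds slept."""
        slept = 0.0
        if self._last is not None:
            remaining = self.min_interval - (time.monotonic() - self._last)
            if remaining > 0:
                time.sleep(remaining)
                slept = remaining
        self._last = time.monotonic()
        return slept


# Hypothetical robots.txt content for demonstration.
SAMPLE_ROBOTS = "User-agent: *\nDisallow: /private/\n"
robots = build_robots_parser(SAMPLE_ROBOTS)
limiter = RateLimiter(min_interval=0.5)  # 2 req/sec, the upper end of the guideline

urls = [
    "https://example.com/pricing",
    "https://example.com/private/data",
    "https://example.com/about",
]
to_fetch, skipped = [], []
for url in urls:
    if not robots.can_fetch(USER_AGENT, url):
        skipped.append(url)  # disallowed by robots.txt
        continue
    limiter.wait()
    # A real run would fetch here, e.g.
    # requests.get(url, headers=HEADERS, timeout=10)
    to_fetch.append(url)

print("fetch:", to_fetch)
print("skip:", skipped)
```

Caching and Terms of Service checks are not covered by this sketch; the latter in particular needs a human to read the site's terms.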
Skill Boundaries
What This Skill Does Well
- Structuring strategic analysis
- Identifying market opportunities
- Creating strategic frameworks
- Synthesizing competitive data
What This Skill Cannot Do
- Replace market research
- Guarantee strategic success
- Know proprietary competitor info
- Make executive decisions
Related Skills
- competitor-monitor - Monitor competitor changes
- pdf-extractor - Extract from PDFs
Skill Metadata
- Mode: centaur
yaml
category: automation
subcategory: data-extraction
dependencies: [beautifulsoup4, requests, pandas]
difficulty: intermediate
time_saved: 5+ hours/week