web-scraper

Web Scraper

Extract structured data from websites using BeautifulSoup and requests - turn any webpage into usable data.

When to Use This Skill

  • Competitor research - Scrape pricing, features, positioning
  • Lead generation - Extract contact info from directories
  • Content audit - Pull headings, links, meta data
  • Price monitoring - Track competitor pricing changes
  • Data collection - Gather research data from multiple sources

What Claude Does vs What You Decide

| Claude Does | You Decide |
|---|---|
| Structures analysis frameworks | Strategic priorities |
| Synthesizes market data | Competitive positioning |
| Identifies opportunities | Resource allocation |
| Creates strategic options | Final strategy selection |
| Suggests implementation approaches | Execution decisions |

Dependencies

bash
pip install beautifulsoup4 requests pandas click lxml

Commands

Scrape Elements

bash
python scripts/main.py scrape https://example.com --selector "h1,h2,p"
python scripts/main.py scrape https://example.com --selector ".product-price"
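
The source of scripts/main.py is not shown here, so the following is only a minimal sketch of how a `--selector` value could be applied with BeautifulSoup. The function name `scrape_elements` and the sample HTML are hypothetical; a real run would first download the page with requests.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page; a real run would fetch the HTML with requests.
SAMPLE_HTML = """
<html><body>
  <h1>Pricing</h1>
  <h2>Plans</h2>
  <p>Choose a plan.</p>
  <span class="product-price">$29</span>
</body></html>
"""


def scrape_elements(html: str, selector: str) -> list[str]:
    """Return the stripped text of every element matching a CSS selector."""
    soup = BeautifulSoup(html, "html.parser")
    # select() accepts comma-separated selectors and yields matches
    # in document order.
    return [el.get_text(strip=True) for el in soup.select(selector)]


headings_and_text = scrape_elements(SAMPLE_HTML, "h1,h2,p")
prices = scrape_elements(SAMPLE_HTML, ".product-price")

for i, text in enumerate(headings_and_text, start=1):
    print(f"{i}. {text}")
```

The built-in "html.parser" backend is used so the sketch runs without lxml; passing "lxml" instead should also work when that package is installed.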

Extract Links

bash
python scripts/main.py links https://example.com
python scripts/main.py links https://example.com --internal-only
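
How the `links` command decides what counts as internal is not documented above. One plausible approach, sketched here under that assumption, is to resolve every `href` against the page URL and compare hostnames. `extract_links` and the sample markup are hypothetical names for illustration.

```python
from urllib.parse import urljoin, urlparse

from bs4 import BeautifulSoup


def extract_links(html, base_url, internal_only=False):
    """Collect absolute http(s) links, optionally only those on the same host."""
    soup = BeautifulSoup(html, "html.parser")
    base_host = urlparse(base_url).netloc
    links = []
    for anchor in soup.select("a[href]"):
        url = urljoin(base_url, anchor["href"])  # resolve relative paths
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar schemes
        if internal_only and parsed.netloc != base_host:
            continue
        if url not in links:  # de-duplicate while keeping page order
            links.append(url)
    return links


# Hypothetical sample markup standing in for a downloaded page.
SAMPLE_HTML = """
<a href="/about">About</a>
<a href="https://example.com/blog">Blog</a>
<a href="https://other.example.org/">Partner</a>
<a href="mailto:hi@example.com">Mail</a>
"""

all_links = extract_links(SAMPLE_HTML, "https://example.com")
internal_links = extract_links(SAMPLE_HTML, "https://example.com", internal_only=True)
```

Comparing `netloc` exactly treats subdomains such as blog.example.com as external; whether the real script does the same is unknown.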

Extract Emails

bash
python scripts/main.py emails https://example.com
python scripts/main.py emails https://example.com --depth 2
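
Email extraction is commonly done with a regular expression over the page text. The sketch below assumes that approach; the pattern is a deliberately simple approximation, not a full address validator, and `extract_emails` is a hypothetical name. The crawling implied by `--depth` is not shown.

```python
import re

# Simplified pattern: good enough for typical addresses, not RFC-complete.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(text):
    """Return unique addresses in first-seen order, lowercased."""
    seen = []
    for match in EMAIL_RE.findall(text):
        email = match.lower()  # assumption: treat addresses case-insensitively
        if email not in seen:
            seen.append(email)
    return seen


# Hypothetical page fragment; the same address appears in href and link text.
SAMPLE = (
    'Contact <a href="mailto:Sales@example.com">sales@example.com</a> '
    "or support@example.org."
)
emails = extract_emails(SAMPLE)
```

Running the pattern over raw HTML, as here, also catches addresses that only appear inside `mailto:` attributes.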

Extract Structured Data

bash
python scripts/main.py structured https://example.com/article --schema article
python scripts/main.py structured https://example.com/product --schema product
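
What a schema maps to internally is not specified in this document. As a rough illustration, an `article` schema could be a set of selectors that fill the fields shown in Example 2 (title, author, date, content, word_count). The selectors and the `extract_article` helper below are assumptions chosen for the sample HTML; real sites will need different ones.

```python
from bs4 import BeautifulSoup


def extract_article(html):
    """Pull article fields from HTML using assumed, site-specific selectors."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.select_one("h1")
    author = soup.select_one('meta[name="author"]')
    date = soup.select_one("time[datetime]")
    body = soup.select_one("article")
    # Join text nodes with spaces so adjacent paragraphs do not run together.
    content = body.get_text(" ", strip=True) if body else ""
    return {
        "title": title.get_text(strip=True) if title else None,
        "author": author.get("content") if author else None,
        "date": date.get("datetime") if date else None,
        "content": content,
        "word_count": len(content.split()),
    }


# Hypothetical sample page.
SAMPLE_HTML = """
<html><head><meta name="author" content="Jane Doe"></head>
<body>
  <h1>How to Scale Your Startup</h1>
  <time datetime="2024-01-15">Jan 15, 2024</time>
  <article><p>Start small.</p><p>Measure everything.</p></article>
</body></html>
"""

article = extract_article(SAMPLE_HTML)
```

Missing elements yield `None` rather than raising, which keeps one malformed page from aborting a batch.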

Examples

Example 1: Scrape Competitor Pricing

bash
python scripts/main.py scrape https://competitor.com/pricing --selector ".price,.plan-name"

Output:

Extracted 6 elements
1. Starter - $29/mo
2. Pro - $99/mo
3. Enterprise - Contact us

Example 2: Extract Article Content

bash
python scripts/main.py structured https://blog.example.com/post --schema article

Output: article_data.json

{
  "title": "How to Scale Your Startup",
  "author": "Jane Doe",
  "date": "2024-01-15",
  "content": "...",
  "word_count": 1523
}

CSS Selector Reference

| Selector | Description | Example |
|---|---|---|
| tag | Element type | h1, p, div |
| .class | Class name | .price, .title |
| #id | Element ID | #main-content |
| tag.class | Tag with class | div.product |
| tag[attr] | Has attribute | a[href] |
| parent > child | Direct child | ul > li |
| tag1, tag2 | Multiple | h1, h2, h3 |
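
To check a selector before pointing it at a live site, it can help to try it on a small HTML snippet. The sketch below runs one selector of each kind from the reference through BeautifulSoup's `select()` and counts the matches; the sample markup is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup containing one case for each selector type.
SAMPLE_HTML = """
<div id="main-content">
  <h1 class="title">Shop</h1>
  <div class="product"><span class="price">$29</span></div>
  <ul><li>One</li><li>Two</li></ul>
  <a href="/cart">Cart</a><a>No link</a>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

selectors = [
    "h1",             # tag
    ".price",         # class name
    "#main-content",  # element ID
    "div.product",    # tag with class
    "a[href]",        # has attribute (the second <a> lacks href)
    "ul > li",        # direct child
    "h1, .price",     # multiple selectors
]

# Map each selector to the number of elements it matches.
counts = {sel: len(soup.select(sel)) for sel in selectors}

for sel, n in counts.items():
    print(f"{sel!r}: {n} match(es)")
```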

Ethical Scraping Guidelines

  1. Check robots.txt - Respect the site's scraping policy
  2. Rate limit - Don't overload servers (stay within 1-2 requests per second)
  3. Identify yourself - Use a descriptive User-Agent
  4. Cache responses - Don't re-scrape unchanged pages
  5. Terms of Service - Check whether scraping is allowed
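
The first three guidelines can be sketched with the standard library alone. This is an illustration under stated assumptions, not the skill's own implementation: the User-Agent string, the `RateLimiter` class, and the inline robots.txt are all hypothetical, and a real crawler would load robots.txt from the target site with `set_url()` and `read()`.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical identity; use a real name and contact address in practice.
USER_AGENT = "example-research-bot/0.1 (contact@example.com)"
MIN_INTERVAL = 1.0  # seconds between requests, i.e. at most 1 request/sec

# Inline robots.txt so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())


class RateLimiter:
    """Block until at least min_interval seconds have passed since the last call."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last = None

    def wait(self):
        if self._last is not None:
            remaining = self.min_interval - (time.monotonic() - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


def allowed(url):
    """Check the parsed robots.txt rules for our User-Agent."""
    return robots.can_fetch(USER_AGENT, url)


can_public = allowed("https://example.com/pricing")
can_private = allowed("https://example.com/private/report")

# Before each request: call limiter.wait(), confirm allowed(url), then e.g.
# requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=10)
limiter = RateLimiter(MIN_INTERVAL)
```

Caching and Terms of Service checks are left out; the latter in particular needs a human to read the site's terms.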

Skill Boundaries

What This Skill Does Well

  • Structuring strategic analysis
  • Identifying market opportunities
  • Creating strategic frameworks
  • Synthesizing competitive data

What This Skill Cannot Do

  • Replace market research
  • Guarantee strategic success
  • Know proprietary competitor info
  • Make executive decisions

Related Skills

  • competitor-monitor - Monitor competitor changes
  • pdf-extractor - Extract from PDFs

Skill Metadata

  • Mode: centaur
yaml
category: automation
subcategory: data-extraction
dependencies: [beautifulsoup4, requests, pandas]
difficulty: intermediate
time_saved: 5+ hours/week