web-scraper

Web Scraper

Extract structured data from websites using BeautifulSoup and requests - turn any webpage into usable data.

When to Use This Skill

Competitor research - Scrape pricing, features, positioning
Lead generation - Extract contact info from directories
Content audit - Pull headings, links, metadata
Price monitoring - Track competitor pricing changes
Data collection - Gather research data from multiple sources

What Claude Does vs What You Decide

Claude Does	You Decide
Structures analysis frameworks	Strategic priorities
Synthesizes market data	Competitive positioning
Identifies opportunities	Resource allocation
Creates strategic options	Final strategy selection
Suggests implementation approaches	Execution decisions

Dependencies

bash

pip install beautifulsoup4 requests pandas click lxml

Commands

Scrape Elements

bash

python scripts/main.py scrape https://example.com --selector "h1,h2,p"
python scripts/main.py scrape https://example.com --selector ".product-price"
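
The internals of `scripts/main.py` are not shown here, so the following is only a minimal sketch of how a `--selector` value might map onto BeautifulSoup's `select()`. The helper name `select_text` and the sample HTML are illustrative assumptions; in a real run the HTML would come from something like `requests.get(url).text`.

```python
from bs4 import BeautifulSoup


def select_text(html: str, selector: str) -> list[str]:
    """Return the stripped text of every element matching a CSS selector."""
    soup = BeautifulSoup(html, "html.parser")
    return [el.get_text(strip=True) for el in soup.select(selector)]


# Hypothetical sample page, used instead of a live request.
SAMPLE_HTML = """
<html><body>
  <h1>Pricing</h1>
  <p>Pick a plan.</p>
  <span class="product-price">$29/mo</span>
  <span class="product-price">$99/mo</span>
</body></html>
"""

headings_and_text = select_text(SAMPLE_HTML, "h1,h2,p")
prices = select_text(SAMPLE_HTML, ".product-price")
print(headings_and_text)
print(prices)
```

Matches come back in document order, so a comma-separated selector interleaves the element types as they appear on the page.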

Extract Links

bash

python scripts/main.py links https://example.com
python scripts/main.py links https://example.com --internal-only
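
One plausible reading of `--internal-only` is "keep links whose host matches the page's host". The sketch below works under that assumption; `extract_links` and the sample markup are hypothetical and are not taken from `scripts/main.py`.

```python
from urllib.parse import urljoin, urlparse

from bs4 import BeautifulSoup


def extract_links(html: str, base_url: str, internal_only: bool = False) -> list[str]:
    """Collect absolute http(s) links, optionally keeping only same-host links."""
    soup = BeautifulSoup(html, "html.parser")
    base_host = urlparse(base_url).netloc
    links: list[str] = []
    for a in soup.select("a[href]"):
        url = urljoin(base_url, a["href"])  # resolve relative hrefs
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar
        if internal_only and parsed.netloc != base_host:
            continue
        if url not in links:  # de-duplicate while keeping order
            links.append(url)
    return links


# Hypothetical markup standing in for a fetched page.
SAMPLE_HTML = """
<a href="/about">About</a>
<a href="https://example.com/pricing">Pricing</a>
<a href="https://other.example.org/">Partner</a>
<a href="mailto:hi@example.com">Mail</a>
"""

all_links = extract_links(SAMPLE_HTML, "https://example.com")
internal_links = extract_links(SAMPLE_HTML, "https://example.com", internal_only=True)
print(all_links)
print(internal_links)
```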

Extract Emails

bash

python scripts/main.py emails https://example.com
python scripts/main.py emails https://example.com --depth 2
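
Email extraction is commonly done with a regular expression over the page text. The pattern below is a deliberately simple approximation (it will miss some valid addresses and accept some invalid ones), and `extract_emails` is a hypothetical helper, not the script's actual code. The `--depth` option presumably controls how many link levels are followed; that crawling step is not sketched here.

```python
import re

# Simple, permissive pattern; an assumption, not a full RFC 5322 matcher.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(text: str) -> list[str]:
    """Return unique email-like strings, lowercased, in first-seen order."""
    seen: list[str] = []
    for match in EMAIL_RE.findall(text):
        email = match.lower()
        if email not in seen:
            seen.append(email)
    return seen


sample = "Contact sales@example.com or Support@Example.com. Again: sales@example.com"
emails = extract_emails(sample)
print(emails)
```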

Extract Structured Data

bash

python scripts/main.py structured https://example.com/article --schema article
python scripts/main.py structured https://example.com/product --schema product
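
What each `--schema` extracts is not documented here. As a rough illustration only, an `article` schema could be approximated by pulling a few well-known elements from the page. The field choices, the selectors, and the `extract_article` name below are all assumptions.

```python
from bs4 import BeautifulSoup


def extract_article(html: str) -> dict:
    """Build an article-like record from common HTML landmarks (assumed layout)."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.select_one("h1")
    author = soup.select_one('meta[name="author"]')
    published = soup.select_one("time[datetime]")
    # Join paragraph text inside <article> as the body content.
    content = " ".join(p.get_text(strip=True) for p in soup.select("article p"))
    return {
        "title": title.get_text(strip=True) if title else None,
        "author": author.get("content") if author else None,
        "date": published["datetime"] if published else None,
        "content": content,
        "word_count": len(content.split()),
    }


# Hypothetical page used for illustration.
SAMPLE_HTML = """
<html><head><meta name="author" content="Jane Doe"></head>
<body><article>
  <h1>How to Scale Your Startup</h1>
  <time datetime="2024-01-15">Jan 15, 2024</time>
  <p>Start small. Measure everything.</p>
</article></body></html>
"""

record = extract_article(SAMPLE_HTML)
print(record)
```

Real pages vary widely, so any such schema would likely need per-site selectors or fallbacks (for example Open Graph meta tags) in practice.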

Examples

Example 1: Scrape Competitor Pricing

bash

python scripts/main.py scrape https://competitor.com/pricing --selector ".price,.plan-name"

Output:

Extracted 6 elements
1. Starter - $29/mo
2. Pro - $99/mo
3. Enterprise - Contact us

Example 2: Extract Article Content

bash

python scripts/main.py structured https://blog.example.com/post --schema article

Output: article_data.json

{
  "title": "How to Scale Your Startup",
  "author": "Jane Doe",
  "date": "2024-01-15",
  "content": "...",
  "word_count": 1523
}

CSS Selector Reference

Selector	Description	Example
`tag`	Element type	`h1` , `p` , `div`
`.class`	Class name	`.price` , `.title`
`#id`	Element ID	`#main-content`
`tag.class`	Tag with class	`div.product`
`tag[attr]`	Has attribute	`a[href]`
`parent > child`	Direct child	`ul > li`
`tag1, tag2`	Multiple	`h1, h2, h3`
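
To make the table concrete, the sketch below runs one selector of each kind against a small, made-up HTML fragment using BeautifulSoup's `select()` and counts the matches. The markup is hypothetical and chosen so that each row of the table has something to match.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment covering every selector form in the table.
HTML = """
<div id="main-content">
  <h1 class="title">Plans</h1>
  <ul>
    <li><a href="/starter">Starter</a></li>
    <li><a>No link</a></li>
  </ul>
  <div class="product"><span class="price">$29</span></div>
  <p class="product">Not a div</p>
</div>
"""
soup = BeautifulSoup(HTML, "html.parser")

selectors = [
    "h1",             # element type
    ".price",         # class name
    "#main-content",  # element ID
    "div.product",    # tag with class (the <p class="product"> is excluded)
    "a[href]",        # has attribute (the <a> without href is excluded)
    "ul > li",        # direct child
    "h1, p",          # multiple selectors
]
counts = {sel: len(soup.select(sel)) for sel in selectors}
for sel, n in counts.items():
    print(f"{sel!r}: {n} match(es)")
```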

Ethical Scraping Guidelines

Check robots.txt - Respect the site's scraping policy
Rate limit - Don't overload servers (1-2 req/sec)
Identify yourself - Use descriptive User-Agent
Cache requests - Don't re-scrape unchanged pages
Terms of Service - Check if scraping is allowed
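
Several of the guidelines above can be expressed in code. The sketch below uses the standard library's `urllib.robotparser` to check a robots.txt policy and a small limiter to space requests out. The user-agent string, the sample robots.txt, and the `RateLimiter` class are hypothetical; a real scraper would load robots.txt from the target site and pass the same user agent in its request headers.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical descriptive User-Agent; replace with your own contact details.
USER_AGENT = "example-research-bot/1.0 (+https://example.com/bot-info)"
MIN_INTERVAL = 1.0  # seconds between requests, roughly 1 req/sec


def build_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class RateLimiter:
    """Enforce a minimum delay between consecutive calls to wait()."""

    def __init__(self, min_interval: float) -> None:
        self.min_interval = min_interval
        self._last = None  # time of the previous request, if any

    def wait(self) -> float:
        """Sleep if the last call was too recent; return the seconds slept."""
        delay = 0.0
        if self._last is not None:
            elapsed = time.monotonic() - self._last
            delay = max(0.0, self.min_interval - elapsed)
        if delay:
            time.sleep(delay)
        self._last = time.monotonic()
        return delay


# Sample policy standing in for https://example.com/robots.txt.
SAMPLE_ROBOTS = """
User-agent: *
Disallow: /private/
"""

robots = build_robots(SAMPLE_ROBOTS)
allowed_pricing = robots.can_fetch(USER_AGENT, "https://example.com/pricing")
allowed_private = robots.can_fetch(USER_AGENT, "https://example.com/private/data")
print(allowed_pricing, allowed_private)
```

In use, a scraper would call `RateLimiter(MIN_INTERVAL).wait()` before each request and skip any URL for which `can_fetch` returns `False`.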

Skill Boundaries

What This Skill Does Well

Structuring strategic analysis
Identifying market opportunities
Creating strategic frameworks
Synthesizing competitive data

What This Skill Cannot Do

Replace market research
Guarantee strategic success
Know proprietary competitor info
Make executive decisions

Related Skills

Skill Metadata

Mode: centaur

yaml

category: automation
subcategory: data-extraction
dependencies: [beautifulsoup4, requests, pandas]
difficulty: intermediate
time_saved: 5+ hours/week
