# data-base
## Mental Model
Data acquisition converts unstructured web content into structured data. Choose a tool based on page complexity: JS-heavy pages → chrome-devtools MCP; static pages → Python requests.
## Tool Selection
| Page Type | Tool | When to Use |
|---|---|---|
| Dynamic (JS-rendered, SPAs) | chrome-devtools MCP | React/Vue apps, infinite scroll, login gates |
| Static HTML | Python requests | Blogs, news sites, simple pages |
| Complex/reusable logic | Python script | Multi-step scraping, rate limiting, proxies |
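For the static-HTML row of the table, a minimal sketch using requests and beautifulsoup4 (both installed by setup.sh). The URL handling is generic, but the User-Agent contact address and the h2.title selector are hypothetical and site-specific:

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical bot identity -- replace the contact address with your own.
HEADERS = {"User-Agent": "data-base-bot/1.0 (+mailto:you@example.com)"}

def extract_titles(html: str) -> list[str]:
    """Pull article titles out of raw HTML (the CSS selector is site-specific)."""
    soup = BeautifulSoup(html, "html.parser")
    return [h.get_text(strip=True) for h in soup.select("h2.title")]

def scrape(url: str) -> list[str]:
    """Fetch a static page and return its titles as structured data."""
    resp = requests.get(url, headers=HEADERS, timeout=10)
    resp.raise_for_status()
    return extract_titles(resp.text)
```

On a JS-rendered page this would only see the pre-render shell, which is why the table routes SPAs to chrome-devtools MCP instead.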
## Anti-Patterns (NEVER)
- Don't scrape without checking robots.txt
- Don't overload servers (default: 1 request/sec)
- Don't scrape personal data without consent
- Don't use Chinese characters in output filenames (ASCII only)
- Don't forget to identify your bot with a User-Agent header
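The robots.txt and rate-limit rules above can be enforced with a small helper. This is a sketch built on the stdlib urllib.robotparser, with the 1 request/sec default from the list; the class name and user-agent string are illustrative:

```python
import time
from urllib import robotparser

class PoliteFetcher:
    """Check robots.txt and enforce a fixed minimum interval between requests."""

    def __init__(self, robots_url: str, user_agent: str, min_interval: float = 1.0):
        self.user_agent = user_agent
        self.min_interval = min_interval  # seconds between requests (default 1 req/sec)
        self._last = 0.0
        self.parser = robotparser.RobotFileParser(robots_url)
        # Call self.parser.read() before scraping to fetch the live robots.txt.

    def allowed(self, url: str) -> bool:
        """True if robots.txt permits this user agent to fetch the URL."""
        return self.parser.can_fetch(self.user_agent, url)

    def throttle(self) -> None:
        """Sleep just long enough to keep at least min_interval between requests."""
        wait = self.min_interval - (time.monotonic() - self._last)
        if wait > 0:
            time.sleep(wait)
        self._last = time.monotonic()
```

Call `allowed()` before each request and `throttle()` around it; skipping either is exactly the anti-pattern the list warns about.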
## Output Format
- JSON: Nested/hierarchical data
- CSV: Tabular data
- Filename: {source}_{timestamp}.{ext} (ASCII only, e.g., news_20250115.csv)
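The format and filename rules above can be sketched as a save helper. This is an assumption-laden sketch, not part of the skill's actual API: the `save` and `output_path` names are hypothetical, and the flat-vs-nested test is one simple way to route records to CSV or JSON:

```python
import csv
import json
import re
import time
from pathlib import Path

def output_path(source: str, ext: str, out_dir: str = ".") -> Path:
    """Build an ASCII-only {source}_{timestamp}.{ext} filename."""
    safe = re.sub(r"[^A-Za-z0-9_-]", "_", source)  # non-ASCII chars become '_'
    stamp = time.strftime("%Y%m%d")
    return Path(out_dir) / f"{safe}_{stamp}.{ext}"

def save(records: list[dict], source: str, out_dir: str = ".") -> Path:
    """Tabular (flat) records -> CSV; nested records -> JSON."""
    flat = all(not isinstance(v, (dict, list))
               for r in records for v in r.values())
    path = output_path(source, "csv" if flat else "json", out_dir)
    if flat:
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=list(records[0].keys()))
            writer.writeheader()
            writer.writerows(records)
    else:
        path.write_text(json.dumps(records, ensure_ascii=False, indent=2),
                        encoding="utf-8")
    return path
```

Note that only the filename is forced to ASCII; the file contents stay UTF-8, so non-ASCII field values survive intact.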
## Workflow
1. Ask: What data? Which sites? How much?
2. Select a tool based on page type
3. Extract and save structured data
4. Deliver the file path to the user, or pass it to data-analysis
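The tool-selection step can be sketched as a simple dispatch over the page characteristics from the table above (the function name and return labels are illustrative, not a real API):

```python
def choose_tool(js_rendered: bool, multi_step: bool) -> str:
    """Map page characteristics to a tool, following the selection table."""
    if multi_step:
        return "python-script"       # complex/reusable logic: rate limits, proxies
    if js_rendered:
        return "chrome-devtools-mcp"  # dynamic pages: SPAs, infinite scroll, logins
    return "requests"                 # static HTML: blogs, news, simple pages
```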
## Python Environment
Auto-initialize the virtual environment if needed, then execute:

```bash
cd skills/data-base
if [ ! -f ".venv/bin/python" ]; then
  echo "Creating Python environment..."
  ./setup.sh
fi
.venv/bin/python your_script.py
```

The setup script auto-installs: requests, beautifulsoup4, pandas, and web scraping tools.
## References (load on demand)
For detailed APIs and templates, load: references/REFERENCE.md, references/templates.md