# data-base
## Mental Model
Data acquisition converts unstructured web content into structured data. Choose a tool based on page complexity: JS-heavy pages → chrome-devtools MCP; static pages → Python requests.
## Tool Selection
| Page Type | Tool | When to Use |
|---|---|---|
| Dynamic (JS-rendered, SPAs) | chrome-devtools MCP | React/Vue apps, infinite scroll, login gates |
| Static HTML | Python requests | Blogs, news sites, simple pages |
| Complex/reusable logic | Python script | Multi-step scraping, rate limiting, proxies |
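For the static-HTML row of the table, a minimal sketch using requests and beautifulsoup4 (both installed by setup.sh). The URL handling is generic, but the User-Agent contact address and the h2.title selector are hypothetical and site-specific:

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical bot identity -- replace the contact address with your own.
HEADERS = {"User-Agent": "data-base-bot/1.0 (+mailto:you@example.com)"}

def extract_titles(html: str) -> list[str]:
    """Pull article titles out of raw HTML (the CSS selector is site-specific)."""
    soup = BeautifulSoup(html, "html.parser")
    return [h.get_text(strip=True) for h in soup.select("h2.title")]

def scrape(url: str) -> list[str]:
    """Fetch a static page and return its titles as structured data."""
    resp = requests.get(url, headers=HEADERS, timeout=10)
    resp.raise_for_status()
    return extract_titles(resp.text)
```

On a JS-rendered page this would only see the pre-render shell, which is why the table routes SPAs to chrome-devtools MCP instead.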
## Anti-Patterns (NEVER)
- Don't scrape without checking robots.txt
- Don't overload servers (default: 1 request/sec)
- Don't scrape personal data without consent
- Don't use Chinese characters in output filenames (ASCII only)
- Don't forget to identify your bot with a User-Agent header
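The robots.txt and rate-limit rules above can be enforced with a small helper. This is a sketch built on the stdlib urllib.robotparser, with the 1 request/sec default from the list; the class name and user-agent string are illustrative:

```python
import time
from urllib import robotparser

class PoliteFetcher:
    """Check robots.txt and enforce a fixed minimum interval between requests."""

    def __init__(self, robots_url: str, user_agent: str, min_interval: float = 1.0):
        self.user_agent = user_agent
        self.min_interval = min_interval  # seconds between requests (default 1 req/sec)
        self._last = 0.0
        self.parser = robotparser.RobotFileParser(robots_url)
        # Call self.parser.read() before scraping to fetch the live robots.txt.

    def allowed(self, url: str) -> bool:
        """True if robots.txt permits this user agent to fetch the URL."""
        return self.parser.can_fetch(self.user_agent, url)

    def throttle(self) -> None:
        """Sleep just long enough to keep at least min_interval between requests."""
        wait = self.min_interval - (time.monotonic() - self._last)
        if wait > 0:
            time.sleep(wait)
        self._last = time.monotonic()
```

Call `allowed()` before each request and `throttle()` around it; skipping either is exactly the anti-pattern the list warns about.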
## Output Format
- JSON: Nested/hierarchical data
- CSV: Tabular data
- Filename: {source}_{timestamp}.{ext} (ASCII only, e.g., news_20250115.csv)
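The format and filename rules above can be sketched as a save helper. This is an assumption-laden sketch, not part of the skill's actual API: the `save` and `output_path` names are hypothetical, and the flat-vs-nested test is one simple way to route records to CSV or JSON:

```python
import csv
import json
import re
import time
from pathlib import Path

def output_path(source: str, ext: str, out_dir: str = ".") -> Path:
    """Build an ASCII-only {source}_{timestamp}.{ext} filename."""
    safe = re.sub(r"[^A-Za-z0-9_-]", "_", source)  # non-ASCII chars become '_'
    stamp = time.strftime("%Y%m%d")
    return Path(out_dir) / f"{safe}_{stamp}.{ext}"

def save(records: list[dict], source: str, out_dir: str = ".") -> Path:
    """Tabular (flat) records -> CSV; nested records -> JSON."""
    flat = all(not isinstance(v, (dict, list))
               for r in records for v in r.values())
    path = output_path(source, "csv" if flat else "json", out_dir)
    if flat:
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=list(records[0].keys()))
            writer.writeheader()
            writer.writerows(records)
    else:
        path.write_text(json.dumps(records, ensure_ascii=False, indent=2),
                        encoding="utf-8")
    return path
```

Note that only the filename is forced to ASCII; the file contents stay UTF-8, so non-ASCII field values survive intact.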
## Workflow
1. Ask: What data? Which sites? How much?
2. Select a tool based on page type
3. Extract and save structured data
4. Deliver the file path to the user, or pass it to data-analysis
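The tool-selection step can be sketched as a simple dispatch over the page characteristics from the table above (the function name and return labels are illustrative, not a real API):

```python
def choose_tool(js_rendered: bool, multi_step: bool) -> str:
    """Map page characteristics to a tool, following the selection table."""
    if multi_step:
        return "python-script"       # complex/reusable logic: rate limits, proxies
    if js_rendered:
        return "chrome-devtools-mcp"  # dynamic pages: SPAs, infinite scroll, logins
    return "requests"                 # static HTML: blogs, news, simple pages
```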
## Python Environment
Auto-initialize the virtual environment if needed, then execute:

```bash
cd skills/data-base
if [ ! -f ".venv/bin/python" ]; then
  echo "Creating Python environment..."
  ./setup.sh
fi
.venv/bin/python your_script.py
```

The setup script auto-installs: requests, beautifulsoup4, pandas, and web scraping tools.
## References (load on demand)
For detailed APIs and templates, load: references/REFERENCE.md, references/templates.md