data-base


Mental Model


Data acquisition converts unstructured web content into structured data. Choose the tool based on page complexity: JS-heavy pages → chrome-devtools MCP; static pages → Python requests.

Tool Selection


| Page Type | Tool | When to Use |
| --- | --- | --- |
| Dynamic (JS-rendered, SPAs) | chrome-devtools MCP | React/Vue apps, infinite scroll, login gates |
| Static HTML | Python requests | Blogs, news sites, simple pages |
| Complex/reusable logic | Python script | Multi-step scraping, rate limiting, proxies |
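The selection rule in the table can be sketched as a small helper. This is illustrative only; `choose_tool` and its parameters are hypothetical names, not part of any skill API:

```python
def choose_tool(js_rendered: bool, reusable_logic: bool = False) -> str:
    """Map page characteristics to the scraping tool suggested above."""
    if reusable_logic:
        # Multi-step scraping, rate limiting, proxies → standalone script
        return "Python script"
    if js_rendered:
        # React/Vue apps, infinite scroll, login gates → drive a real browser
        return "chrome-devtools MCP"
    # Blogs, news sites, simple static pages
    return "Python requests"


print(choose_tool(js_rendered=True))   # chrome-devtools MCP
print(choose_tool(js_rendered=False))  # Python requests
```

Reusable logic takes precedence here: even a static site warrants a script once rate limiting or proxies enter the picture.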

Anti-Patterns (NEVER)


  • Don't scrape without checking robots.txt
  • Don't overload servers (default: 1 request/sec)
  • Don't scrape personal data without consent
  • Don't use Chinese characters in output filenames (ASCII only)
  • Don't forget to identify your bot with a User-Agent header
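The first, second, and last rules can be enforced with the standard library alone. This sketch parses a robots.txt body directly (in practice you would fetch the site's real robots.txt first) and throttles to one request per second; the bot name is a placeholder:

```python
import time
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""  # stand-in; fetch https://example.com/robots.txt in real use

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

def fetch_allowed(url: str, user_agent: str = "my-data-bot/1.0") -> bool:
    """Check robots.txt before fetching; always identify the bot."""
    return rp.can_fetch(user_agent, url)

_last_request = 0.0

def rate_limit(min_interval: float = 1.0) -> None:
    """Block so at most one request is made per `min_interval` seconds."""
    global _last_request
    wait = _last_request + min_interval - time.monotonic()
    if wait > 0:
        time.sleep(wait)
    _last_request = time.monotonic()

print(fetch_allowed("https://example.com/news"))       # True
print(fetch_allowed("https://example.com/private/x"))  # False
```

Call `rate_limit()` immediately before each request; the same `User-Agent` string passed to `fetch_allowed` should also go into the request headers.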

Output Format


  • JSON: nested/hierarchical data
  • CSV: tabular data
  • Filename: `{source}_{timestamp}.{ext}` (ASCII only, e.g., `news_20250115.csv`)

Workflow


  1. Ask: What data? Which sites? How much?
  2. Select tool based on page type
  3. Extract and save structured data
  4. Deliver file path to user or pass to data-analysis
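Steps 3–4 above, saving extracted records and handing back a file path, can be sketched with the standard library. Field names and paths here are illustrative:

```python
import csv
import json
from pathlib import Path

def save_records(records: list[dict], path: str) -> str:
    """Save structured records as JSON or CSV, chosen by file extension."""
    out = Path(path)
    if out.suffix == ".json":
        # JSON for nested/hierarchical data
        out.write_text(json.dumps(records, ensure_ascii=False, indent=2))
    else:
        # CSV for tabular data (the default)
        with out.open("w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=records[0].keys())
            writer.writeheader()
            writer.writerows(records)
    return str(out)  # file path to deliver to the user or to data-analysis

rows = [{"title": "Example headline", "url": "https://example.com/a"}]
print(save_records(rows, "news_20250115.csv"))
```

Returning the path (rather than the data) keeps the hand-off to the data-analysis step cheap regardless of dataset size.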

Python Environment


Auto-initialize the virtual environment if needed, then execute:

```bash
cd skills/data-base

if [ ! -f ".venv/bin/python" ]; then
    echo "Creating Python environment..."
    ./setup.sh
fi

.venv/bin/python your_script.py
```

The setup script auto-installs requests, beautifulsoup4, pandas, and other web-scraping tools.

References (load on demand)


For detailed APIs and templates, load `references/REFERENCE.md` and `references/templates.md`.