data-analysis

Data Analysis

Generate rigorous statistical analysis code with multi-round review.

Input

  • $0: Data source (CSV, JSON, pickle, or experiment logs)
  • $1: Research goal or hypothesis to test

References

  • 4-round code review prompts:
    ~/.claude/skills/data-analysis/references/review-prompts.md

Scripts

Statistical summary and comparison

```bash
python ~/.claude/skills/data-analysis/scripts/stat_summary.py --input results.csv --compare method --metric accuracy --output summary.json
python ~/.claude/skills/data-analysis/scripts/stat_summary.py --input results.csv --describe
```
Detects data types, recommends tests, runs comparisons, and outputs effect sizes and significance stars. Requires numpy and scipy.
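
The description above does not say which effect size the script reports. As one common choice for a two-group comparison, Cohen's d with a pooled standard deviation can be sketched as follows. The function name `cohens_d` and the sample values are hypothetical and are not taken from `stat_summary.py`.

```python
import numpy as np


def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled standard deviation.

    Illustrative sketch only; not the implementation used by stat_summary.py.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n_a, n_b = a.size, b.size
    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)


# Hypothetical accuracy scores for two methods (illustrative values only).
baseline = [0.70, 0.72, 0.74, 0.76, 0.78]
proposed = [0.80, 0.82, 0.84, 0.86, 0.88]
d = cohens_d(proposed, baseline)
```

A positive `d` here means the first argument has the higher mean. The sign convention is a choice and should be stated alongside any reported value.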

Format p-values

```bash
python ~/.claude/skills/data-analysis/scripts/format_pvalue.py --values "0.001 0.05 0.23" --format stars
python ~/.claude/skills/data-analysis/scripts/format_pvalue.py --csv results.csv --column pvalue --format latex
```
Formats p-values with stars, LaTeX notation, or plain text. Stdlib-only.
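
To make the star format concrete, here is a minimal stdlib-only sketch of a star mapping. The thresholds (0.001, 0.01, 0.05 with strict inequalities) and the `ns` label are a widely used convention assumed for illustration. The actual cutoffs and output of `format_pvalue.py` are not documented here and may differ, and `pvalue_to_stars` is a hypothetical name.

```python
def pvalue_to_stars(p):
    """Map a p-value to significance stars using conventional thresholds (assumed)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError(f"p-value must be in [0, 1], got {p}")
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return "ns"  # not significant at the 0.05 level


# Same values as in the command-line example above.
values = [float(v) for v in "0.001 0.05 0.23".split()]
stars = [pvalue_to_stars(p) for p in values]
# stars == ['**', 'ns', 'ns'] because the comparisons are strict
```

With strict inequalities a value exactly at a threshold falls into the weaker category, so whichever convention is used should be stated in the table or figure caption.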

Workflow

Step 1: Generate Analysis Code

Structure the code with these sections:
  1. # IMPORT: pandas, numpy, scipy, statsmodels, sklearn
  2. # LOAD DATA: Load from original data files
  3. # DATASET PREPARATIONS: Missing values, units, exclusion criteria
  4. # DESCRIPTIVE STATISTICS: Summary tables if needed
  5. # PREPROCESSING: Dummy variables, normalization
  6. # ANALYSIS: Statistical tests per hypothesis
  7. # SAVE ADDITIONAL RESULTS: Extra results to pickle
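
The section layout above can be sketched as a small script. This is an illustrative skeleton, not generated output. The inline CSV stands in for the original data file so the sketch runs on its own, the column names `method` and `accuracy` are borrowed from the command-line example, and the choice of Welch's t-test is an assumption made for the example.

```python
# IMPORT
import io
import pickle

import pandas as pd
from scipy import stats

# LOAD DATA
# Stand-in for reading the original file, e.g. pd.read_csv("results.csv").
raw_csv = io.StringIO(
    "method,accuracy\n"
    "A,0.71\nA,0.74\nA,0.69\nA,\n"
    "B,0.80\nB,0.83\nB,0.78\nB,0.81\n"
)
df = pd.read_csv(raw_csv)

# DATASET PREPARATIONS
# Drop rows with a missing metric; a real analysis should justify this choice.
df = df.dropna(subset=["accuracy"])

# DESCRIPTIVE STATISTICS
summary = df.groupby("method")["accuracy"].agg(["count", "mean", "std"])

# PREPROCESSING
# Dummy-code the grouping variable (1 if method is B, else 0).
df["is_b"] = (df["method"] == "B").astype(int)

# ANALYSIS
group_a = df.loc[df["method"] == "A", "accuracy"]
group_b = df.loc[df["method"] == "B", "accuracy"]
# Welch's t-test (unequal variances) is assumed here for illustration.
result = stats.ttest_ind(group_a, group_b, equal_var=False)

# SAVE ADDITIONAL RESULTS
additional_results = {
    "n_rows": len(df),
    "t_statistic": float(result.statistic),
    "p_value": float(result.pvalue),
}
payload = pickle.dumps(additional_results)  # write this to a .pkl file in practice
```

Keeping each stage under its own comment header makes the later review rounds easier, since each round can point at a specific section.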

Step 2: 4-Round Code Review

  1. Round 1 — Code Flaws: Mathematical/statistical errors, wrong calculations, trivial tests
  2. Round 2 — Data Handling: Missing values, units, preprocessing, test choice
  3. Round 3 — Per-Table: Sensible values, measures of uncertainty, missing data
  4. Round 4 — Cross-Table: Completeness, consistency, missing variables

Step 3: Produce Results

  • Every nominal value must have uncertainty (CI, STD, or p-value)
  • Statistical tests must be appropriate for the data type
  • Results must match actual data — never hallucinate
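
As an illustration of the first requirement, a mean can be reported together with a t-based confidence interval. This is a minimal sketch: `mean_with_ci` is a hypothetical helper, the scores are made-up values, and a t-interval assumes roughly normal data.

```python
import numpy as np
from scipy import stats


def mean_with_ci(values, confidence=0.95):
    """Return (mean, lower, upper) for a t-based confidence interval of the mean."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    sem = stats.sem(values)  # standard error of the mean (ddof=1)
    low, high = stats.t.interval(confidence, df=values.size - 1, loc=mean, scale=sem)
    return mean, low, high


# Hypothetical accuracy scores (illustrative values only).
scores = [0.80, 0.83, 0.78, 0.81, 0.79]
mean, low, high = mean_with_ci(scores)
report = f"{mean:.3f} (95% CI {low:.3f} to {high:.3f})"
```

Building the reported string directly from the computed values, rather than typing numbers by hand, also helps keep results tied to the actual data.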

Allowed Packages

pandas, numpy, scipy, statsmodels, sklearn, pickle

Statistical Test Selection

| Data Type | Test |
|---|---|
| Two groups, normal | Independent t-test |
| Two groups, non-normal | Mann-Whitney U |
| Paired samples | Paired t-test / Wilcoxon |
| Multiple groups | ANOVA / Kruskal-Wallis |
| Categorical | Chi-square / Fisher's exact |
| Correlation | Pearson / Spearman |
| Regression | OLS / Logistic / Mixed effects |
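
The first two rows of the table can be sketched as a small selection routine. Using a Shapiro-Wilk test at `alpha = 0.05` as the normality criterion is an assumption made for this example, since the table does not say how normality should be judged. `compare_two_groups` is a hypothetical name and the data are made up.

```python
import numpy as np
from scipy import stats


def compare_two_groups(a, b, alpha=0.05):
    """Pick a two-group test from a Shapiro-Wilk normality check (assumed criterion)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Treat the data as normal only if neither group rejects normality.
    normal = all(stats.shapiro(group).pvalue > alpha for group in (a, b))
    if normal:
        name = "Independent t-test"
        result = stats.ttest_ind(a, b)
    else:
        name = "Mann-Whitney U"
        result = stats.mannwhitneyu(a, b, alternative="two-sided")
    return name, float(result.statistic), float(result.pvalue)


# Hypothetical samples; the first is strongly right-skewed.
skewed = [1, 1, 2, 2, 2, 3, 3, 4, 50, 120]
other = [2, 3, 3, 4, 4, 5, 5, 6, 7, 8]
test_name, statistic, p_value = compare_two_groups(skewed, other)
```

Normality tests have little power on small samples, so in practice the choice is often also informed by plots and by what is known about how the data were generated.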

Rules

  • Always report p-values for statistical tests
  • Account for relevant confounding variables
  • Use inherent package functionality (e.g., formula = "y ~ a * b" for interactions)
  • Do not manually implement available statistical functions
  • Access dataframes using string-based column names, not integer indices
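
Several of these rules can be seen together in a short statsmodels sketch: the formula `y ~ a * b` expands to both main effects plus the `a:b` interaction, p-values come from the fitted model, and coefficients are looked up by name. The data are simulated purely for illustration, and the variable names `y`, `a`, and `b` follow the formula in the rule above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data for illustration only; not real results.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({"a": rng.normal(size=n), "b": rng.integers(0, 2, size=n)})
df["y"] = (
    1.0
    + 2.0 * df["a"]
    + 0.5 * df["b"]
    + 1.5 * df["a"] * df["b"]
    + rng.normal(scale=0.5, size=n)
)

# "a * b" expands to a + b + a:b, so the interaction term is not built by hand.
model = smf.ols(formula="y ~ a * b", data=df).fit()

# Access results by term name rather than by position.
interaction_coef = model.params["a:b"]
interaction_p = model.pvalues["a:b"]
interaction_ci = model.conf_int().loc["a:b"]
```

Looking terms up by name keeps the code correct if the formula later gains or loses a term, which positional indexing would not guarantee.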

Related Skills

  • Upstream: experiment-code, experiment-design
  • Downstream: table-generation, figure-generation, backward-traceability
  • See also: math-reasoning