data-analysis
Data Analysis
Generate rigorous statistical analysis code with multi-round review.
Input
- `$0`: Data source (CSV, JSON, pickle, or experiment logs)
- `$1`: Research goal or hypothesis to test
References
- 4-round code review prompts: `~/.claude/skills/data-analysis/references/review-prompts.md`
Scripts
Statistical summary and comparison
```bash
python ~/.claude/skills/data-analysis/scripts/stat_summary.py --input results.csv --compare method --metric accuracy --output summary.json
python ~/.claude/skills/data-analysis/scripts/stat_summary.py --input results.csv --describe
```
Detects data types, recommends tests, runs comparisons, and outputs effect sizes and significance stars. Requires numpy and scipy.
Format p-values
```bash
python ~/.claude/skills/data-analysis/scripts/format_pvalue.py --values "0.001 0.05 0.23" --format stars
python ~/.claude/skills/data-analysis/scripts/format_pvalue.py --csv results.csv --column pvalue --format latex
```
Formats p-values with stars, LaTeX notation, or plain text. Stdlib-only.
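As a rough illustration of what star, LaTeX, and plain-text formatting can look like, here is a minimal stdlib-only sketch. It is not the source of `format_pvalue.py`: the function name, the `style` argument, and the 0.05 / 0.01 / 0.001 thresholds are assumptions based on common convention.

```python
def format_pvalue(p: float, style: str = "stars") -> str:
    """Format a p-value (hypothetical helper, not the skill's script).

    Thresholds follow the common 0.05 / 0.01 / 0.001 convention (assumed).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError(f"p-value out of range: {p}")

    if style == "stars":
        if p < 0.001:
            return "***"
        if p < 0.01:
            return "**"
        if p < 0.05:
            return "*"
        return "ns"  # not significant at the 0.05 level

    if style == "latex":
        return "$p < 0.001$" if p < 0.001 else f"$p = {p:.3f}$"

    if style == "plain":
        return "p < 0.001" if p < 0.001 else f"p = {p:.3f}"

    raise ValueError(f"unknown style: {style}")
```

Note that the comparisons are strict, so under these assumed thresholds a value of exactly 0.05 is reported as not significant.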
Workflow
Step 1: Generate Analysis Code
Structure the code with these sections:
- `# IMPORT`: pandas, numpy, scipy, statsmodels, sklearn
- `# LOAD DATA`: Load from original data files
- `# DATASET PREPARATIONS`: Missing values, units, exclusion criteria
- `# DESCRIPTIVE STATISTICS`: Summary tables if needed
- `# PREPROCESSING`: Dummy variables, normalization
- `# ANALYSIS`: Statistical tests per hypothesis
- `# SAVE ADDITIONAL RESULTS`: Extra results to pickle
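The section layout above might translate into a script along the following lines. This is a sketch under stated assumptions: the column names `method` and `accuracy` are borrowed from the script examples earlier, and `run_analysis`, the two-group comparison, and the output path are hypothetical choices for illustration only.

```python
# IMPORT
import pickle

import pandas as pd
from scipy import stats


def run_analysis(csv_path: str, out_path: str) -> dict:
    # LOAD DATA
    df = pd.read_csv(csv_path)

    # DATASET PREPARATIONS
    # Exclusion criterion (assumed): drop rows missing the group or the metric.
    df = df.dropna(subset=["method", "accuracy"])

    # DESCRIPTIVE STATISTICS
    summary = df.groupby("method")["accuracy"].agg(["count", "mean", "std"])

    # PREPROCESSING
    # Nothing to encode or normalize for this simple two-group comparison.

    # ANALYSIS
    groups = [g["accuracy"].to_numpy() for _, g in df.groupby("method")]
    if len(groups) != 2:
        raise ValueError("this sketch compares exactly two methods")
    t_stat, p_value = stats.ttest_ind(groups[0], groups[1], equal_var=False)

    # SAVE ADDITIONAL RESULTS
    results = {
        "summary": summary,
        "t_stat": float(t_stat),
        "p_value": float(p_value),
    }
    with open(out_path, "wb") as fh:
        pickle.dump(results, fh)
    return results
```

Keeping each stage under its own comment header makes it easier for the review rounds to check data handling separately from the statistical tests.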
Step 2: 4-Round Code Review
- Round 1 — Code Flaws: Mathematical/statistical errors, wrong calculations, trivial tests
- Round 2 — Data Handling: Missing values, units, preprocessing, test choice
- Round 3 — Per-Table: Sensible values, measures of uncertainty, missing data
- Round 4 — Cross-Table: Completeness, consistency, missing variables
Step 3: Produce Results
- Every nominal value must have uncertainty (CI, STD, or p-value)
- Statistical tests must be appropriate for the data type
- Results must match actual data — never hallucinate
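One way to attach uncertainty to a reported mean is a t-based confidence interval. The helper below is a minimal sketch, assuming the values are independent observations; `mean_with_ci` is a hypothetical name, not part of the skill.

```python
import numpy as np
from scipy import stats


def mean_with_ci(values, confidence: float = 0.95):
    """Return (mean, lower, upper) using a Student-t interval."""
    x = np.asarray(values, dtype=float)
    x = x[~np.isnan(x)]  # ignore missing observations
    n = x.size
    if n < 2:
        raise ValueError("need at least two observations for an interval")

    mean = x.mean()
    sem = x.std(ddof=1) / np.sqrt(n)  # standard error of the mean
    half_width = stats.t.ppf((1 + confidence) / 2, df=n - 1) * sem
    return mean, mean - half_width, mean + half_width
```

The returned triple can then be reported as, for example, mean with 95% CI bounds in a results table, computed from the actual data rather than typed in by hand.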
Allowed Packages
`pandas`, `numpy`, `scipy`, `statsmodels`, `sklearn`, `pickle`
Statistical Test Selection
| Data Type | Test |
|---|---|
| Two groups, normal | Independent t-test |
| Two groups, non-normal | Mann-Whitney U |
| Paired samples | Paired t-test / Wilcoxon |
| Multiple groups | ANOVA / Kruskal-Wallis |
| Categorical | Chi-square / Fisher's exact |
| Correlation | Pearson / Spearman |
| Regression | OLS / Logistic / Mixed effects |
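The first two rows of the table can be sketched as a small selection routine. This is an illustration only: using Shapiro-Wilk as the normality check, the `alpha` cutoff, and the name `compare_two_groups` are assumptions, and the t-test shown is Welch's variant of the independent t-test, which does not assume equal variances.

```python
from scipy import stats


def compare_two_groups(a, b, alpha: float = 0.05):
    """Pick a two-group test based on a normality check (hypothetical helper).

    Returns (test_name, statistic, p_value).
    """
    # Shapiro-Wilk: a small p-value suggests the sample is not normal.
    looks_normal = (
        stats.shapiro(a).pvalue > alpha and stats.shapiro(b).pvalue > alpha
    )

    if looks_normal:
        res = stats.ttest_ind(a, b, equal_var=False)  # Welch's t-test
        return "Welch t-test", float(res.statistic), float(res.pvalue)

    res = stats.mannwhitneyu(a, b, alternative="two-sided")
    return "Mann-Whitney U", float(res.statistic), float(res.pvalue)
```

A normality pre-test is only one heuristic; with small samples it has little power, so the choice of test may also need to rest on what is known about how the data were generated.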
Rules
- Always report p-values for statistical tests
- Account for relevant confounding variables
- Use inherent package functionality (e.g., `formula = "y ~ a * b"` for interactions)
- Do not manually implement available statistical functions
- Access dataframes using string-based column names, not integer indices
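A short sketch of several of these rules together, assuming statsmodels' formula interface: the interaction comes from the `a * b` formula term rather than a hand-built product column, columns are accessed by name, and the p-value is read from the fitted model. The data here are simulated purely for illustration, and the coefficients used to generate them are arbitrary.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data (illustrative only): y depends on a, b, and their interaction.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "a": rng.normal(size=n),
    "b": rng.integers(0, 2, size=n),
})
df["y"] = (
    1.0 + 2.0 * df["a"] + 0.5 * df["b"] + 1.5 * df["a"] * df["b"]
    + rng.normal(scale=0.5, size=n)
)

# "a * b" expands to a + b + a:b, so main effects and interaction are all fitted.
model = smf.ols(formula="y ~ a * b", data=df).fit()

# Report the estimate with its p-value and confidence interval, by term name.
interaction_estimate = model.params["a:b"]
interaction_pvalue = model.pvalues["a:b"]
interaction_ci = model.conf_int().loc["a:b"]
```

Confounders can be added as further terms in the same formula string, which keeps the model specification in one readable place.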
Related Skills
- Upstream: experiment-code, experiment-design
- Downstream: table-generation, figure-generation, backward-traceability
- See also: math-reasoning