data-analysis
Data Analysis
Generate rigorous statistical analysis code with multi-round review.
Input
- `$0`: Data source (CSV, JSON, pickle, or experiment logs)
- `$1`: Research goal or hypothesis to test
References
- 4-round code review prompts: `~/.claude/skills/data-analysis/references/review-prompts.md`
Scripts
Statistical summary and comparison
```bash
python ~/.claude/skills/data-analysis/scripts/stat_summary.py --input results.csv --compare method --metric accuracy --output summary.json
python ~/.claude/skills/data-analysis/scripts/stat_summary.py --input results.csv --describe
```
Detects data types, recommends tests, runs comparisons, and outputs effect sizes and significance stars. Requires numpy and scipy.
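The internals of `stat_summary.py` are not shown in this document. As a rough illustration of what a two-group comparison with an effect size can look like, here is a minimal sketch. The helper name `cohens_d`, the choice of Welch's t-test, and the synthetic accuracy values are all assumptions made for the example, not the script's actual behavior.

```python
import numpy as np
from scipy import stats


def cohens_d(a, b):
    """Cohen's d with a pooled standard deviation (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    na, nb = a.size, b.size
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)


# Synthetic accuracies for two methods; a real run would read them from results.csv.
rng = np.random.default_rng(0)
baseline = rng.normal(loc=0.80, scale=0.02, size=20)
proposed = rng.normal(loc=0.83, scale=0.02, size=20)

# Welch's t-test does not assume equal variances between the two groups.
t_stat, p_value = stats.ttest_ind(proposed, baseline, equal_var=False)
effect = cohens_d(proposed, baseline)
print(f"t = {t_stat:.3f}, p = {p_value:.4g}, d = {effect:.2f}")
```

Reporting the effect size next to the p-value keeps the size of a difference separate from its statistical significance.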
Format p-values
```bash
python ~/.claude/skills/data-analysis/scripts/format_pvalue.py --values "0.001 0.05 0.23" --format stars
python ~/.claude/skills/data-analysis/scripts/format_pvalue.py --csv results.csv --column pvalue --format latex
```
Formats p-values with stars, LaTeX notation, or plain text. Stdlib-only.
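For readers unfamiliar with star notation, the sketch below shows one common convention using only the standard library. The thresholds (0.001, 0.01, 0.05), the `ns` label, and the function names `pvalue_to_stars` and `pvalue_to_latex` are assumptions for illustration; `format_pvalue.py` may use different cutoffs or output.

```python
def pvalue_to_stars(p: float) -> str:
    """Map a p-value to a significance marker (assumed conventional cutoffs)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError(f"p-value out of range: {p}")
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return "ns"


def pvalue_to_latex(p: float) -> str:
    """Render a p-value as inline LaTeX, reporting very small values as a bound."""
    if not 0.0 <= p <= 1.0:
        raise ValueError(f"p-value out of range: {p}")
    if p < 0.001:
        return "$p < 0.001$"
    return f"$p = {p:.3f}$"


for p in (float(v) for v in "0.0004 0.03 0.23".split()):
    print(p, pvalue_to_stars(p), pvalue_to_latex(p))
```

The comparisons are strict, so a value exactly equal to a cutoff falls into the less significant bin under this convention.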
Workflow
Step 1: Generate Analysis Code
Structure the code with these sections:
- `# IMPORT`: pandas, numpy, scipy, statsmodels, sklearn
- `# LOAD DATA`: Load from original data files
- `# DATASET PREPARATIONS`: Missing values, units, exclusion criteria
- `# DESCRIPTIVE STATISTICS`: Summary tables if needed
- `# PREPROCESSING`: Dummy variables, normalization
- `# ANALYSIS`: Statistical tests per hypothesis
- `# SAVE ADDITIONAL RESULTS`: Extra results to pickle
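A minimal skeleton following this section layout might look like the sketch below. The column names `method` and `accuracy`, the inline CSV rows (made up for illustration), and the choice of Welch's t-test are assumptions; a real analysis would load the original data files and pick tests to match the hypothesis.

```python
# IMPORT
import io
import pickle

import pandas as pd
from scipy import stats

# LOAD DATA
# Stand-in for pd.read_csv("results.csv"); these rows are illustrative only.
raw_csv = io.StringIO(
    "method,accuracy\n"
    "baseline,0.80\nbaseline,0.82\nbaseline,0.79\nbaseline,\n"
    "proposed,0.85\nproposed,0.84\nproposed,0.86\nproposed,0.83\n"
)
df = pd.read_csv(raw_csv)

# DATASET PREPARATIONS
# Exclusion criterion for this sketch: drop rows with a missing metric.
df = df.dropna(subset=["accuracy"])

# DESCRIPTIVE STATISTICS
summary = df.groupby("method")["accuracy"].agg(["count", "mean", "std"])
print(summary)

# PREPROCESSING
# No dummy variables or normalization are needed for a two-group comparison.

# ANALYSIS
baseline = df.loc[df["method"] == "baseline", "accuracy"]
proposed = df.loc[df["method"] == "proposed", "accuracy"]
result = stats.ttest_ind(proposed, baseline, equal_var=False)

# SAVE ADDITIONAL RESULTS
additional_results = {
    "t_statistic": float(result.statistic),
    "p_value": float(result.pvalue),
    "n_rows": int(len(df)),
}
# A real run would write this to disk with pickle.dump(additional_results, file).
payload = pickle.dumps(additional_results)
```

Keeping each stage under its own comment header makes it easier to check one concern at a time during review.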
Step 2: 4-Round Code Review
- Round 1 — Code Flaws: Mathematical/statistical errors, wrong calculations, trivial tests
- Round 2 — Data Handling: Missing values, units, preprocessing, test choice
- Round 3 — Per-Table: Sensible values, measures of uncertainty, missing data
- Round 4 — Cross-Table: Completeness, consistency, missing variables
Step 3: Produce Results
- Every nominal value must have uncertainty (CI, STD, or p-value)
- Statistical tests must be appropriate for the data type
- Results must match the actual data; never fabricate or hallucinate values
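One way to attach uncertainty to a reported mean is a t-based confidence interval. The sketch below assumes approximately normal data and a 95% level; the helper name `mean_with_ci` and the sample values are hypothetical.

```python
import numpy as np
from scipy import stats


def mean_with_ci(values, confidence=0.95):
    """Return (mean, lower, upper) for a two-sided t-based confidence interval."""
    x = np.asarray(values, dtype=float)
    n = x.size
    if n < 2:
        raise ValueError("need at least two observations to estimate uncertainty")
    mean = x.mean()
    sem = stats.sem(x)  # standard error of the mean, ddof=1
    half_width = sem * stats.t.ppf((1 + confidence) / 2, df=n - 1)
    return mean, mean - half_width, mean + half_width


# Illustrative values only.
mean, lower, upper = mean_with_ci([1.0, 2.0, 3.0, 4.0, 5.0])
print(f"mean = {mean:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
```

For small or clearly non-normal samples, a bootstrap interval may be a better fit than the t-based one used here.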
Allowed Packages
`pandas`, `numpy`, `scipy`, `statsmodels`, `sklearn`, `pickle`
Statistical Test Selection
| Data Type | Test |
|---|---|
| Two groups, normal | Independent t-test |
| Two groups, non-normal | Mann-Whitney U |
| Paired samples | Paired t-test / Wilcoxon |
| Multiple groups | ANOVA / Kruskal-Wallis |
| Categorical | Chi-square / Fisher's exact |
| Correlation | Pearson / Spearman |
| Regression | OLS / Logistic / Mixed effects |
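The first two rows of the table can be sketched as a small dispatcher for two independent groups. Using the Shapiro-Wilk test with a 0.05 cutoff as the normality check is an assumption of this example, as are the function name `compare_two_groups` and the sample data; in practice the check should be combined with plots and domain knowledge.

```python
import numpy as np
from scipy import stats


def compare_two_groups(a, b, alpha=0.05):
    """Pick a two-group test from a normality check and return (test name, p-value)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Treat the data as normal only if neither group rejects normality.
    looks_normal = stats.shapiro(a).pvalue > alpha and stats.shapiro(b).pvalue > alpha
    if looks_normal:
        name = "Independent t-test"
        result = stats.ttest_ind(a, b)
    else:
        name = "Mann-Whitney U"
        result = stats.mannwhitneyu(a, b, alternative="two-sided")
    return name, float(result.pvalue)


# Illustrative inputs: normal quantiles versus a sample with one extreme outlier.
normal_a = stats.norm.ppf(np.linspace(0.05, 0.95, 15))
normal_b = normal_a + 0.5
skewed = np.array([1, 2, 1, 2, 1, 2, 1, 2, 1, 50], dtype=float)

print(compare_two_groups(normal_a, normal_b))
print(compare_two_groups(skewed, normal_a))
```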
Rules
- Always report p-values for statistical tests
- Account for relevant confounding variables
- Use inherent package functionality (e.g., `formula = "y ~ a * b"` for interactions)
- Do not manually implement available statistical functions
- Access dataframes using string-based column names, not integer indices
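As a sketch of the formula and column-name rules, the example below fits an OLS model with an interaction term through the statsmodels formula interface. The data are simulated with a known interaction coefficient of 1.5; the column names `y`, `a`, and `b` follow the formula quoted in the rules, and everything else is an assumption for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data: continuous predictor a, binary predictor b, known interaction.
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    "a": rng.normal(size=n),
    "b": rng.integers(0, 2, size=n),
})
df["y"] = (
    1.0 + 2.0 * df["a"] + 0.5 * df["b"] + 1.5 * df["a"] * df["b"]
    + rng.normal(scale=0.5, size=n)
)

# "a * b" expands to a + b + a:b, so no product column is built by hand.
model = smf.ols(formula="y ~ a * b", data=df).fit()

# Access results by string name rather than by position.
print(model.params[["a", "b", "a:b"]])
print("interaction p-value:", model.pvalues["a:b"])
```

Letting the formula expand the interaction avoids a hand-built product column and keeps the coefficient labels readable in the output.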
Related Skills
- Upstream: experiment-code, experiment-design
- Downstream: table-generation, figure-generation, backward-traceability
- See also: math-reasoning