stat-eda
Exploratory Data Analysis (EDA)
Framework
IRON LAW: Perform EDA Only AFTER Train/Test Split — Or You Leak the Future
Agents know "do EDA first." But they almost always do EDA on the FULL
dataset before splitting. This is information leakage: you've seen the
test set's distributions, outliers, and correlations, and your subsequent
modeling choices (feature scaling, outlier treatment, imputation strategy)
are now informed by data the model shouldn't see. Split first, then EDA
only on the training set. Apply the same transformations to the test set
without re-examining it.
Exception: data quality checks (nulls, dtypes, duplicates) CAN run on
the full dataset since they don't inform model hyperparameters.
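The split-first rule can be sketched as follows. This is a minimal illustration under stated assumptions, not a prescribed implementation: the toy columns (`income`, `age`), the 80/20 split, the 99th-percentile cap, and the helper `apply_transforms` are all hypothetical choices made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative assumptions.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "income": rng.lognormal(mean=10, sigma=1, size=1000),
    "age": rng.integers(18, 80, size=1000).astype(float),
})
df.loc[rng.choice(1000, size=50, replace=False), "age"] = np.nan

# Exception from the text: quality checks may run on the full dataset.
null_counts = df.isna().sum()
n_duplicates = int(df.duplicated().sum())

# Split BEFORE looking at distributions, outliers, or correlations.
train = df.sample(frac=0.8, random_state=42)
test = df.drop(train.index)

# Derive every parameter from the training set only.
income_cap = train["income"].quantile(0.99)
capped = train["income"].clip(upper=income_cap)
params = {
    "age_median": train["age"].median(),
    "income_cap": income_cap,
    "income_mean": capped.mean(),
    "income_std": capped.std(),
}

def apply_transforms(frame, params):
    """Apply train-derived imputation, capping, and scaling to any split."""
    out = frame.copy()
    out["age"] = out["age"].fillna(params["age_median"])
    out["income"] = out["income"].clip(upper=params["income_cap"])
    out["income_z"] = (out["income"] - params["income_mean"]) / params["income_std"]
    return out

train_ready = apply_transforms(train, params)
# The test set receives the same transformations without being re-examined.
test_ready = apply_transforms(test, params)
```

The point of routing both splits through one function is that the test set never contributes to `params`; it only consumes them.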
EDA Workflow
Standard five-phase flow (structure → quality → univariate → bivariate →
findings summary). Assume the agent already knows these steps. Focus on
the non-obvious traps below instead.
Critical additions most EDA guides miss:
- Split BEFORE explore (see IRON LAW above)
- Missing data pattern matters more than count: MCAR (missing completely at random) is generally safe to impute; MNAR (missing not at random, e.g. high-income respondents skip the income question) requires domain modeling, not mean-fill
- Simpson's paradox check: If a trend holds in the aggregate but reverses within subgroups, the aggregate trend is misleading. Always stratify by the most obvious confound before reporting a bivariate finding
- Data leakage in features: A feature that correlates almost perfectly with the target is usually derived FROM the target (e.g. "refund_amount" predicting churn: it's an effect, not a cause). Flag any feature with |r| > 0.95 against the target for causal review
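Two of the checks in the list above, the Simpson's paradox check and the leakage flag, might be sketched like this. The function names `simpson_check` and `flag_leaky_features`, the toy frames, and the column names are assumptions made for illustration; both frames stand in for a training split.

```python
import numpy as np
import pandas as pd

def simpson_check(df, x, y, group):
    """Compare the aggregate x-y correlation with the correlation inside each subgroup."""
    overall = df[x].corr(df[y])
    per_group = {key: g[x].corr(g[y]) for key, g in df.groupby(group)}
    # A subgroup "reverses" when its correlation has the opposite sign.
    reversed_groups = [key for key, r in per_group.items() if r * overall < 0]
    return overall, per_group, reversed_groups

def flag_leaky_features(df, target, threshold=0.95):
    """Return numeric features whose absolute correlation with the target exceeds the threshold."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.drop(columns=[target]).corrwith(numeric[target])
    return corr[corr.abs() > threshold].index.tolist()

# Toy frame: y falls with x inside each segment, but segment B sits higher on both axes.
x_a, x_b = np.arange(10.0), np.arange(10.0, 20.0)
simpson_df = pd.DataFrame({
    "segment": ["A"] * 10 + ["B"] * 10,
    "x": np.concatenate([x_a, x_b]),
    "y": np.concatenate([10 - 0.5 * x_a, 30 - 0.5 * (x_b - 10)]),
})
overall, per_group, reversed_groups = simpson_check(simpson_df, "x", "y", "segment")

# Toy frame: refund_amount is an effect of churn, so it tracks the target exactly.
churn = np.arange(20) % 2
leak_df = pd.DataFrame({
    "churn": churn,
    "refund_amount": churn * 50.0,
    "tenure": np.arange(20) % 7,
})
leaky = flag_leaky_features(leak_df, target="churn")
```

A flagged feature is a prompt for causal review, not an automatic drop; the threshold only narrows down where to look.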
For the visualization selection guide, see references/missing-data.md.
Output Format
EDA Report: {Dataset Name}
Dataset Overview
- Rows: {N}, Columns: {N}
- Date range: {if applicable}
- Key columns: {description}
Data Quality
| Issue | Columns Affected | Count/% | Action |
|---|---|---|---|
| Missing values | {cols} | {N / %} | {drop / impute / investigate} |
| Outliers | {cols} | {N} | {cap / remove / keep} |
| Duplicates | — | {N} | {remove} |
Key Statistics
| Variable | Mean | Median | Std | Min | Max | Distribution |
|---|---|---|---|---|---|---|
| {var} | ... | ... | ... | ... | ... | {normal/skewed/bimodal} |
Key Findings
- {insight with supporting data}
- {insight}
- {insight}
Recommendations
- {next analysis step or data issue to resolve}
Gotchas
- Correlation ≠ causation: EDA finds associations. Establishing causation requires controlled experiments or causal inference methods.
- Outliers can be data errors OR real signal: Don't auto-remove. Investigate. A transaction amount of $1M might be a typo or your biggest customer.
- Missing data has meaning: Data missing from one column may be related to values in another. "Missing income" may mean "unemployed", not random. Check patterns.
- Visualization lies: Truncated Y-axes, cherry-picked time ranges, and misleading scales can distort insights. Always use appropriate scales and note limitations.
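The outlier and missing-data gotchas above can be turned into checks that flag rows for investigation rather than removing them. The sketch below is illustrative only: the helper names `flag_outliers_iqr` and `missing_rate_by`, the 1.5 × IQR rule, and the toy columns are assumptions, not part of the original guidance.

```python
import numpy as np
import pandas as pd

def flag_outliers_iqr(series, k=1.5):
    """Mark values outside [Q1 - k*IQR, Q3 + k*IQR]; flags only, never removes."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

def missing_rate_by(df, col, by):
    """Share of missing values in `col` for each value of `by`."""
    return df[col].isna().groupby(df[by]).mean()

# Hypothetical toy data: one extreme transaction, and income missing mostly for the unemployed.
toy = pd.DataFrame({
    "amount": [20, 25, 30, 22, 27, 24, 26, 23, 28, 1_000_000],
    "employment": ["employed"] * 6 + ["unemployed"] * 4,
    "income": [52.0, 48.0, 61.0, 55.0, 47.0, 58.0, np.nan, np.nan, np.nan, 12.0],
})

# Rows to investigate: typo or biggest customer? The code cannot decide that.
to_investigate = toy[flag_outliers_iqr(toy["amount"])]

# A large gap between groups suggests the missingness is not random.
rates = missing_rate_by(toy, "income", "employment")
```

Returning a boolean mask keeps the decision (cap, remove, keep) with the analyst, which matches the "don't auto-remove" advice.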
References
- For missing data handling strategies, see references/missing-data.md