stat-eda

Exploratory Data Analysis (EDA)

Framework

IRON LAW: Perform EDA Only AFTER Train/Test Split — Or You Leak the Future

Agents know "do EDA first." But they almost always do EDA on the FULL
dataset before splitting. This is information leakage: you've seen the
test set's distributions, outliers, and correlations, and your subsequent
modeling choices (feature scaling, outlier treatment, imputation strategy)
are now informed by data the model shouldn't see. Split first, then run EDA
only on the training set. Fit every transformation on the training set and
apply it unchanged to the test set, without re-examining it.

Exception: data quality checks (nulls, dtypes, duplicates) CAN run on
the full dataset since they don't inform model hyperparameters.
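
A minimal sketch of the split-first pattern, using only the Python standard library so it runs anywhere. The helper names (`train_test_split_rows`, `fit_standardizer`, `standardize`) and the toy income values are hypothetical, not part of this skill. With scikit-learn the same idea is usually written as `train_test_split` followed by fitting a scaler on the training part only.

```python
import random
import statistics


def train_test_split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_standardizer(train_values):
    """Learn mean and standard deviation from the training values only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)
    # Guard against a constant column, which would divide by zero.
    return mean, (std if std > 0 else 1.0)


def standardize(values, mean, std):
    """Apply already-fitted parameters; nothing is re-estimated here."""
    return [(v - mean) / std for v in values]


# Hypothetical toy column with one extreme value.
incomes = [31.0, 45.0, 52.0, 38.0, 250.0, 41.0, 47.0, 36.0, 44.0, 39.0]

# 1. Split first. The test part is set aside and not inspected.
train, test = train_test_split_rows(incomes, test_fraction=0.2, seed=42)

# 2. Explore and fit on the training part only.
mean, std = fit_standardizer(train)

# 3. Apply the same fitted parameters to both parts.
train_scaled = standardize(train, mean, std)
test_scaled = standardize(test, mean, std)
```

The point of the design is that `fit_standardizer` never receives test rows, so the scaling parameters cannot be influenced by them.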

EDA Workflow

Standard five-phase flow (structure → quality → univariate → bivariate → findings summary). Assume the agent already knows these steps. Focus on the non-obvious traps below instead.
Critical additions most EDA guides miss:
  1. Split BEFORE explore (see IRON LAW above)
  2. Missing data pattern matters more than count: data that is MCAR (missing completely at random) is safe to impute; MNAR (missing not at random, e.g. high-income respondents skipping the income question) requires domain modeling, not mean-fill
  3. Simpson's paradox check: If a trend holds in the aggregate but reverses within subgroups, the aggregate trend is misleading. Always stratify by the most obvious confound before reporting a bivariate finding
  4. Data leakage in features: A feature that correlates almost perfectly with the target is usually derived FROM the target (e.g. "refund_amount" predicting churn: it's an effect, not a cause). Flag any feature with |r| > 0.95 for causal review
For the visualization selection guide, see references/missing-data.md.
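
Items 3 and 4 above can be sketched with the standard library as follows. This is an illustration under assumptions, not a prescribed implementation: the function names, the toy data, and the use of Pearson correlation as the trend measure are all choices made for the example. Only the 0.95 threshold comes from the text.

```python
import math
from collections import defaultdict


def pearson_r(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def simpson_check(xs, ys, groups):
    """Compare the aggregate trend with the trend inside each subgroup."""
    overall = pearson_r(xs, ys)
    by_group = defaultdict(lambda: ([], []))
    for x, y, g in zip(xs, ys, groups):
        by_group[g][0].append(x)
        by_group[g][1].append(y)
    within = {g: pearson_r(gx, gy) for g, (gx, gy) in by_group.items()}
    # A subgroup "reverses" when its sign is opposite to the aggregate sign.
    reversed_groups = [g for g, r in within.items() if r * overall < 0]
    return overall, within, reversed_groups


def flag_leaky_features(features, target, threshold=0.95):
    """Names of features whose |r| with the target exceeds the threshold."""
    return [
        name
        for name, values in features.items()
        if abs(pearson_r(values, target)) > threshold
    ]


# Hypothetical data: y falls with x inside each group, but group B sits
# higher on both axes, so the aggregate trend is positive.
xs = [1, 2, 3, 6, 7, 8]
ys = [10, 9, 8, 20, 19, 18]
groups = ["A", "A", "A", "B", "B", "B"]
overall, within, reversed_groups = simpson_check(xs, ys, groups)

# Hypothetical churn data: refund_amount is an effect of churn, not a cause.
churn = [0, 0, 1, 1, 0, 1]
features = {
    "refund_amount": [0.0, 0.0, 25.0, 30.0, 0.0, 27.5],
    "tenure_months": [12, 30, 8, 20, 26, 15],
}
flagged = flag_leaky_features(features, churn)
```

A flagged feature is a prompt for causal review, not an automatic drop. Run both checks on the training split only.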

Output Format

EDA Report: {Dataset Name}

Dataset Overview

  • Rows: {N}, Columns: {N}
  • Date range: {if applicable}
  • Key columns: {description}

Data Quality

| Issue | Columns Affected | Count/% | Action |
|---|---|---|---|
| Missing values | {cols} | {N / %} | {drop / impute / investigate} |
| Outliers | {cols} | {N} | {cap / remove / keep} |
| Duplicates | | {N} | {remove} |

Key Statistics

| Variable | Mean | Median | Std | Min | Max | Distribution |
|---|---|---|---|---|---|---|
| {var} | ... | ... | ... | ... | ... | {normal/skewed/bimodal} |

Key Findings

  1. {insight with supporting data}
  2. {insight}
  3. {insight}

Recommendations

  • {next analysis step or data issue to resolve}

Gotchas

  • Correlation ≠ causation: EDA finds associations. Establishing causation requires controlled experiments or causal inference methods.
  • Outliers can be data errors OR real signal: Don't auto-remove. Investigate. A transaction amount of $1M might be a typo or your biggest customer.
  • Missing data has meaning: Data missing from one column may be related to values in another. "Missing income" may mean "unemployed", not random. Check patterns.
  • Visualization lies: Truncated Y-axes, cherry-picked time ranges, and misleading scales can distort insights. Always use appropriate scales and note limitations.
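
The "check patterns" advice for missing data can be made concrete by comparing the missing rate of one column across the values of another. The sketch below assumes rows are plain dicts with `None` marking a missing value; the function name `missing_rate_by_group` and the sample rows are hypothetical.

```python
from collections import defaultdict


def missing_rate_by_group(rows, column, group_column):
    """Fraction of rows where `column` is None, per value of `group_column`."""
    counts = defaultdict(lambda: [0, 0])  # group -> [missing, total]
    for row in rows:
        tally = counts[row[group_column]]
        tally[1] += 1
        if row.get(column) is None:
            tally[0] += 1
    return {group: missing / total for group, (missing, total) in counts.items()}


# Hypothetical survey rows: income is missing far more often for one group.
rows = [
    {"employment": "employed", "income": 52000},
    {"employment": "employed", "income": 61000},
    {"employment": "employed", "income": None},
    {"employment": "employed", "income": 48000},
    {"employment": "unemployed", "income": None},
    {"employment": "unemployed", "income": None},
    {"employment": "unemployed", "income": 0},
    {"employment": "unemployed", "income": None},
]

rates = missing_rate_by_group(rows, "income", "employment")
```

A large gap between groups suggests the values are not missing completely at random. It cannot by itself show whether missingness depends on the unobserved value (MNAR); that still needs domain knowledge.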

References

  • For missing data handling strategies, see
    references/missing-data.md