tabular-eda

Compare original and translation side by side

🇺🇸

Original

English

🇨🇳

Translation

Chinese

Tabular EDA — Done Right

表格数据EDA — 正确的做法

Whenever you get handed a new tabular dataset, stop. Do not jump straight to

XGBClassifier()

. Ten minutes of EDA will catch problems that would otherwise destroy your downstream model — target leakage, high-cardinality explosions, MAR missing data, non-linear features that Pearson correlation says are useless. This skill is the workflow.

每当拿到一份新的表格数据集时，先停下来。不要直接上手使用

XGBClassifier()

。花十分钟做EDA能发现那些会毁掉后续模型的问题——target leakage、高基数特征爆炸、MAR缺失数据、皮尔逊相关系数认为无用的非线性特征。本技能就是对应的工作流。

When to use this skill

何时使用本技能

You just received a new dataset and have no idea what's in it
You're about to train a model and want to validate the data first
A model is performing suspiciously well (or poorly) and you suspect a data quality issue
The user asks "what should I do with this dataset?"

刚收到一份新数据集，完全不清楚其中内容
即将训练模型，想要先验证数据质量
模型表现异常好（或差），怀疑存在数据质量问题
用户询问“我该如何处理这个数据集？”

When NOT to use this skill

何时不使用本技能

You already deeply know the dataset and have profiled it before
The dataset is image / text / audio / time-series — different rules
The user just wants a model trained, fast, and is OK with risk

已经非常熟悉该数据集，并且之前做过探查
数据集是图像/文本/音频/时间序列——需遵循不同规则
用户只想快速训练模型，愿意承担风险

The workflow

工作流

1. Load → shape, dtypes, memory
2. Identify the target → infer problem type (binary / multiclass / regression)
3. Missing data → per-column %, overall %, patterns
4. Numeric distributions → skew, outliers, scale mismatches
5. Categorical cardinality → flag high-cardinality (OHE explosion risk)
6. Near-constant features → flag and consider dropping
7. Redundant pairs → flag features with > 0.95 mutual correlation
8. **Target leakage detection** → flag features with > 0.95 |Pearson| to target
9. **Mutual information vs Pearson** → catch non-linear features Pearson misses
10. Optional: PCA / UMAP for low-dim visualization

The output is a findings report: a list of suspicious things, each with a feature name, the metric that flagged it, and a recommended action. Don't just print plots. A list of problems with names is what you act on.

1. 加载数据 → 查看形状、数据类型、内存占用
2. 确定目标列 → 推断问题类型（二分类/多分类/回归）
3. 缺失数据 → 按列统计占比、整体占比、缺失模式
4. 数值特征分布 → 偏度、异常值、尺度不匹配
5. 分类特征基数 → 标记高基数特征（独热编码爆炸风险）
6. 近常值特征 → 标记并考虑删除
7. 冗余特征对 → 标记互相关系数>0.95的特征对
8. **Target leakage检测** → 标记与目标列皮尔逊绝对值>0.95的特征
9. **互信息vs皮尔逊相关系数** → 捕捉皮尔逊无法发现的非线性特征
10. 可选：PCA/UMAP降维可视化

输出是一份发现报告：列出所有可疑问题，每个问题包含特征名称、触发标记的指标，以及建议操作。不要只输出图表。带有名称的问题列表才是需要采取行动的依据。

Five things that separate this from a tutorial

与普通教程的五大区别

1. Target leakage detection — the single most valuable EDA check

1. Target leakage检测 —— EDA中最有价值的检查

A "leakage" feature is one that contains information about the target that wouldn't actually be available at prediction time. The classic examples:

account_balance_after_payment

for predicting

made_payment

```
total_charges
```
(cumulative) for predicting
```
churned
```
```
claim_paid_amount
```
for predicting
```
claim_was_filed
```

These features are computed after the prediction time. Train on them and you get 99% test accuracy and a model that completely fails in production. The signature is suspiciously high correlation with the target — anything > 0.95 is a leak suspect, anything > 0.99 is almost certainly a leak.

python

def find_leakage_candidates(df, target_col, numeric_cols, threshold=0.95):
    out = []
    for col in numeric_cols:
        if col == target_col:
            continue
        corr = float(df[[col, target_col]].dropna().corr().iloc[0, 1])
        if abs(corr) > threshold:
            out.append({"feature": col, "pearson": round(corr, 4)})
    return out

When you find a leakage candidate, always confirm with the data owner before dropping it. Sometimes a feature is legitimately almost perfectly correlated with the target (e.g. an upstream model's prediction). But the default assumption is "this is a leak."

“泄露”特征是指包含了预测时实际无法获取的目标信息的特征。经典例子：

用于预测
```
made_payment
```
（是否付款）的
```
account_balance_after_payment
```
（付款后账户余额）
用于预测
```
churned
```
（是否流失）的
```
total_charges
```
（累计费用）
用于预测
```
claim_was_filed
```
（是否提交理赔）的
```
claim_paid_amount
```
（理赔支付金额）

这些特征是在预测时间点之后计算的。用它们训练模型会得到99%的测试准确率，但模型在生产环境中完全失效。这类特征的标志是与目标列的相关系数异常高——任何>0.95的值都是疑似泄露，>0.99则几乎可以确定是泄露。

python

def find_leakage_candidates(df, target_col, numeric_cols, threshold=0.95):
    out = []
    for col in numeric_cols:
        if col == target_col:
            continue
        corr = float(df[[col, target_col]].dropna().corr().iloc[0, 1])
        if abs(corr) > threshold:
            out.append({"feature": col, "pearson": round(corr, 4)})
    return out

当发现疑似泄露的特征时，务必先与数据所有者确认再删除。有时某个特征确实与目标列高度相关（例如上游模型的预测结果）。但默认假设应为“这是一个泄露特征”。

2. Mutual information vs Pearson — catch non-linear signal

2. 互信息vs皮尔逊相关系数 —— 捕捉非线性信号

Pearson correlation only catches linear relationships. A feature that drives the target via

sin(x)

(x - 0.5)²

will have Pearson ≈ 0 and Pearson alone will mark it as useless. Mutual information catches both.

python

from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

皮尔逊相关系数只能捕捉线性关系。通过

sin(x)

或

(x - 0.5)²

驱动目标的特征，其皮尔逊相关系数≈0，仅用皮尔逊会被标记为无用。而互信息能同时捕捉线性和非线性关系。

python

from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

For classification

针对分类任务

mi_scores = mutual_info_classif(X, y, random_state=0)

For regression

针对回归任务

mi_scores = mutual_info_regression(X, y, random_state=0)


Plot |Pearson| and MI side by side as a bar chart. Features where
**MI is high but |Pearson| is low** are non-linear signal hiding from
your linear EDA. They'll be invisible to a linear model and powerful
in XGBoost.

This is the same lesson the regression bundle teaches with Friedman1's
`sin(π·x₀·x₁)` term: zero linear correlation, large mutual information,
huge contribution to the target.

mi_scores = mutual_info_regression(X, y, random_state=0)


将|皮尔逊相关系数|和互信息绘制成并排柱状图。那些**互信息高但|皮尔逊相关系数|低**的特征，就是线性EDA无法发现的非线性信号。它们对线性模型不可见，但对XGBoost模型非常有用。

这与回归教程中Friedman1的`sin(π·x₀·x₁)`项所传达的道理一致：线性相关为零，但互信息很高，对目标值的贡献巨大。

3. High-cardinality categorical detection

3. 高基数分类特征检测

A column like

user_id

with thousands of unique values will explode a

OneHotEncoder

into thousands of sparse columns. Flag any categorical with > 50 unique values:

python

def find_high_cardinality(df, cat_cols, threshold=50):
    return [
        {"feature": c, "n_unique": int(df[c].nunique())}
        for c in cat_cols if df[c].nunique() > threshold
    ]

Recommended action for high-cardinality categoricals:

Target encoding (smoothed mean of the target per category) — works well, but leaks during cross-validation if you're careless
Frequency encoding — replace each category with its frequency
Hash encoding — fixed-size hash buckets
Just drop it —
```
user_id
```
is rarely a useful feature anyway

像

user_id

这样包含数千个唯一值的列，会让

OneHotEncoder

生成数千个稀疏列。标记任何唯一值数量>50的分类特征：

python

def find_high_cardinality(df, cat_cols, threshold=50):
    return [
        {"feature": c, "n_unique": int(df[c].nunique())}
        for c in cat_cols if df[c].nunique() > threshold
    ]

针对高基数分类特征的建议操作：

目标编码（每个类别对应的目标值平滑均值）——效果好，但如果不小心会在交叉验证中造成泄露
频率编码——用类别出现的频率替换该类别
哈希编码——固定大小的哈希桶
直接删除——
```
user_id
```
很少是有用的特征

4. Near-constant feature detection

4. 近常值特征检测

A column where one value covers > 98% of the rows has essentially no signal. It's not always wrong to keep it (some signal beats no signal), but it's often indicative of a data collection issue and worth flagging:

python

def find_near_constant(df, threshold=0.98):
    return [
        {"feature": c, "top_value_freq": float(df[c].value_counts(normalize=True).iloc[0])}
        for c in df.columns
        if df[c].value_counts(normalize=True).iloc[0] > threshold
    ]

某个值覆盖>98%行的列基本没有信号。保留它并非总是错误（有信号总比没信号好），但这通常表明存在数据收集问题，值得标记：

python

def find_near_constant(df, threshold=0.98):
    return [
        {"feature": c, "top_value_freq": float(df[c].value_counts(normalize=True).iloc[0])}
        for c in df.columns
        if df[c].value_counts(normalize=True).iloc[0] > threshold
    ]

5. Redundant feature detection

5. 冗余特征检测

Features with mutual correlation > 0.95 carry the same information. Drop one of each pair to reduce multicollinearity (which messes up linear models more than tree models, but is still wasted compute):

python

def find_redundant_pairs(df, numeric_cols, threshold=0.95):
    corr = df[numeric_cols].corr().abs()
    out = []
    for i, c1 in enumerate(numeric_cols):
        for c2 in numeric_cols[i + 1:]:
            if float(corr.loc[c1, c2]) > threshold:
                out.append({"pair": [c1, c2], "pearson": float(corr.loc[c1, c2])})
    return out

互相关系数>0.95的特征携带相同信息。每对中删除一个以减少多重共线性（对线性模型的影响比对树模型大，但仍会浪费计算资源）：

python

def find_redundant_pairs(df, numeric_cols, threshold=0.95):
    corr = df[numeric_cols].corr().abs()
    out = []
    for i, c1 in enumerate(numeric_cols):
        for c2 in numeric_cols[i + 1:]:
            if float(corr.loc[c1, c2]) > threshold:
                out.append({"pair": [c1, c2], "pearson": float(corr.loc[c1, c2])})
    return out

Visual checks (always include in the output)

可视化检查（输出中务必包含）

These six plots together answer "what's in this data?" in 30 seconds:

Missing data bar chart — sorted by % missing, descending
Numeric distributions grid — histograms with skew annotated
Categorical cardinality bar chart — red bars > 50 unique values
Correlation heatmap — feature × target Pearson, with target column annotated with the actual numbers
Mutual info vs Pearson side-by-side bar chart — surfaces the non-linear signal Pearson misses
Outlier box plots — per-numeric-column with IQR-based outlier counts

Output a findings.json file alongside the plots. Each finding has a feature name, the metric that flagged it, and a recommended action. The list is what gets actioned; the plots are the supporting evidence.

以下六个图表能在30秒内回答“这份数据里有什么？”：

缺失数据柱状图——按缺失占比降序排列
数值特征分布网格——带偏度标注的直方图
分类特征基数柱状图——唯一值>50的特征用红色标注
相关系数热力图——特征与目标列的皮尔逊相关系数，目标列标注具体数值
互信息vs皮尔逊相关系数并排柱状图——凸显皮尔逊无法发现的非线性信号
异常值箱线图——每个数值特征基于IQR的异常值计数

在图表旁输出findings.json文件。每个发现包含特征名称、触发标记的指标，以及建议操作。列表是行动依据；图表是支撑证据。

Type inference for the target

目标列类型推断

Before any modeling, infer the target type heuristically:

python

def infer_target_type(y):
    if y.dtype.kind in "biu":  # bool / int
        n_unique = y.nunique()
        if n_unique == 2:
            return "binary"
        if n_unique <= 20:
            return "multiclass"
        return "regression"
    if y.dtype.kind == "f":
        return "regression"
    return "categorical"

This tells you which downstream skill to invoke next:

```
binary
```
→ binary-classification skill
```
multiclass
```
→ multiclass-classification skill
```
regression
```
→ regression skill
```
categorical
```
(no obvious target) → unsupervised skill

建模前，通过启发式方法推断目标列类型：

python

def infer_target_type(y):
    if y.dtype.kind in "biu":  # bool / int
        n_unique = y.nunique()
        if n_unique == 2:
            return "binary"
        if n_unique <= 20:
            return "multiclass"
        return "regression"
    if y.dtype.kind == "f":
        return "regression"
    return "categorical"

这会告诉你接下来应该调用哪个技能：

```
binary
```
→ 二分类技能
```
multiclass
```
→ 多分类技能
```
regression
```
→ 回归技能
```
categorical
```
（无明确目标）→ 无监督技能

Common pitfalls

常见误区

Skipping EDA entirely. "I'll just throw it at XGBoost." This is how target leakage and 99%-test-accuracy-then-broken-in-prod happen.
Pearson-only correlation. Misses sin / quadratic / categorical relationships. Always pair with mutual information.
Dropping a "leakage" feature without confirming with the data owner. Sometimes the feature is legitimate (an upstream model's prediction). Confirm before deleting.
OneHotEncoding a high-cardinality categorical. Explodes feature count, drowns the model in noise, slows training. Use target encoding, frequency encoding, or just drop the column.
Not checking for duplicates. A dataset with 50% duplicate rows will show inflated test metrics if duplicates land in both train and test.
Imputing missing values without thinking. Fill with the median for MCAR, but for MAR/MNAR you may need to model the missingness itself. Always flag the missingness pattern; don't silently impute.
Treating the EDA report as ephemeral. Log the findings JSON and the plots to MLflow (or wherever your experiment tracker lives). When a model fails six months later, you want to be able to look at the EDA report from when the data was first profiled.

完全跳过EDA。“我直接用XGBoost就行。”这就是target leakage和“测试准确率99%但生产环境失效”问题的根源。
仅依赖皮尔逊相关系数。会错过正弦/二次/分类关系。务必搭配互信息使用。
未与数据所有者确认就删除“泄露”特征。有时该特征是合法的（例如上游模型的预测结果）。删除前请确认。
对高基数分类特征使用独热编码。会导致特征数量爆炸，让模型淹没在噪声中，减慢训练速度。使用目标编码、频率编码，或直接删除该列。
不检查重复数据。如果重复行同时出现在训练集和测试集中，数据集有50%重复行会导致测试指标虚高。
不假思索地填充缺失值。MCAR（完全随机缺失）用中位数填充，但MAR/MNAR（随机缺失/非随机缺失）可能需要对缺失本身建模。务必标记缺失模式；不要静默填充。
将EDA报告视为临时文件。将findings.json和图表记录到MLflow（或你的实验跟踪工具）中。当模型六个月后失效时，你需要查看首次探查数据时的EDA报告。

Worked example

示例演示

See

demo.py

(marimo notebook). It generates a deliberately messy synthetic binary classification dataset with seven planted issues (target leakage, high-cardinality categorical, near-constant feature, 30% missing data, log-normal skew, 2% outliers, redundant pair) and walks through detecting each one. The notebook ends with a findings table summarizing what the EDA pipeline caught — and that table is the input to "what model do I train next?"

查看

demo.py

（marimo笔记本）。它生成了一个刻意设置了七个问题的合成二分类数据集（target leakage、高基数分类特征、近常值特征、30%缺失数据、对数正态偏度、2%异常值、冗余特征对），并逐步演示如何检测每个问题。笔记本最后会生成一个发现表格，总结EDA pipeline捕捉到的问题——该表格就是“接下来我该训练什么模型？”的输入依据。

After EDA: what to do next

EDA之后：下一步做什么

Based on the findings, decide:

Drop: leakage features, near-constant features, one of each redundant pair
Encode: high-cardinality categoricals via target/frequency/hash
Impute: missing data (median for numeric, "missing" sentinel for categorical)
Transform: skewed features (log, Box-Cox), outliers (winsorize or robust scaler)
Then: invoke the appropriate problem-type skill (binary-classification, regression, multiclass-classification, etc.)

根据发现结果，决定：

删除：泄露特征、近常值特征、每对冗余特征中的一个
编码：对高基数分类特征使用目标/频率/哈希编码
填充：缺失数据（数值特征用中位数，分类特征用“missing”标记值）
转换：偏态特征（对数、Box-Cox转换）、异常值（缩尾或鲁棒缩放）
然后：调用对应问题类型的技能（二分类、回归、多分类等）