tabular-eda

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese
<!-- Bundled files (accessible via ${CLAUDE_SKILL_DIR}): - SKILL.md — this file - scripts/demo.py — runnable marimo notebook with worked example -->
<!-- 打包文件(可通过${CLAUDE_SKILL_DIR}访问): - SKILL.md — 本文件 - scripts/demo.py — 可运行的marimo笔记本,包含示例演示 -->

Tabular EDA — Done Right

表格数据EDA — 正确的做法

Whenever you get handed a new tabular dataset, stop. Do not jump straight to
XGBClassifier()
. Ten minutes of EDA will catch problems that would otherwise destroy your downstream model — target leakage, high-cardinality explosions, MAR missing data, non-linear features that Pearson correlation says are useless. This skill is the workflow.
每当拿到一份新的表格数据集时,先停下来。不要直接上手使用
XGBClassifier()
。花十分钟做EDA能发现那些会毁掉后续模型的问题——target leakage、高基数特征爆炸、MAR缺失数据、皮尔逊相关系数认为无用的非线性特征。本技能就是对应的工作流。

When to use this skill

何时使用本技能

  • You just received a new dataset and have no idea what's in it
  • You're about to train a model and want to validate the data first
  • A model is performing suspiciously well (or poorly) and you suspect a data quality issue
  • The user asks "what should I do with this dataset?"
  • 刚收到一份新数据集,完全不清楚其中内容
  • 即将训练模型,想要先验证数据质量
  • 模型表现异常好(或差),怀疑存在数据质量问题
  • 用户询问“我该如何处理这个数据集?”

When NOT to use this skill

何时不使用本技能

  • You already deeply know the dataset and have profiled it before
  • The dataset is image / text / audio / time-series — different rules
  • The user just wants a model trained, fast, and is OK with risk
  • 已经非常熟悉该数据集,并且之前做过探查
  • 数据集是图像/文本/音频/时间序列——需遵循不同规则
  • 用户只想快速训练模型,愿意承担风险

The workflow

工作流

1. Load → shape, dtypes, memory
2. Identify the target → infer problem type (binary / multiclass / regression)
3. Missing data → per-column %, overall %, patterns
4. Numeric distributions → skew, outliers, scale mismatches
5. Categorical cardinality → flag high-cardinality (OHE explosion risk)
6. Near-constant features → flag and consider dropping
7. Redundant pairs → flag features with > 0.95 mutual correlation
8. **Target leakage detection** → flag features with > 0.95 |Pearson| to target
9. **Mutual information vs Pearson** → catch non-linear features Pearson misses
10. Optional: PCA / UMAP for low-dim visualization
The output is a findings report: a list of suspicious things, each with a feature name, the metric that flagged it, and a recommended action. Don't just print plots. A list of problems with names is what you act on.
1. 加载数据 → 查看形状、数据类型、内存占用
2. 确定目标列 → 推断问题类型(二分类/多分类/回归)
3. 缺失数据 → 按列统计占比、整体占比、缺失模式
4. 数值特征分布 → 偏度、异常值、尺度不匹配
5. 分类特征基数 → 标记高基数特征(独热编码爆炸风险)
6. 近常值特征 → 标记并考虑删除
7. 冗余特征对 → 标记互相关系数>0.95的特征对
8. **Target leakage检测** → 标记与目标列皮尔逊绝对值>0.95的特征
9. **互信息vs皮尔逊相关系数** → 捕捉皮尔逊无法发现的非线性特征
10. 可选:PCA/UMAP降维可视化
输出是一份发现报告:列出所有可疑问题,每个问题包含特征名称、触发标记的指标,以及建议操作。不要只输出图表。带有名称的问题列表才是需要采取行动的依据。

Five things that separate this from a tutorial

与普通教程的五大区别

1. Target leakage detection — the single most valuable EDA check

1. Target leakage检测 —— EDA中最有价值的检查

A "leakage" feature is one that contains information about the target that wouldn't actually be available at prediction time. The classic examples:
  • account_balance_after_payment
    for predicting
    made_payment
  • total_charges
    (cumulative) for predicting
    churned
  • claim_paid_amount
    for predicting
    claim_was_filed
These features are computed after the prediction time. Train on them and you get 99% test accuracy and a model that completely fails in production. The signature is suspiciously high correlation with the target — anything > 0.95 is a leak suspect, anything > 0.99 is almost certainly a leak.
python
def find_leakage_candidates(df, target_col, numeric_cols, threshold=0.95):
    out = []
    for col in numeric_cols:
        if col == target_col:
            continue
        corr = float(df[[col, target_col]].dropna().corr().iloc[0, 1])
        if abs(corr) > threshold:
            out.append({"feature": col, "pearson": round(corr, 4)})
    return out
When you find a leakage candidate, always confirm with the data owner before dropping it. Sometimes a feature is legitimately almost perfectly correlated with the target (e.g. an upstream model's prediction). But the default assumption is "this is a leak."
“泄露”特征是指包含了预测时实际无法获取的目标信息的特征。经典例子:
  • 用于预测
    made_payment
    (是否付款)的
    account_balance_after_payment
    (付款后账户余额)
  • 用于预测
    churned
    (是否流失)的
    total_charges
    (累计费用)
  • 用于预测
    claim_was_filed
    (是否提交理赔)的
    claim_paid_amount
    (理赔支付金额)
这些特征是在预测时间点之后计算的。用它们训练模型会得到99%的测试准确率,但模型在生产环境中完全失效。这类特征的标志是与目标列的相关系数异常高——任何>0.95的值都是疑似泄露,>0.99则几乎可以确定是泄露。
python
def find_leakage_candidates(df, target_col, numeric_cols, threshold=0.95):
    out = []
    for col in numeric_cols:
        if col == target_col:
            continue
        corr = float(df[[col, target_col]].dropna().corr().iloc[0, 1])
        if abs(corr) > threshold:
            out.append({"feature": col, "pearson": round(corr, 4)})
    return out
当发现疑似泄露的特征时,务必先与数据所有者确认再删除。有时某个特征确实与目标列高度相关(例如上游模型的预测结果)。但默认假设应为“这是一个泄露特征”。

2. Mutual information vs Pearson — catch non-linear signal

2. 互信息vs皮尔逊相关系数 —— 捕捉非线性信号

Pearson correlation only catches linear relationships. A feature that drives the target via
sin(x)
or
(x - 0.5)²
will have Pearson ≈ 0 and Pearson alone will mark it as useless. Mutual information catches both.
python
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression
皮尔逊相关系数只能捕捉线性关系。通过
sin(x)
(x - 0.5)²
驱动目标的特征,其皮尔逊相关系数≈0,仅用皮尔逊会被标记为无用。而互信息能同时捕捉线性和非线性关系。
python
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

For classification

针对分类任务

mi_scores = mutual_info_classif(X, y, random_state=0)
mi_scores = mutual_info_classif(X, y, random_state=0)

For regression

针对回归任务

mi_scores = mutual_info_regression(X, y, random_state=0)

Plot |Pearson| and MI side by side as a bar chart. Features where
**MI is high but |Pearson| is low** are non-linear signal hiding from
your linear EDA. They'll be invisible to a linear model and powerful
in XGBoost.

This is the same lesson the regression bundle teaches with Friedman1's
`sin(π·x₀·x₁)` term: zero linear correlation, large mutual information,
huge contribution to the target.
mi_scores = mutual_info_regression(X, y, random_state=0)

将|皮尔逊相关系数|和互信息绘制成并排柱状图。那些**互信息高但|皮尔逊相关系数|低**的特征,就是线性EDA无法发现的非线性信号。它们对线性模型不可见,但对XGBoost模型非常有用。

这与回归教程中Friedman1的`sin(π·x₀·x₁)`项所传达的道理一致:线性相关为零,但互信息很高,对目标值的贡献巨大。

3. High-cardinality categorical detection

3. 高基数分类特征检测

A column like
user_id
with thousands of unique values will explode a
OneHotEncoder
into thousands of sparse columns. Flag any categorical with > 50 unique values:
python
def find_high_cardinality(df, cat_cols, threshold=50):
    return [
        {"feature": c, "n_unique": int(df[c].nunique())}
        for c in cat_cols if df[c].nunique() > threshold
    ]
Recommended action for high-cardinality categoricals:
  • Target encoding (smoothed mean of the target per category) — works well, but leaks during cross-validation if you're careless
  • Frequency encoding — replace each category with its frequency
  • Hash encoding — fixed-size hash buckets
  • Just drop it
    user_id
    is rarely a useful feature anyway
user_id
这样包含数千个唯一值的列,会让
OneHotEncoder
生成数千个稀疏列。标记任何唯一值数量>50的分类特征:
python
def find_high_cardinality(df, cat_cols, threshold=50):
    return [
        {"feature": c, "n_unique": int(df[c].nunique())}
        for c in cat_cols if df[c].nunique() > threshold
    ]
针对高基数分类特征的建议操作:
  • 目标编码(每个类别对应的目标值平滑均值)——效果好,但如果不小心会在交叉验证中造成泄露
  • 频率编码——用类别出现的频率替换该类别
  • 哈希编码——固定大小的哈希桶
  • 直接删除——
    user_id
    很少是有用的特征

4. Near-constant feature detection

4. 近常值特征检测

A column where one value covers > 98% of the rows has essentially no signal. It's not always wrong to keep it (some signal beats no signal), but it's often indicative of a data collection issue and worth flagging:
python
def find_near_constant(df, threshold=0.98):
    return [
        {"feature": c, "top_value_freq": float(df[c].value_counts(normalize=True).iloc[0])}
        for c in df.columns
        if df[c].value_counts(normalize=True).iloc[0] > threshold
    ]
某个值覆盖>98%行的列基本没有信号。保留它并非总是错误(有信号总比没信号好),但这通常表明存在数据收集问题,值得标记:
python
def find_near_constant(df, threshold=0.98):
    return [
        {"feature": c, "top_value_freq": float(df[c].value_counts(normalize=True).iloc[0])}
        for c in df.columns
        if df[c].value_counts(normalize=True).iloc[0] > threshold
    ]

5. Redundant feature detection

5. 冗余特征检测

Features with mutual correlation > 0.95 carry the same information. Drop one of each pair to reduce multicollinearity (which messes up linear models more than tree models, but is still wasted compute):
python
def find_redundant_pairs(df, numeric_cols, threshold=0.95):
    corr = df[numeric_cols].corr().abs()
    out = []
    for i, c1 in enumerate(numeric_cols):
        for c2 in numeric_cols[i + 1:]:
            if float(corr.loc[c1, c2]) > threshold:
                out.append({"pair": [c1, c2], "pearson": float(corr.loc[c1, c2])})
    return out
互相关系数>0.95的特征携带相同信息。每对中删除一个以减少多重共线性(对线性模型的影响比对树模型大,但仍会浪费计算资源):
python
def find_redundant_pairs(df, numeric_cols, threshold=0.95):
    corr = df[numeric_cols].corr().abs()
    out = []
    for i, c1 in enumerate(numeric_cols):
        for c2 in numeric_cols[i + 1:]:
            if float(corr.loc[c1, c2]) > threshold:
                out.append({"pair": [c1, c2], "pearson": float(corr.loc[c1, c2])})
    return out

Visual checks (always include in the output)

可视化检查(输出中务必包含)

These six plots together answer "what's in this data?" in 30 seconds:
  1. Missing data bar chart — sorted by % missing, descending
  2. Numeric distributions grid — histograms with skew annotated
  3. Categorical cardinality bar chart — red bars > 50 unique values
  4. Correlation heatmap — feature × target Pearson, with target column annotated with the actual numbers
  5. Mutual info vs Pearson side-by-side bar chart — surfaces the non-linear signal Pearson misses
  6. Outlier box plots — per-numeric-column with IQR-based outlier counts
Output a findings.json file alongside the plots. Each finding has a feature name, the metric that flagged it, and a recommended action. The list is what gets actioned; the plots are the supporting evidence.
以下六个图表能在30秒内回答“这份数据里有什么?”:
  1. 缺失数据柱状图——按缺失占比降序排列
  2. 数值特征分布网格——带偏度标注的直方图
  3. 分类特征基数柱状图——唯一值>50的特征用红色标注
  4. 相关系数热力图——特征与目标列的皮尔逊相关系数,目标列标注具体数值
  5. 互信息vs皮尔逊相关系数并排柱状图——凸显皮尔逊无法发现的非线性信号
  6. 异常值箱线图——每个数值特征基于IQR的异常值计数
在图表旁输出findings.json文件。每个发现包含特征名称、触发标记的指标,以及建议操作。列表是行动依据;图表是支撑证据。

Type inference for the target

目标列类型推断

Before any modeling, infer the target type heuristically:
python
def infer_target_type(y):
    if y.dtype.kind in "biu":  # bool / int
        n_unique = y.nunique()
        if n_unique == 2:
            return "binary"
        if n_unique <= 20:
            return "multiclass"
        return "regression"
    if y.dtype.kind == "f":
        return "regression"
    return "categorical"
This tells you which downstream skill to invoke next:
  • binary
    → binary-classification skill
  • multiclass
    → multiclass-classification skill
  • regression
    → regression skill
  • categorical
    (no obvious target) → unsupervised skill
建模前,通过启发式方法推断目标列类型:
python
def infer_target_type(y):
    if y.dtype.kind in "biu":  # bool / int
        n_unique = y.nunique()
        if n_unique == 2:
            return "binary"
        if n_unique <= 20:
            return "multiclass"
        return "regression"
    if y.dtype.kind == "f":
        return "regression"
    return "categorical"
这会告诉你接下来应该调用哪个技能:
  • binary
    → 二分类技能
  • multiclass
    → 多分类技能
  • regression
    → 回归技能
  • categorical
    (无明确目标)→ 无监督技能

Common pitfalls

常见误区

  1. Skipping EDA entirely. "I'll just throw it at XGBoost." This is how target leakage and 99%-test-accuracy-then-broken-in-prod happen.
  2. Pearson-only correlation. Misses sin / quadratic / categorical relationships. Always pair with mutual information.
  3. Dropping a "leakage" feature without confirming with the data owner. Sometimes the feature is legitimate (an upstream model's prediction). Confirm before deleting.
  4. OneHotEncoding a high-cardinality categorical. Explodes feature count, drowns the model in noise, slows training. Use target encoding, frequency encoding, or just drop the column.
  5. Not checking for duplicates. A dataset with 50% duplicate rows will show inflated test metrics if duplicates land in both train and test.
  6. Imputing missing values without thinking. Fill with the median for MCAR, but for MAR/MNAR you may need to model the missingness itself. Always flag the missingness pattern; don't silently impute.
  7. Treating the EDA report as ephemeral. Log the findings JSON and the plots to MLflow (or wherever your experiment tracker lives). When a model fails six months later, you want to be able to look at the EDA report from when the data was first profiled.
  1. 完全跳过EDA。“我直接用XGBoost就行。”这就是target leakage和“测试准确率99%但生产环境失效”问题的根源。
  2. 仅依赖皮尔逊相关系数。会错过正弦/二次/分类关系。务必搭配互信息使用。
  3. 未与数据所有者确认就删除“泄露”特征。有时该特征是合法的(例如上游模型的预测结果)。删除前请确认。
  4. 对高基数分类特征使用独热编码。会导致特征数量爆炸,让模型淹没在噪声中,减慢训练速度。使用目标编码、频率编码,或直接删除该列。
  5. 不检查重复数据。如果重复行同时出现在训练集和测试集中,数据集有50%重复行会导致测试指标虚高。
  6. 不假思索地填充缺失值。MCAR(完全随机缺失)用中位数填充,但MAR/MNAR(随机缺失/非随机缺失)可能需要对缺失本身建模。务必标记缺失模式;不要静默填充。
  7. 将EDA报告视为临时文件。将findings.json和图表记录到MLflow(或你的实验跟踪工具)中。当模型六个月后失效时,你需要查看首次探查数据时的EDA报告。

Worked example

示例演示

See
demo.py
(marimo notebook). It generates a deliberately messy synthetic binary classification dataset with seven planted issues (target leakage, high-cardinality categorical, near-constant feature, 30% missing data, log-normal skew, 2% outliers, redundant pair) and walks through detecting each one. The notebook ends with a findings table summarizing what the EDA pipeline caught — and that table is the input to "what model do I train next?"
查看
demo.py
(marimo笔记本)。它生成了一个刻意设置了七个问题的合成二分类数据集(target leakage、高基数分类特征、近常值特征、30%缺失数据、对数正态偏度、2%异常值、冗余特征对),并逐步演示如何检测每个问题。笔记本最后会生成一个发现表格,总结EDA pipeline捕捉到的问题——该表格就是“接下来我该训练什么模型?”的输入依据。

After EDA: what to do next

EDA之后:下一步做什么

Based on the findings, decide:
  • Drop: leakage features, near-constant features, one of each redundant pair
  • Encode: high-cardinality categoricals via target/frequency/hash
  • Impute: missing data (median for numeric, "missing" sentinel for categorical)
  • Transform: skewed features (log, Box-Cox), outliers (winsorize or robust scaler)
  • Then: invoke the appropriate problem-type skill (binary-classification, regression, multiclass-classification, etc.)
根据发现结果,决定:
  • 删除:泄露特征、近常值特征、每对冗余特征中的一个
  • 编码:对高基数分类特征使用目标/频率/哈希编码
  • 填充:缺失数据(数值特征用中位数,分类特征用“missing”标记值)
  • 转换:偏态特征(对数、Box-Cox转换)、异常值(缩尾或鲁棒缩放)
  • 然后:调用对应问题类型的技能(二分类、回归、多分类等)