data-science-eda

Exploratory Data Analysis (EDA)

Use this skill for understanding datasets before modeling: profiling distributions, detecting anomalies, identifying relationships, and assessing data quality.

When to use this skill

  • New dataset: get oriented on structure, types, and distributions
  • Before feature engineering: understand how variables relate to each other
  • Data quality investigation: find anomalies, missing-value patterns, and outliers
  • Model preparation: validate assumptions about the data

Core EDA workflow

  1. Profile structure
    • Schema, types, cardinality
    • Missing value patterns
  2. Analyze distributions
    • Numerical: histograms, boxplots, skewness
    • Categorical: frequencies, rare categories
  3. Explore relationships
    • Correlation matrix (numerical)
    • Cross-tabulations (categorical)
    • Target-variable relationships
  4. Identify issues
    • Outliers, duplicates, inconsistencies
    • Class imbalance (classification)
    • Temporal patterns (time series)
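
As a rough illustration of steps 1 and 2 above, the sketch below profiles structure and distributions with plain pandas. The helper names (`profile_structure`, `profile_numeric`) and the toy data are hypothetical and not part of this skill; treat it as one possible starting point rather than a prescribed implementation.

```python
import numpy as np
import pandas as pd


def profile_structure(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize dtype, cardinality, and missingness for each column."""
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "n_unique": df.nunique(dropna=True),
            "missing_pct": df.isna().mean() * 100,
        }
    )


def profile_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Describe numeric columns and append their skewness."""
    numeric = df.select_dtypes(include="number")
    summary = numeric.describe().T
    summary["skew"] = numeric.skew()
    return summary


# Hypothetical toy data, for illustration only.
df = pd.DataFrame(
    {
        "age": [23, 35, 41, np.nan, 29, 35],
        "income": [32_000, 54_000, 61_000, 48_000, 250_000, 54_000],
        "segment": ["a", "b", "b", "a", "c", "b"],
    }
)

structure = profile_structure(df)            # step 1: schema, cardinality, missingness
numeric_summary = profile_numeric(df)        # step 2: numerical distributions
segment_freq = df["segment"].value_counts(normalize=True)  # step 2: categorical frequencies
n_duplicate_rows = int(df.duplicated().sum())              # step 4: exact duplicate rows

print(structure)
print(numeric_summary[["mean", "50%", "skew"]])
```

A large gap between `mean` and `50%` together with a high `skew` value (as for `income` here) is a common signal that a histogram or boxplot of that column is worth a closer look.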

Quick tool selection

| Task | Default choice | Notes |
|---|---|---|
| Automated profiling | ydata-profiling (formerly pandas-profiling) | Fast, comprehensive reports |
| Interactive exploration | ipywidgets + plotly | Drill-down capability |
| Statistical tests | scipy.stats | Normality, correlations |
| Large datasets | Polars (lazy API) | Memory-efficient |
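
For the statistical-tests row, a minimal sketch of a normality check and two correlation measures with `scipy.stats` might look like the following. The synthetic data and variable names are assumptions made for illustration only.

```python
import numpy as np
from scipy import stats

# Synthetic data (illustrative only): x is right-skewed, y is a monotone,
# nonlinear function of x plus a little noise.
rng = np.random.default_rng(seed=0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=500)
y = x ** 2 + rng.normal(scale=0.01, size=500)

# Shapiro-Wilk: a small p-value is evidence against normality.
shapiro_stat, shapiro_p = stats.shapiro(x)

# Pearson measures linear association; Spearman measures monotone association.
pearson_r, _ = stats.pearsonr(x, y)
spearman_rho, _ = stats.spearmanr(x, y)

print(f"Shapiro-Wilk p-value: {shapiro_p:.3g}")
print(f"Pearson r: {pearson_r:.3f}, Spearman rho: {spearman_rho:.3f}")
```

Comparing Pearson and Spearman side by side is a cheap way to spot relationships that are strong but not linear. With large samples, normality tests reject for even small deviations, so the p-value is best read alongside a histogram or Q-Q plot.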

Core implementation rules

1) Start with automated profiling

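```python
import polars as pl
from ydata_profiling import ProfileReport

# Load the dataset with Polars
df = pl.read_parquet("data.parquet")

# ydata-profiling works on pandas DataFrames, so convert before profiling
profile = ProfileReport(df.to_pandas(), title="Data Profile")

# Write a standalone HTML report
profile.to_file("profile_report.html")
```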

2) Focus on actionable insights

  • Document outliers worth investigating (not all outliers are problems)
  • Flag features with high cardinality or rare categories
  • Note strong correlations that may cause multicollinearity
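
The three checks above can be sketched with pandas as follows. The function names, thresholds (`k=1.5`, `min_share=0.05`, `threshold=0.9`), and toy data are illustrative assumptions; suitable cutoffs depend on the dataset and the domain.

```python
import numpy as np
import pandas as pd


def iqr_outlier_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Mark values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)


def rare_categories(s: pd.Series, min_share: float = 0.05) -> list:
    """List categories whose relative frequency falls below min_share."""
    share = s.value_counts(normalize=True)
    return sorted(share[share < min_share].index)


def strong_correlations(df: pd.DataFrame, threshold: float = 0.9) -> list:
    """Return (col_a, col_b, r) for numeric column pairs with |r| >= threshold."""
    corr = df.select_dtypes(include="number").corr()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            r = corr.loc[a, b]
            if abs(r) >= threshold:
                pairs.append((a, b, round(float(r), 3)))
    return pairs


# Hypothetical toy data, for illustration only.
rng = np.random.default_rng(seed=1)
n = 200
height_cm = rng.normal(170, 10, n)
df = pd.DataFrame(
    {
        "height_cm": height_cm,
        "height_in": height_cm / 2.54,  # redundant copy of height_cm
        "spend": np.append(rng.normal(50, 5, n - 1), 500.0),  # one extreme value
        "plan": ["basic"] * 150 + ["pro"] * 45 + ["legacy"] * 5,
    }
)

outliers = df.loc[iqr_outlier_mask(df["spend"]), "spend"]  # candidates to investigate
rare = rare_categories(df["plan"])
pairs = strong_correlations(df)

print(outliers.tolist(), rare, pairs)
```

The output is a list of candidates to document and review, not a list of rows to delete: a flagged value may be a legitimate extreme rather than an error.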

3) Visualize for communication

  • Distribution plots for key variables
  • Correlation heatmap
  • Missing value patterns
  • Target relationship plots
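
One way to assemble these four views into a single overview figure is sketched below with Matplotlib. The column names, synthetic data, and output file name `eda_overview.png` are assumptions made for illustration.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the script also runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic data (illustrative only), with 10% of x2 set to missing.
rng = np.random.default_rng(seed=2)
df = pd.DataFrame(
    {
        "x1": rng.normal(size=300),
        "x2": rng.exponential(size=300),
    }
)
df["target"] = 2 * df["x1"] + rng.normal(scale=0.5, size=300)
df.loc[rng.choice(300, size=30, replace=False), "x2"] = np.nan

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# Distribution plot for a key variable
axes[0, 0].hist(df["x2"].dropna(), bins=30)
axes[0, 0].set_title("Distribution of x2")

# Correlation heatmap
corr = df.corr()
im = axes[0, 1].imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
axes[0, 1].set_xticks(range(len(corr.columns)))
axes[0, 1].set_xticklabels(corr.columns)
axes[0, 1].set_yticks(range(len(corr.columns)))
axes[0, 1].set_yticklabels(corr.columns)
axes[0, 1].set_title("Correlation heatmap")
fig.colorbar(im, ax=axes[0, 1])

# Share of missing values per column
missing_share = df.isna().mean()
axes[1, 0].bar(missing_share.index, missing_share.values)
axes[1, 0].set_title("Share of missing values")

# Relationship between a feature and the target
axes[1, 1].scatter(df["x1"], df["target"], s=8, alpha=0.5)
axes[1, 1].set_title("target vs x1")

fig.tight_layout()
fig.savefig("eda_overview.png", dpi=120)
```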

4) Validate assumptions

  • Check for expected ranges/business rules
  • Verify temporal consistency
  • Confirm key relationships match domain knowledge
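
Range, business-rule, and temporal checks like these can be written as named boolean masks, which makes the violating rows easy to list. The `orders` table and the specific rules below are hypothetical examples, not rules defined by this skill.

```python
import pandas as pd

# Hypothetical orders table, for illustration only.
orders = pd.DataFrame(
    {
        "order_id": [1, 2, 3, 4],
        "quantity": [2, 0, 5, -1],
        "ordered_at": pd.to_datetime(
            ["2024-01-03", "2024-01-05", "2024-01-04", "2024-01-09"]
        ),
        "shipped_at": pd.to_datetime(
            ["2024-01-04", "2024-01-06", "2024-01-02", "2024-01-10"]
        ),
    }
)

# Each check is a boolean Series that is True where the row satisfies the rule.
checks = {
    "quantity_positive": orders["quantity"] > 0,
    "shipped_after_ordered": orders["shipped_at"] >= orders["ordered_at"],
    "order_id_unique": ~orders["order_id"].duplicated(keep=False),
}

# Collect the order IDs that violate each rule.
violations = {
    name: orders.loc[~ok, "order_id"].tolist() for name, ok in checks.items()
}

# Temporal consistency: are rows stored in chronological order?
is_time_sorted = orders["ordered_at"].is_monotonic_increasing

print(violations)
print("sorted by ordered_at:", is_time_sorted)
```

Keeping the rules in a dictionary keeps the report readable as checks are added, and the same structure can later be handed to a dedicated validation framework.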

Common anti-patterns

  • ❌ Skipping EDA and jumping to modeling
  • ❌ Treating all outliers as errors
  • ❌ Ignoring missing value mechanisms (MCAR/MAR/MNAR)
  • ❌ Over-plotting large datasets without sampling
  • ❌ Not documenting findings for team
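
Two of these anti-patterns have simple first-line countermeasures, sketched below on simulated data: comparing the missing rate across groups of an observed variable, and sampling before plotting. The simulated missingness, band edges, and sample size are illustrative assumptions. A missing rate that varies with an observed variable is a hint that the data are not MCAR; it does not prove MAR, and MNAR cannot be ruled out from the observed data alone.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=3)
n = 100_000
df = pd.DataFrame({"age": rng.integers(18, 80, size=n)})
df["income"] = rng.normal(50_000, 10_000, size=n)

# Simulate missingness that depends on age: older respondents skip more often.
p_missing = np.where(df["age"] >= 60, 0.40, 0.05)
df.loc[rng.random(n) < p_missing, "income"] = np.nan

# Compare the missing rate of income across bands of an observed variable.
age_band = pd.cut(
    df["age"], bins=[17, 39, 59, 80], labels=["18-39", "40-59", "60+"]
)
missing_rate_by_age = df["income"].isna().groupby(age_band, observed=True).mean()
print(missing_rate_by_age)

# Draw a fixed-size random sample before plotting to avoid over-plotting.
plot_sample = df.sample(n=5_000, random_state=0)
```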

Progressive disclosure

  • ../references/automated-profiling.md: ydata-profiling, Sweetviz, D-Tale
  • ../references/visualization-patterns.md: Matplotlib, Seaborn, Plotly patterns
  • ../references/statistical-tests.md: SciPy statistical tests guide
  • ../references/large-dataset-eda.md: Sampling, Polars, Dask approaches

Related skills

  • @data-science-feature-engineering: Next step after EDA
  • @data-science-model-evaluation: Validate modeling assumptions
  • @data-engineering-quality: Data validation frameworks

References
