exploratory-data-analysis

Exploratory Data Analysis

Discover patterns, anomalies, and relationships in tabular data through statistical analysis and visualization.
Supported formats: CSV, Excel (.xlsx, .xls), JSON, Parquet, TSV, Feather, HDF5, Pickle

Standard Workflow

  1. Run statistical analysis:
       python scripts/eda_analyzer.py <data_file> -o <output_dir>
  2. Generate visualizations:
       python scripts/visualizer.py <data_file> -o <output_dir>
  3. Read analysis results from <output_dir>/eda_analysis.json
  4. Create the report using the structure in assets/report_template.md
  5. Present findings with key insights and visualizations
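
The two script invocations in this workflow can be driven from Python. The following is a minimal sketch, not part of the skill itself: the helper names `build_commands` and `run_workflow` are hypothetical, and the only facts taken from the text are the script paths, the `-o` flag, and the `eda_analysis.json` output name.

```python
import subprocess
import sys
from pathlib import Path


def build_commands(data_file, output_dir):
    """Return the argv lists for the analyzer and visualizer steps."""
    return [
        [sys.executable, "scripts/eda_analyzer.py", str(data_file), "-o", str(output_dir)],
        [sys.executable, "scripts/visualizer.py", str(data_file), "-o", str(output_dir)],
    ]


def run_workflow(data_file, output_dir):
    """Run both scripts in order, stopping at the first failure."""
    for argv in build_commands(data_file, output_dir):
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(argv, check=True)
    # Path of the JSON file that the next workflow step reads.
    return Path(output_dir) / "eda_analysis.json"
```

Calling `run_workflow("sales_data.csv", "./output")` would run both steps and return the path of the analysis JSON, assuming the scripts exist relative to the working directory.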

Analysis Capabilities

Statistical Analysis

Run scripts/eda_analyzer.py to generate a comprehensive analysis:
    python scripts/eda_analyzer.py sales_data.csv -o ./output
This produces output/eda_analysis.json, which contains:
  • Dataset shape, types, memory usage
  • Missing data patterns and percentages
  • Summary statistics (numeric and categorical)
  • Outlier detection (IQR and Z-score methods)
  • Distribution analysis with normality tests
  • Correlation matrices (Pearson and Spearman)
  • Data quality metrics (completeness, duplicates)
  • Automated insights
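
To make the two outlier methods concrete, here is a standard-library sketch of IQR and Z-score detection. It is an illustration only: the text does not say which thresholds or quantile method scripts/eda_analyzer.py uses, so the 1.5 × IQR fence, the |z| > 3 cutoff, and the function names are assumptions based on common convention.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (k=1.5 is an assumed default)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def zscore_outliers(values, threshold=3.0):
    """Return values whose absolute z-score exceeds the (assumed) threshold."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    if sd == 0:
        return []  # constant data has no outliers by this definition
    return [v for v in values if abs((v - mean) / sd) > threshold]


sample = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 95]
print(iqr_outliers(sample))  # [95]
```

The two methods can disagree on small samples: a single extreme value inflates the standard deviation and so shrinks its own z-score, which is one reason to report both.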

Visualizations

Run scripts/visualizer.py to generate plots:
    python scripts/visualizer.py sales_data.csv -o ./output
This creates high-resolution (300 DPI) PNG files in output/eda_visualizations/:
  • Missing data heatmaps and bar charts
  • Distribution plots (histograms with KDE)
  • Box plots and violin plots for outliers
  • Correlation heatmaps
  • Scatter matrices for numeric relationships
  • Categorical bar charts
  • Time series plots (if datetime columns detected)
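
Because the individual file names are not listed here, a report builder may want to discover the generated plots rather than hard-code them. A minimal sketch, assuming only the eda_visualizations/ subdirectory named above (the helper name `list_plots` is hypothetical):

```python
from pathlib import Path


def list_plots(output_dir):
    """Return the PNG files under <output_dir>/eda_visualizations, sorted by path."""
    plot_dir = Path(output_dir) / "eda_visualizations"
    if not plot_dir.is_dir():
        return []  # visualizer has not been run yet, or wrote nothing
    return sorted(p for p in plot_dir.iterdir() if p.suffix.lower() == ".png")
```

An empty result is a useful signal that the visualization step failed or was skipped before the report is assembled.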

Automated Insights

Access generated insights from the "insights" key in the analysis JSON:
  • Dataset size considerations
  • Missing data warnings (when exceeding thresholds)
  • Strong correlations for feature engineering
  • High outlier rate flags
  • Skewness requiring transformations
  • Duplicate detection
  • Categorical imbalance warnings
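
The exact shape of the "insights" value is not documented here, so the sketch below reads it defensively and normalizes it to a list of strings. The helper name `load_insights` and the handling of dict and string values are assumptions, not a description of what the analyzer actually writes.

```python
import json


def load_insights(json_path):
    """Load the analysis JSON and return its insights as a list of strings."""
    with open(json_path, encoding="utf-8") as f:
        results = json.load(f)
    insights = results.get("insights", [])
    # The real structure is unknown; accept a dict, a single string, or a list.
    if isinstance(insights, dict):
        return [f"{key}: {value}" for key, value in insights.items()]
    if isinstance(insights, str):
        return [insights]
    return [str(item) for item in insights]
```

For example, `load_insights("./output/eda_analysis.json")` would return an empty list if the key is absent, which lets a report skip the section cleanly.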

Reference Materials

Statistical Interpretation

See references/statistical_tests_guide.md for detailed guidance on:
  • Normality tests (Shapiro-Wilk, Anderson-Darling, Kolmogorov-Smirnov)
  • Distribution characteristics (skewness, kurtosis)
  • Correlation methods (Pearson, Spearman)
  • Outlier detection (IQR, Z-score)
  • Hypothesis testing and data transformations
Use when interpreting statistical results or explaining findings.
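
As a quick reminder of what the distribution characteristics measure, the sketch below computes moment-based skewness and excess kurtosis with the standard library. These are the textbook population formulas; the guide and the analyzer may use bias-corrected variants, so treat the function names and exact definitions as assumptions.

```python
import statistics


def _central_moment(values, order):
    """Return the population central moment of the given order."""
    mean = statistics.fmean(values)
    return sum((v - mean) ** order for v in values) / len(values)


def skewness(values):
    """Moment-based skewness m3 / m2**1.5; 0 for symmetric data."""
    m2 = _central_moment(values, 2)
    if m2 == 0:
        return float("nan")  # undefined for constant data
    return _central_moment(values, 3) / m2 ** 1.5


def excess_kurtosis(values):
    """Moment-based excess kurtosis m4 / m2**2 - 3; 0 for a normal distribution."""
    m2 = _central_moment(values, 2)
    if m2 == 0:
        return float("nan")  # undefined for constant data
    return _central_moment(values, 4) / m2 ** 2 - 3
```

A clearly positive skewness indicates a long right tail, the usual case where a log-type transformation is worth considering.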

Methodology

See references/eda_best_practices.md for comprehensive guidance on:
  • 6-step EDA process framework
  • Univariate, bivariate, multivariate analysis approaches
  • Visualization and statistical analysis guidelines
  • Common pitfalls and domain-specific considerations
  • Communication strategies for different audiences
Use when planning analysis or handling specific scenarios.

Report Template

Use assets/report_template.md to structure findings. The template includes:
  • Executive summary
  • Dataset overview
  • Data quality assessment
  • Univariate, bivariate, and multivariate analysis
  • Outlier analysis
  • Key insights and recommendations
  • Limitations and appendices
Fill sections with analysis JSON results and embed visualizations using markdown image syntax.
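
Embedding a plot comes down to emitting one markdown image line per file. A small sketch, assuming nothing about the real file names (the helper `image_markdown` and the example path are hypothetical):

```python
from pathlib import Path


def image_markdown(image_path, description=None):
    """Return a markdown image line; the alt text defaults to the file stem."""
    path = Path(image_path)
    if description is None:
        # Turn a stem such as "missing_data" into "Missing data".
        description = path.stem.replace("_", " ").capitalize()
    return f"![{description}]({path.as_posix()})"


# Hypothetical file name, used only to show the output shape.
print(image_markdown("output/eda_visualizations/missing_data.png"))
```

Paths are written relative to wherever the report is saved, so the report and the output directory need to keep the same relative layout for the images to render.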

Example: Complete Analysis

User request: "Explore this sales_data.csv file"

```bash
# 1. Run analysis
python scripts/eda_analyzer.py sales_data.csv -o ./output

# 2. Generate visualizations
python scripts/visualizer.py sales_data.csv -o ./output
```

```python
# 3. Read results
import json

with open('./output/eda_analysis.json') as f:
    results = json.load(f)

# 4. Build report from assets/report_template.md
# - Fill sections with results
# - Embed images (e.g. the Missing Data plot)
# - Include insights from results['insights']
# - Add recommendations
```

Special Cases

Dataset Size Strategy

If < 100 rows: Note sample size limitations, use non-parametric methods
If 100-1M rows: Standard workflow applies
If > 1M rows: Sample first for quick exploration, note sample size in report, recommend distributed computing for full analysis
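
The three size bands translate directly into a small dispatch function, and the "sample first" advice for very large files can be done in one pass with reservoir sampling. This is a sketch under assumptions: the function names, the returned labels, and the choice of reservoir sampling (Algorithm R) are illustrative, not something the scripts are stated to implement.

```python
import random


def size_strategy(n_rows):
    """Map a row count to the strategy bands described above."""
    if n_rows < 100:
        return "small"      # note sample-size limits, prefer non-parametric methods
    if n_rows <= 1_000_000:
        return "standard"   # standard workflow applies
    return "large"          # sample first, note sample size in the report


def reservoir_sample(rows, k, seed=0):
    """Draw k rows uniformly at random from an iterable in a single pass."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row
    return sample
```

Reservoir sampling never holds more than k rows in memory, which suits files too large to load before sampling. Fixing the seed keeps the sample reproducible so the reported sample size matches what was analyzed.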

Data Characteristics

High-dimensional (>50 columns): Focus on key variables first, use correlation analysis to identify groups, consider PCA or feature selection. See references/eda_best_practices.md for guidance.
Time series: Datetime columns auto-detected, temporal visualizations generated automatically. Consider trends, seasonality, patterns.
Imbalanced: Categorical analysis flags imbalances automatically. Report distributions prominently, recommend stratified sampling if needed.
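
To show what an imbalance flag might look like in practice, the sketch below computes category shares and flags a column when one category dominates. The 0.8 cutoff and both function names are assumptions; the text does not say what threshold the categorical analysis uses.

```python
from collections import Counter


def category_shares(values):
    """Return each category's share of the total, most frequent first."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.most_common()}


def is_imbalanced(values, max_share=0.8):
    """Flag a column whose largest category exceeds max_share (assumed cutoff)."""
    shares = category_shares(values)
    return bool(shares) and max(shares.values()) > max_share
```

Reporting the full output of `category_shares` alongside the flag keeps the distribution visible, as recommended above, rather than reducing it to a yes/no warning.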

Output Guidelines

Format findings as markdown:
  • Use headers, tables, and lists for structure
  • Embed visualizations:
    ![Description](path/to/image.png)
  • Include code blocks for suggested transformations
  • Highlight key insights
Make reports actionable:
  • Provide clear recommendations
  • Flag data quality issues requiring attention
  • Suggest next steps (modeling, feature engineering, further analysis)
  • Tailor communication to user's technical level
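
Tables are the most mechanical part of the markdown formatting, so a helper is easy to sketch. The function name `markdown_table` and the example column names are hypothetical; the output is a plain pipe table.

```python
def markdown_table(headers, rows):
    """Render headers and rows as a markdown pipe table."""
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(cell) for cell in row) + " |")
    return "\n".join(lines)


# Hypothetical values, used only to show the output shape.
print(markdown_table(["column", "missing %"], [["price", 2.5], ["region", 0.0]]))
```

Cells are converted with `str`, so numbers should be rounded before they are passed in if the report needs a fixed precision.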

Error Handling

Unsupported formats: Request conversion to a supported format (for example CSV, Excel, JSON, or Parquet)
Files too large: Recommend sampling or chunked processing
Corrupted data: Report specific errors, suggest cleaning steps, attempt partial analysis
Empty columns: Flag in data quality section, recommend removal or investigation
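
An unsupported format can be caught before either script runs by checking the file extension. In this sketch the extension set is derived from the supported-formats list at the top of this document; the specific HDF5 and Pickle suffixes (.h5, .hdf5, .pkl, .pickle) and the helper name `check_format` are assumptions.

```python
from pathlib import Path

# Extensions inferred from the supported-formats list; HDF5 and Pickle
# suffixes are assumed, since the text names the formats but not the suffixes.
SUPPORTED_EXTENSIONS = {
    ".csv", ".tsv", ".xlsx", ".xls", ".json", ".parquet",
    ".feather", ".h5", ".hdf5", ".pkl", ".pickle",
}


def check_format(path):
    """Return the lowercased extension, or raise ValueError if unsupported."""
    suffix = Path(path).suffix.lower()
    if suffix not in SUPPORTED_EXTENSIONS:
        raise ValueError(
            f"Unsupported format {suffix or '(none)'}: "
            "please convert to CSV, Excel, JSON, or Parquet"
        )
    return suffix
```

Raising early with a conversion hint matches the guidance above and avoids a less readable failure inside the analysis script.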

Resources

Scripts (handle all formats automatically):
  • scripts/eda_analyzer.py - Statistical analysis engine
  • scripts/visualizer.py - Visualization generator
References (load as needed):
  • references/statistical_tests_guide.md - Test interpretation and methodology
  • references/eda_best_practices.md - EDA process and best practices
Template:
  • assets/report_template.md - Professional report structure

Key Points

  • Run both scripts for complete analysis
  • Structure reports using the template
  • Provide actionable insights, not just statistics
  • Use reference guides for detailed interpretations
  • Document data quality issues and limitations
  • Make clear recommendations for next steps