exploratory-data-analysis


Exploratory Data Analysis

Overview

Perform comprehensive exploratory data analysis (EDA) on scientific data files across multiple domains. This skill provides automated file type detection, format-specific analysis, and data quality assessment, and it generates detailed markdown reports suitable for documentation and downstream analysis planning.

Key Capabilities:

Automatic detection and analysis of 200+ scientific file formats
Comprehensive format-specific metadata extraction
Data quality and integrity assessment
Statistical summaries and distributions
Visualization recommendations
Downstream analysis suggestions
Markdown report generation


When to Use This Skill


Use this skill when:

User provides a path to a scientific data file for analysis
User asks to "explore", "analyze", or "summarize" a data file
User wants to understand the structure and content of scientific data
User needs a comprehensive report of a dataset before analysis
User wants to assess data quality or completeness
User asks what type of analysis is appropriate for a file


Supported File Categories

The skill covers scientific file formats organized into six major categories:

1. Chemistry and Molecular Formats (60+ extensions)

Structure files, computational chemistry outputs, molecular dynamics trajectories, and chemical databases.

File types include: `.pdb`, `.cif`, `.mol`, `.mol2`, `.sdf`, `.xyz`, `.smi`, `.gro`, `.log`, `.fchk`, `.cube`, `.dcd`, `.xtc`, `.trr`, `.prmtop`, `.psf`, and more.

Reference file: `references/chemistry_molecular_formats.md`

2. Bioinformatics and Genomics Formats (50+ extensions)

Sequence data, alignments, annotations, variants, and expression data.

File types include: `.fasta`, `.fastq`, `.sam`, `.bam`, `.vcf`, `.bed`, `.gff`, `.gtf`, `.bigwig`, `.h5ad`, `.loom`, `.counts`, `.mtx`, and more.

Reference file: `references/bioinformatics_genomics_formats.md`

3. Microscopy and Imaging Formats (45+ extensions)

Microscopy images, medical imaging, whole slide imaging, and electron microscopy.

File types include: `.tif`, `.nd2`, `.lif`, `.czi`, `.ims`, `.dcm`, `.nii`, `.mrc`, `.dm3`, `.vsi`, `.svs`, `.ome.tiff`, and more.

Reference file: `references/microscopy_imaging_formats.md`

4. Spectroscopy and Analytical Chemistry Formats (35+ extensions)

NMR, mass spectrometry, IR/Raman, UV-Vis, X-ray, chromatography, and other analytical techniques.

File types include: `.fid`, `.mzML`, `.mzXML`, `.raw`, `.mgf`, `.spc`, `.jdx`, `.xy`, `.cif` (crystallography), `.wdf`, and more.

Reference file: `references/spectroscopy_analytical_formats.md`

5. Proteomics and Metabolomics Formats (30+ extensions)

Mass spectrometry proteomics, metabolomics, lipidomics, and multi-omics data.

File types include: `.mzML`, `.pepXML`, `.protXML`, `.mzid`, `.mzTab`, `.sky`, `.mgf`, `.msp`, `.h5ad`, and more.

Reference file: `references/proteomics_metabolomics_formats.md`

6. General Scientific Data Formats (30+ extensions)

Arrays, tables, hierarchical data, compressed archives, and common scientific formats.

File types include: `.npy`, `.npz`, `.csv`, `.xlsx`, `.json`, `.hdf5`, `.zarr`, `.parquet`, `.mat`, `.fits`, `.nc`, `.xml`, and more.

Reference file: `references/general_scientific_formats.md`

Workflow

Step 1: File Type Detection

When a user provides a file path, first identify the file type:

Extract the file extension
Look up the extension in the appropriate reference file
Identify the file category and format description
Load format-specific information

Example:

User: "Analyze data.fastq"
→ Extension: .fastq
→ Category: bioinformatics_genomics
→ Format: FASTQ Format (sequence data with quality scores)
→ Reference: references/bioinformatics_genomics_formats.md
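
The lookup in this step can be sketched as a small function. This is a minimal illustration, not the skill's actual implementation: `detect_category` and the `EXTENSION_CATEGORIES` table are hypothetical names, and the table is only a short excerpt. Compound suffixes such as `.ome.tiff` are tried before the final suffix.

```python
from pathlib import Path

# Hypothetical excerpt of an extension -> category table; the reference
# files cover far more extensions than are listed here.
EXTENSION_CATEGORIES = {
    ".pdb": "chemistry_molecular",
    ".fastq": "bioinformatics_genomics",
    ".nd2": "microscopy_imaging",
    ".ome.tiff": "microscopy_imaging",
    ".csv": "general_scientific",
}


def detect_category(filepath):
    """Return (extension, category, reference_file); category is None if unknown."""
    suffixes = [s.lower() for s in Path(filepath).suffixes]
    # Try the longest compound suffix first (".ome.tiff" before ".tiff").
    for start in range(len(suffixes)):
        ext = "".join(suffixes[start:])
        if ext in EXTENSION_CATEGORIES:
            category = EXTENSION_CATEGORIES[ext]
            return ext, category, f"references/{category}_formats.md"
    ext = suffixes[-1] if suffixes else ""
    return ext, None, None


print(detect_category("data.fastq"))
```

An unknown extension returns `None` for the category, which is the cue to fall back to the steps under "Unknown File Types".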

Step 2: Load Format-Specific Information

Based on the file type, read the corresponding reference file to understand:

Typical Data: What kind of data this format contains
Use Cases: Common applications for this format
Python Libraries: How to read the file in Python
EDA Approach: What analyses are appropriate for this data type

Search the reference file for the specific extension (e.g., search for "### .fastq" in `bioinformatics_genomics_formats.md`).

Step 3: Perform Data Analysis

Use the `scripts/eda_analyzer.py` script or implement a custom analysis:

Option A: Use the analyzer script

```bash
python scripts/eda_analyzer.py <filepath> [output.md]
```

The script automatically:

1. Detects the file type
2. Loads reference information
3. Performs format-specific analysis
4. Generates a markdown report


**Option B: Custom analysis in the conversation**
Based on the format information from the reference file, perform appropriate analysis:

For tabular data (CSV, TSV, Excel):
- Load with pandas
- Check dimensions, data types
- Analyze missing values
- Calculate summary statistics
- Identify outliers
- Check for duplicates
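
The tabular checks above can be illustrated with a short sketch. To keep it runnable without extra dependencies it uses only the standard library rather than pandas, and it assumes the 1.5 × IQR rule for outliers, which is one common choice and not something the skill prescribes. `summarize_table` is a hypothetical helper name.

```python
import csv
import io
import statistics


def summarize_table(text):
    """Summarize CSV text: shape, missing cells, duplicate rows, IQR outliers."""
    rows = list(csv.DictReader(io.StringIO(text)))
    columns = list(rows[0].keys()) if rows else []
    summary = {
        "n_rows": len(rows),
        "n_columns": len(columns),
        "duplicate_rows": len(rows) - len({tuple(r.values()) for r in rows}),
        "columns": {},
    }
    for col in columns:
        values = [r[col] for r in rows]
        present = [v for v in values if v not in ("", None)]
        info = {"missing": len(values) - len(present)}
        try:
            numbers = [float(v) for v in present]
        except ValueError:
            numbers = None  # treat the column as non-numeric
        if numbers and len(numbers) >= 4:
            q1, _, q3 = statistics.quantiles(numbers, n=4)
            low, high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
            info["mean"] = statistics.fmean(numbers)
            info["outliers"] = [x for x in numbers if x < low or x > high]
        summary["columns"][col] = info
    return summary


sample = "sample,value\na,10\nb,11\nc,12\nd,13\ne,12\nf,11\ng,95\nb,11\nh,\n"
print(summarize_table(sample))
```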

For sequence data (FASTA, FASTQ):
- Count sequences
- Analyze length distributions
- Calculate GC content
- Assess quality scores (FASTQ)
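
As a rough sketch of the sequence checks, the function below computes read count, length range, GC fraction, and mean base quality from a FASTQ stream. It is a plain-Python illustration under two stated assumptions: records are exactly four lines each, and qualities use Phred+33 encoding. `fastq_stats` is a hypothetical name; a dedicated parser such as Biopython is the more robust choice for real data.

```python
import io


def fastq_stats(handle, phred_offset=33):
    """Compute read count, length range, GC fraction, and mean quality.

    Assumes 4-line FASTQ records and Phred+33 quality encoding.
    """
    n_reads = total_bases = gc_bases = quality_sum = 0
    lengths = []
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line
        qual = handle.readline().strip()
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"Malformed FASTQ record near read {n_reads + 1}")
        n_reads += 1
        lengths.append(len(seq))
        total_bases += len(seq)
        gc_bases += sum(seq.upper().count(base) for base in "GC")
        quality_sum += sum(ord(ch) - phred_offset for ch in qual)
    return {
        "reads": n_reads,
        "min_length": min(lengths, default=0),
        "max_length": max(lengths, default=0),
        "gc_fraction": gc_bases / total_bases if total_bases else 0.0,
        "mean_quality": quality_sum / total_bases if total_bases else 0.0,
    }


example = io.StringIO("@r1\nGGCA\n+\nIIII\n@r2\nATAT\n+\n!!!!\n")
print(fastq_stats(example))
```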

For images (TIFF, ND2, CZI):
- Check dimensions (X, Y, Z, C, T)
- Analyze bit depth and value range
- Extract metadata (channels, timestamps, spatial calibration)
- Calculate intensity statistics

For arrays (NPY, HDF5):
- Check shape and dimensions
- Analyze data type
- Calculate statistical summaries
- Check for missing/invalid values


Step 4: Generate Comprehensive Report

Create a markdown report with the following sections:

Required Sections:

1. Title and Metadata
   - Filename and timestamp
   - File size and location
2. Basic Information
   - File properties
   - Format identification
3. File Type Details
   - Format description from reference
   - Typical data content
   - Common use cases
   - Python libraries for reading
4. Data Analysis
   - Structure and dimensions
   - Statistical summaries
   - Quality assessment
   - Data characteristics
5. Key Findings
   - Notable patterns
   - Potential issues
   - Quality metrics
6. Recommendations
   - Preprocessing steps
   - Appropriate analyses
   - Tools and methods
   - Visualization approaches

Template Location

Use `assets/report_template.md` as a guide for report structure.

Step 5: Save Report

Save the markdown report with a descriptive filename:

Pattern:
```
{original_filename}_eda_report.md
```

Example: `experiment_data.fastq` → `experiment_data_eda_report.md`
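
Steps 4 and 5 can be sketched together: derive the report filename from the data file, then emit a skeleton with the required sections. The heading layout here is an assumption for illustration; the actual structure should follow `assets/report_template.md`. `report_path` and `report_skeleton` are hypothetical names.

```python
from datetime import datetime
from pathlib import Path

REQUIRED_SECTIONS = [
    "Basic Information",
    "File Type Details",
    "Data Analysis",
    "Key Findings",
    "Recommendations",
]


def report_path(data_file):
    """Map experiment_data.fastq -> experiment_data_eda_report.md (same folder)."""
    path = Path(data_file)
    return path.with_name(f"{path.stem}_eda_report.md")


def report_skeleton(data_file, size_bytes, generated=None):
    """Return markdown text with title, metadata, and empty required sections."""
    generated = generated or datetime.now()
    lines = [
        f"# EDA Report: {Path(data_file).name}",
        "",
        f"- Generated: {generated:%Y-%m-%d %H:%M:%S}",
        f"- File size: {size_bytes} bytes",
        f"- Location: {Path(data_file).parent}",
        "",
    ]
    for section in REQUIRED_SECTIONS:
        lines += [f"## {section}", "", "_To be filled in._", ""]
    return "\n".join(lines)


print(report_path("experiment_data.fastq"))
```

Note that the pattern drops the original extension, so `{original_filename}` corresponds to the file stem.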


Detailed Format References


Each reference file contains comprehensive information for dozens of file types. To find information about a specific format:

Identify the category from the extension
Read the appropriate reference file
Search for the section heading matching the extension (e.g., "### .pdb")
Extract the format information

Reference File Structure

Each format entry includes:

Description: What the format is
Typical Data: What it contains
Use Cases: Common applications
Python Libraries: How to read it (with code examples)
EDA Approach: Specific analyses to perform

Example lookup:

```markdown
### .pdb - Protein Data Bank
Description: Standard format for 3D structures of biological macromolecules
Typical Data: Atomic coordinates, residue information, secondary structure
Use Cases: Protein structure analysis, molecular visualization, docking
Python Libraries:
- `Biopython`: `Bio.PDB`
- `MDAnalysis`: `MDAnalysis.Universe('file.pdb')`
EDA Approach:
- Structure validation (bond lengths, angles)
- B-factor distribution
- Missing residues detection
- Ramachandran plots
```

Best Practices

Reading Reference Files

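Reference files are large (10,000+ words each). To use them efficiently:

Search by extension: Use grep or a regular expression to find the specific format

```python
import re

with open('references/chemistry_molecular_formats.md', 'r', encoding='utf-8') as f:
    content = f.read()

# Match from the "### .pdb" heading up to the next "### " heading or end of file.
# Anchoring on line starts keeps "#" characters inside the section from ending the match.
pattern = r'^### \.pdb\b.*?(?=^### |\Z)'
match = re.search(pattern, content, re.IGNORECASE | re.DOTALL | re.MULTILINE)
section = match.group(0) if match else None
```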

Extract relevant sections: Don't load entire reference files into context unnecessarily
Cache format info: If analyzing multiple files of the same type, reuse the format information
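
One way to combine section extraction with caching is sketched below. `load_format_info` is a hypothetical helper, and it assumes each format section starts with a `### <extension>` heading at the beginning of a line, as in the lookups described in this document. `functools.lru_cache` keeps results per `(reference_file, extension)` pair so repeated lookups do not re-read the file.

```python
import re
from functools import lru_cache
from pathlib import Path


@lru_cache(maxsize=None)
def load_format_info(reference_file, extension):
    """Return the '### <extension>' section of a reference file, or None.

    Results are cached per (reference_file, extension), so repeated lookups
    for the same format do not re-read the file.
    """
    content = Path(reference_file).read_text(encoding="utf-8")
    # From the matching heading up to the next "### " heading or end of file.
    pattern = rf"^### {re.escape(extension)}\b.*?(?=^### |\Z)"
    match = re.search(pattern, content, re.IGNORECASE | re.DOTALL | re.MULTILINE)
    return match.group(0).strip() if match else None
```

Only the returned section needs to enter the conversation context, which addresses both points above.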


Data Analysis


Sample large files: For files with millions of records, analyze a representative sample
Handle errors gracefully: Many scientific formats require specific libraries; provide clear installation instructions
Validate metadata: Cross-check metadata consistency (e.g., stated dimensions vs actual data)
Consider data provenance: Note instrument, software versions, processing steps
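
For the sampling advice, reservoir sampling is one technique that draws a uniform random sample in a single pass without knowing the record count in advance. The sketch below is an illustration of that technique, not something the skill mandates; `reservoir_sample` is a hypothetical name and the fixed seed is only there to make runs reproducible.

```python
import random


def reservoir_sample(records, k, seed=0):
    """Return k records chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for index, record in enumerate(records):
        if index < k:
            sample.append(record)  # fill the reservoir first
        else:
            # Keep the new record with probability k / (index + 1).
            slot = rng.randint(0, index)
            if slot < k:
                sample[slot] = record
    return sample


print(len(reservoir_sample(range(1_000_000), 1000)))
```

Unlike taking the first N records, this avoids bias when a file is sorted or when quality drifts over the course of an acquisition.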

Report Generation

Be comprehensive: Include all relevant information for downstream analysis
Be specific: Provide concrete recommendations based on the file type
Be actionable: Suggest specific next steps and tools
Include code examples: Show how to load and work with the data

Examples

Example 1: Analyzing a FASTQ file

```python
# User provides: "Analyze reads.fastq"

# 1. Detect file type
extension = '.fastq'
category = 'bioinformatics_genomics'

# 2. Read reference info
# Search references/bioinformatics_genomics_formats.md for "### .fastq"

# 3. Perform analysis
from Bio import SeqIO
sequences = list(SeqIO.parse('reads.fastq', 'fastq'))
# Calculate: read count, length distribution, quality scores, GC content

# 4. Generate report
# Include: format description, analysis results, QC recommendations

# 5. Save as: reads_eda_report.md
```

Example 2: Analyzing a CSV dataset

```python
# User provides: "Explore experiment_results.csv"

# 1. Detect: .csv → general_scientific

# 2. Load reference for CSV format

# 3. Analyze
import pandas as pd
df = pd.read_csv('experiment_results.csv')
# Dimensions, dtypes, missing values, statistics, correlations

# 4. Generate report with:
# - Data structure
# - Missing value patterns
# - Statistical summaries
# - Correlation matrix
# - Outlier detection results

# 5. Save report
```

Example 3: Analyzing microscopy data

```python
# User provides: "Analyze cells.nd2"

# 1. Detect: .nd2 → microscopy_imaging (Nikon format)

# 2. Read reference for ND2 format
# Learn: multi-dimensional (XYZCT), requires nd2reader

# 3. Analyze
from nd2reader import ND2Reader
with ND2Reader('cells.nd2') as images:
    # Extract: dimensions, channels, timepoints, metadata
    sizes = images.sizes        # axis sizes such as x, y, z, c, t
    metadata = images.metadata  # channels, pixel calibration, and other metadata
    # Calculate: intensity statistics, frame info

# 4. Generate report with:
# - Image dimensions (XY, Z-stacks, time, channels)
# - Channel wavelengths
# - Pixel size and calibration
# - Recommendations for image analysis

# 5. Save report
```

Troubleshooting

Missing Libraries

Many scientific formats require specialized libraries:

Problem: Import error when trying to read a file

Solution: Provide clear installation instructions

```python
try:
    from Bio import SeqIO
except ImportError:
    print("Install Biopython: uv pip install biopython")
```

Common requirements by category:

- Bioinformatics: `biopython`, `pysam`, `pyBigWig`
- Chemistry: `rdkit`, `mdanalysis`, `cclib`
- Microscopy: `tifffile`, `nd2reader`, `aicsimageio`, `pydicom`
- Spectroscopy: `nmrglue`, `pymzml`, `pyteomics`
- General: `pandas`, `numpy`, `h5py`, `scipy`
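
The same check can be done for several libraries at once, before any analysis starts. The sketch below uses `importlib.util.find_spec` from the standard library, which reports whether a module can be found without importing it. `missing_packages`, `install_hint`, and the `REQUIREMENTS` mapping are hypothetical names, and the mapping is only a short excerpt of import name to install name.

```python
import importlib.util

# Import name -> package name to install (excerpt for illustration).
REQUIREMENTS = {
    "Bio": "biopython",
    "pysam": "pysam",
    "nmrglue": "nmrglue",
    "h5py": "h5py",
}


def missing_packages(requirements):
    """Return install names for modules that cannot be found."""
    return [
        package
        for module, package in requirements.items()
        if importlib.util.find_spec(module) is None
    ]


def install_hint(packages):
    """Build a single install command for the missing packages."""
    return "uv pip install " + " ".join(packages)


missing = missing_packages(REQUIREMENTS)
if missing:
    print(install_hint(missing))
```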


Unknown File Types


If a file extension is not in the references:

Ask the user about the file format
Check if it's a vendor-specific variant
Attempt generic analysis based on file structure (text vs binary)
Provide general recommendations
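
The text-versus-binary check can be approximated by probing the first few kilobytes. This is a heuristic sketch with a hypothetical `sniff_file` helper: it treats a NUL byte or invalid UTF-8 as a sign of binary data, so UTF-16 text files would be reported as binary.

```python
def sniff_file(path, probe_size=4096):
    """Classify a file as 'text', 'binary', or 'empty' from its first bytes."""
    with open(path, "rb") as handle:
        chunk = handle.read(probe_size)
    if not chunk:
        return "empty"
    if b"\x00" in chunk:
        return "binary"
    try:
        chunk.decode("utf-8")
    except UnicodeDecodeError as err:
        # A multi-byte character cut off at the end of the probe is still text.
        if err.start < len(chunk) - 3:
            return "binary"
    return "text"
```

For a text result, the next step could be to print the first few lines and look for delimiters or header keywords; for a binary result, the leading magic bytes often identify the container format.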

Large Files

For very large files:

Use sampling strategies (first N records)
Use memory-mapped access (for HDF5, NPY)
Process in chunks (for CSV, FASTQ)
Provide estimates based on samples
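
Chunked processing can be sketched with the standard library alone. The example below computes the mean of one numeric CSV column while holding at most `chunk_size` rows in memory; `iter_chunks` and `column_mean_in_chunks` are hypothetical names, and the same pattern applies to other record-oriented formats.

```python
import csv
import io
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield lists of up to chunk_size items from any iterable."""
    iterator = iter(rows)
    while chunk := list(islice(iterator, chunk_size)):
        yield chunk


def column_mean_in_chunks(handle, column, chunk_size=10_000):
    """Mean of a numeric CSV column, holding at most chunk_size rows in memory."""
    total, count = 0.0, 0
    for chunk in iter_chunks(csv.DictReader(handle), chunk_size):
        values = [float(row[column]) for row in chunk if row[column] not in ("", None)]
        total += sum(values)
        count += len(values)
    return total / count if count else float("nan")


data = io.StringIO("x,label\n1,a\n2,b\n,c\n3,d\n")
print(column_mean_in_chunks(data, "x", chunk_size=2))
```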

Script Usage

The `scripts/eda_analyzer.py` script can be used directly:

```bash
# Basic usage
python scripts/eda_analyzer.py data.csv

# Specify output file
python scripts/eda_analyzer.py data.csv output_report.md

# The script will:
# 1. Auto-detect file type
# 2. Load format references
# 3. Perform appropriate analysis
# 4. Generate markdown report
```

The script supports automatic analysis for many common formats, but custom analysis in the conversation provides more flexibility and domain-specific insights.

Advanced Usage

Multi-File Analysis

When analyzing multiple related files:

Perform individual EDA on each file
Create a summary comparison report
Identify relationships and dependencies
Suggest integration strategies

Quality Control

For data quality assessment:

Check format compliance
Validate metadata consistency
Assess completeness
Identify outliers and anomalies
Compare to expected ranges/distributions
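
Two of these checks, completeness and comparison to an expected range, reduce to simple arithmetic. The sketch below is illustrative only: `check_expected_ranges` is a hypothetical helper, missing values are assumed to be represented as `None`, and the expected bounds must come from domain knowledge about the measurement.

```python
def check_expected_ranges(values, expected_min, expected_max):
    """Report completeness and the values that fall outside an expected range."""
    present = [v for v in values if v is not None]
    out_of_range = [v for v in present if not expected_min <= v <= expected_max]
    return {
        "completeness": len(present) / len(values) if values else 0.0,
        "out_of_range": out_of_range,
        "out_of_range_fraction": len(out_of_range) / len(present) if present else 0.0,
    }


# Example: pH readings, which are expected to lie between 0 and 14.
print(check_expected_ranges([7.0, 7.4, None, 15.2], 0, 14))
```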

Preprocessing Recommendations

Based on data characteristics, recommend:

Normalization strategies
Missing value imputation
Outlier handling
Batch correction
Format conversions

Resources

scripts/

- `eda_analyzer.py`: Comprehensive analysis script that can be run directly or imported

references/

- `chemistry_molecular_formats.md`: 60+ chemistry/molecular file formats
- `bioinformatics_genomics_formats.md`: 50+ bioinformatics formats
- `microscopy_imaging_formats.md`: 45+ imaging formats
- `spectroscopy_analytical_formats.md`: 35+ spectroscopy formats
- `proteomics_metabolomics_formats.md`: 30+ omics formats
- `general_scientific_formats.md`: 30+ general formats

assets/

- `report_template.md`: Comprehensive markdown template for EDA reports