data-analyst

Data Analyst

Overview

This skill supports data analysis workflows on CSV datasets. It analyzes missing value patterns, imputes missing data using a statistical method chosen per column, and creates interactive Plotly Dash dashboards for visualizing trends and patterns. Together these steps cover exploratory data analysis from data-quality assessment through visualization.

Core Capabilities

The data-analyst skill provides three main capabilities that can be used independently or as a complete workflow:

1. Missing Value Analysis

Automatically detect and analyze missing values in datasets, identifying patterns and suggesting optimal imputation strategies.

2. Intelligent Imputation

Apply sophisticated imputation methods tailored to each column's data type and distribution characteristics.

3. Interactive Dashboard Creation

Generate comprehensive Plotly Dash dashboards with multiple visualization types for trend analysis and exploration.

Complete Workflow

When a user requests complete data analysis with missing value handling and visualization, follow this workflow:

Step 1: Analyze Missing Values

Run the missing value analysis script to understand the data quality:
```bash
python3 scripts/analyze_missing_values.py <input_file.csv> <output_analysis.json>
```
What this does:
  • Detects missing values in each column
  • Identifies data types (numeric, categorical, temporal, etc.)
  • Calculates missing value statistics
  • Suggests appropriate imputation strategies per column
  • Generates detailed JSON report and console output
Review the output to understand:
  • Which columns have missing data
  • The percentage of missing values
  • The recommended imputation method for each column
  • Why each method was recommended
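As a rough illustration of the per-column checks listed above, the following standard-library sketch counts missing cells and reports their percentage for each column. The function name `profile_missing` and the set of tokens treated as missing are assumptions made for this example; the actual logic lives in `scripts/analyze_missing_values.py` and may differ.

```python
import csv
import io

# Tokens treated as missing in this sketch (an assumption, not the script's list).
MISSING_TOKENS = {"", "na", "nan", "null", "none"}

def profile_missing(csv_text):
    """Return {column: {"missing": count, "percent": share}} for CSV text."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    report = {}
    for column in (rows[0].keys() if rows else []):
        missing = sum(
            1 for row in rows
            if (row[column] or "").strip().lower() in MISSING_TOKENS
        )
        report[column] = {
            "missing": missing,
            "percent": round(100 * missing / len(rows), 1),
        }
    return report

# Small inline dataset: "age" is empty in two rows, "city" in one.
sample = "id,age,city\n1,34,Paris\n2,,Rome\n3,41,\n4,,Oslo\n"
print(profile_missing(sample))
```

A report like this is enough to answer the first two review questions (which columns have gaps, and how large they are) before looking at the recommended methods.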

Step 2: Impute Missing Values

Apply automatic imputation based on the analysis:
```bash
python3 scripts/impute_missing_values.py <input_file.csv> <analysis.json> <output_imputed.csv>
```
What this does:
  • Loads the analysis results (or performs analysis if not provided)
  • Applies the optimal imputation method to each column:
    • Mean: For normally distributed numeric data
    • Median: For skewed numeric data
    • Mode: For categorical variables
    • KNN: For multivariate numeric data with correlations
    • Forward fill: For time series data
    • Constant: For high-cardinality text fields
  • Handles edge cases (drops rows/columns when appropriate)
  • Generates imputation report with before/after statistics
  • Saves cleaned dataset
The script automatically:
  • Drops columns with >70% missing values
  • Drops rows where critical ID columns are missing
  • Performs batch KNN imputation for correlated variables
  • Creates detailed imputation log
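The mean-versus-median choice and the >70% drop rule described above can be sketched with the standard library alone. This is a minimal illustration, not the script's implementation: the function names, the skewness cutoff of 1.0, and the use of `None` for missing cells are all assumptions.

```python
import statistics

def skewness(values):
    """Population skewness (third standardized moment)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return 0.0
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)

def impute_numeric(column, drop_threshold=0.7, skew_threshold=1.0):
    """Fill None entries with the mean or median; return (values, method).

    Returns (None, "drop") when more than drop_threshold of the column is missing.
    Both thresholds are illustrative defaults.
    """
    observed = [v for v in column if v is not None]
    missing_share = 1 - len(observed) / len(column)
    if missing_share > drop_threshold:
        return None, "drop"
    # Strongly skewed data: the median is less affected by the long tail.
    if abs(skewness(observed)) > skew_threshold:
        fill, method = statistics.median(observed), "median"
    else:
        fill, method = statistics.fmean(observed), "mean"
    return [fill if v is None else v for v in column], method

print(impute_numeric([1, 2, 3, None, 5]))
print(impute_numeric([1, 1, 2, 2, None, 100]))
```

The second call shows why the distinction matters: a single large value pulls the mean far above the typical observation, so the median is the safer fill.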

Step 3: Create Interactive Dashboard

Generate an interactive Plotly Dash dashboard:
```bash
python3 scripts/create_dashboard.py <imputed_file.csv> <output_dir> <port>
```
Example:
```bash
python3 scripts/create_dashboard.py data_imputed.csv ./visualizations 8050
```
What this does:
  • Automatically detects column types (numeric, categorical, temporal)
  • Creates comprehensive visualizations:
    • Summary statistics table: Descriptive stats for all numeric columns
    • Time series plots: Trend analysis if date/time columns exist
    • Distribution plots: Histograms for understanding data distributions
    • Correlation heatmap: Relationships between numeric variables
    • Categorical analysis: Bar charts for categorical variables
    • Scatter plot matrix: Pairwise relationships between variables
  • Launches interactive Dash web server
  • Optionally saves static HTML visualizations
Access the dashboard at http://127.0.0.1:8050 (or at the port you specified).
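The column-type detection that decides which charts are built could look roughly like the sketch below. It is an assumption-laden illustration: the function name `detect_column_type`, the single `%Y-%m-%d` date format, and the rule "every non-empty value must parse" are chosen for this example and are not taken from `scripts/create_dashboard.py`.

```python
from datetime import datetime

def detect_column_type(values, date_format="%Y-%m-%d"):
    """Classify string values as 'numeric', 'temporal', 'categorical', or 'empty'."""
    present = [v for v in values if v not in ("", None)]
    if not present:
        return "empty"

    def all_parse(parser):
        # True only if every non-empty value is accepted by the parser.
        try:
            for v in present:
                parser(v)
            return True
        except ValueError:
            return False

    if all_parse(float):
        return "numeric"
    if all_parse(lambda v: datetime.strptime(v, date_format)):
        return "temporal"
    return "categorical"

print(detect_column_type(["1", "2.5", ""]))
print(detect_column_type(["2024-01-01", "2024-02-15"]))
print(detect_column_type(["red", "blue", "red"]))
```

Numeric columns would then feed the histograms, heatmap, and scatter matrix, temporal columns the time series plots, and categorical columns the bar charts.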

Individual Use Cases

Use Case A: Quick Missing Value Assessment

When the user wants to understand data quality without imputation:
```bash
python3 scripts/analyze_missing_values.py data.csv
```
Review the console output to understand missing value patterns and get recommendations.

Use Case B: Imputation Only

When the user has a dataset with missing values and wants cleaned data:
```bash
python3 scripts/impute_missing_values.py data.csv
```
This performs analysis and imputation in one step, producing data_imputed.csv.

Use Case C: Visualization Only

When the user has a clean dataset and wants interactive visualizations:
```bash
python3 scripts/create_dashboard.py clean_data.csv ./visualizations 8050
```
This creates a full dashboard without any preprocessing.

Use Case D: Custom Imputation Strategy

When the user wants to review and adjust imputation strategies:
  1. Run analysis first:
    ```bash
    python3 scripts/analyze_missing_values.py data.csv analysis.json
    ```
  2. Review analysis.json and discuss strategies with the user
  3. If needed, modify the imputation logic or parameters in the script
  4. Run imputation:
    ```bash
    python3 scripts/impute_missing_values.py data.csv analysis.json data_imputed.csv
    ```

Understanding Imputation Methods

The skill uses intelligent imputation strategies based on data characteristics. Key methods include:
  • Mean/Median: For numeric data (mean for normal distributions, median for skewed)
  • Mode: For categorical variables (most frequent value)
  • KNN (K-Nearest Neighbors): For multivariate numeric data where variables are correlated
  • Forward Fill: For time series data (carry last observation forward)
  • Interpolation: For smooth temporal trends
  • Constant Value: For high-cardinality text fields (e.g., "Unknown")
  • Drop: For columns with >70% missing or rows with missing IDs
For detailed information about when each method is appropriate, refer to references/imputation_methods.md.
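To make the difference between forward fill and interpolation concrete, here is a small standard-library sketch. The function names are hypothetical and `None` stands in for a missing observation; the scripts themselves may implement these methods differently.

```python
def forward_fill(values):
    """Replace each None with the most recent observed value."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

def interpolate_linear(values):
    """Fill interior runs of None on a straight line between their neighbours."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled

series = [10.0, None, None, 16.0, None]
print(forward_fill(series))        # [10.0, 10.0, 10.0, 16.0, 16.0]
print(interpolate_linear(series))  # [10.0, 12.0, 14.0, 16.0, None]
```

Forward fill holds the last value flat, which suits step-like signals, while interpolation assumes a smooth trend between observations. In this sketch interpolation leaves a trailing gap unfilled because there is no later observation to interpolate toward.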

Dashboard Features

The interactive dashboard includes:

Summary Statistics

  • Count, mean, std, min, max, quartiles for all numeric columns
  • Missing value counts and percentages
  • Sortable table format

Time Series Analysis

  • Line plots with markers for temporal trends
  • Multiple series support (up to 4 primary metrics)
  • Hover details with exact values
  • Unified hover mode for easy comparison

Distribution Analysis

  • Histograms for all numeric variables
  • 30-bin default for granular distribution view
  • Multi-panel layout for easy comparison
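For intuition about what a 30-bin histogram computes, the sketch below bins values into equal-width intervals between the column minimum and maximum. Equal-width binning and the function name `histogram` are assumptions for illustration; the dashboard delegates the actual binning to its plotting library.

```python
def histogram(values, bins=30):
    """Count values in `bins` equal-width intervals spanning [min, max]."""
    low, high = min(values), max(values)
    # Fall back to a width of 1.0 when all values are identical.
    width = (high - low) / bins or 1.0
    counts = [0] * bins
    for v in values:
        # The maximum value would land one past the end, so clamp it.
        index = min(int((v - low) / width), bins - 1)
        counts[index] += 1
    return counts

print(histogram([0, 1, 2, 3], bins=2))
print(sum(histogram(list(range(100)))))
```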

Correlation Analysis

  • Heatmap showing correlation coefficients
  • Color-coded from -1 (negative) to +1 (positive)
  • Annotated with exact correlation values
  • Useful for identifying relationships
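The values annotated on the heatmap are pairwise correlation coefficients. The source does not say which coefficient is used; the sketch below assumes Pearson correlation, and the function names are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Note: a constant column has zero variance and would divide by zero here.
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Return {name: {other_name: coefficient}} for a dict of numeric columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

data = {"x": [1, 2, 3, 4], "y": [2, 4, 6, 8], "z": [4, 3, 2, 1]}
matrix = correlation_matrix(data)
print(round(matrix["x"]["y"], 3), round(matrix["x"]["z"], 3))
```

Here `y` rises in lockstep with `x` and `z` falls in lockstep, so the two coefficients sit at the +1 and -1 ends of the color scale.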

Categorical Analysis

  • Bar charts for categorical variables
  • Top 10 categories shown (for high-cardinality variables)
  • Frequency counts displayed
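Limiting a bar chart to the most frequent categories can be sketched with `collections.Counter`. The function name `top_categories` and the choice to skip empty values are assumptions for this example.

```python
from collections import Counter

def top_categories(values, limit=10):
    """Return the `limit` most frequent non-missing values with their counts."""
    counts = Counter(v for v in values if v not in ("", None))
    return counts.most_common(limit)

cities = ["Paris"] * 3 + ["Rome"] * 2 + ["Oslo", ""]
print(top_categories(cities, limit=2))  # [('Paris', 3), ('Rome', 2)]
```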

Scatter Plot Matrix

  • Pairwise scatter plots for numeric variables
  • Limited to 5 variables for readability
  • Lower triangle shown (avoiding redundancy)
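The lower-triangle layout means each pair of variables is plotted once. A minimal sketch of the panel selection follows; taking the first five columns is an assumption, since the source does not say how the five variables are chosen, and the function name is hypothetical.

```python
from itertools import combinations

def scatter_matrix_pairs(columns, max_columns=5):
    """Return (row, column) variable pairs for the lower triangle of a scatter matrix."""
    selected = list(columns)[:max_columns]
    # combinations yields (earlier, later); the later variable labels the row.
    return [(later, earlier) for earlier, later in combinations(selected, 2)]

print(scatter_matrix_pairs(["a", "b", "c"]))
print(len(scatter_matrix_pairs(list("abcdefg"))))
```

With five variables this yields 10 panels instead of the 25 in a full grid, which is where the readability gain comes from.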

Setup and Dependencies

Before using the skill, ensure dependencies are installed:
```bash
pip install -r requirements.txt
```
Required packages:
  • pandas - Data manipulation and analysis
  • numpy - Numerical computing
  • scikit-learn - KNN imputation
  • plotly - Interactive visualizations
  • dash - Web dashboard framework
  • dash-bootstrap-components - Dashboard styling

Best Practices

For Analysis:

  1. Always run analysis before imputation to understand data quality
  2. Review suggested imputation methods - they're recommendations, not mandates
  3. Pay attention to missing value percentages (>40% requires careful consideration)
  4. Check that detected data types match expectations (e.g., confirm how numeric ID columns were classified)

For Imputation:

  1. Save the original dataset before imputation
  2. Review the imputation report to ensure methods make sense
  3. Check that imputed values are within reasonable ranges
  4. Consider creating missing indicators for important variables
  5. Document which imputation methods were used for reproducibility
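A missing indicator (item 4 above) records which cells were originally empty so that information survives imputation. The sketch below assumes rows are dictionaries and uses a hypothetical `<column>_missing` naming scheme; it should run on the data before values are filled in.

```python
def add_missing_indicator(rows, column):
    """Return copies of `rows` with a 0/1 `<column>_missing` flag added."""
    flag = f"{column}_missing"
    return [
        {**row, flag: int(row.get(column) in ("", None))}
        for row in rows
    ]

records = [{"age": 30}, {"age": None}, {"age": ""}]
print(add_missing_indicator(records, "age"))
```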

For Dashboards:

  1. Use imputed/cleaned data for the most accurate visualizations
  2. Save static HTML plots if sharing with non-technical stakeholders
  3. Use different ports if running multiple dashboards simultaneously
  4. For large datasets (>100k rows), consider sampling for faster rendering
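Sampling a large dataset before plotting (item 4 above) can be done reproducibly with a seeded random generator. The function name, the default seed, and the choice to preserve the original row order are assumptions for this sketch.

```python
import random

def sample_rows(rows, max_rows=100_000, seed=0):
    """Return at most `max_rows` rows, sampled uniformly and kept in original order."""
    if len(rows) <= max_rows:
        return list(rows)
    rng = random.Random(seed)
    # Sample indices, then sort them so time-ordered data stays ordered.
    keep = sorted(rng.sample(range(len(rows)), max_rows))
    return [rows[i] for i in keep]

subset = sample_rows(list(range(1000)), max_rows=100)
print(len(subset))  # 100
```

Keeping the sampled rows in their original order matters for the time series plots, where shuffled rows would scramble the trend lines.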

Handling Edge Cases

High Missing Rates (>50%)

The scripts automatically flag columns with >50% missing values. Options:
  • Drop the column if not critical
  • Create a missing indicator variable
  • Investigate why data is missing (may be informative)

Mixed Data Types

If a column contains mixed types (e.g., numbers and text):
  • The script detects the primary type
  • Consider cleaning the column before analysis
  • Use constant imputation for mixed-type text columns

Small Datasets

For datasets with <50 rows:
  • Simple imputation (mean/median/mode) is more stable
  • Avoid KNN (requires sufficient neighbors)
  • Consider dropping rows instead of imputing

Time Series Gaps

For time series with irregular timestamps:
  • Use forward fill for short gaps
  • Use interpolation for longer gaps with smooth trends
  • Consider the sampling frequency when choosing methods

Troubleshooting

Script fails with "module not found"

Install dependencies:
```bash
pip install -r requirements.txt
```

Dashboard won't start (port in use)

Specify a different port:
```bash
python3 scripts/create_dashboard.py data.csv ./viz 8051
```

KNN imputation is slow

KNN is computationally intensive for large datasets. For >50k rows, consider:
  • Using simpler methods (mean/median)
  • Sampling the data first
  • Using fewer columns in KNN

Imputed values seem incorrect

  • Review the analysis report - check detected data types
  • Verify the column is being detected correctly (numeric vs categorical)
  • Consider manual adjustment or different imputation method
  • Check for outliers that may affect mean/median calculations

Resources

scripts/

  • analyze_missing_values.py - Comprehensive missing value analysis with automatic strategy recommendation
  • impute_missing_values.py - Intelligent imputation using multiple methods tailored to data characteristics
  • create_dashboard.py - Interactive Plotly Dash dashboard generator with multiple visualization types

references/

  • imputation_methods.md - Detailed guide to missing value imputation strategies, decision frameworks, and best practices

Other Files

  • requirements.txt - Python dependencies for the skill