csv-analyzer

CSV Analyzer

Overview

Comprehensive CSV data analysis and visualization engine. Run the script, then use this guide to interpret results and provide insights to users.

Quick Start

bash

cd ~/.claude/skills/csv-analyzer/scripts
export $(grep -v '^#' /path/to/project/.env | xargs 2>/dev/null)
python3 analyze_csv.py /path/to/data.csv

Chart Selection Decision Tree

IMPORTANT: Choose charts based on what the user needs to understand:

What is the user trying to understand?
│
├── "What does my data look like?" (Overview)
│   └── Run with defaults → overview_dashboard.png
│
├── "Is my data clean?" (Quality)
│   └── Check: quality_score, missing_values, duplicates
│   └── Show: missing_values.png if problems exist
│
├── "What's the distribution?" (Single Variable)
│   ├── Numeric → numeric_distributions.png (histogram + KDE)
│   ├── Categorical → categorical_distributions.png (bar chart)
│   └── Time-based → time_series.png
│
├── "Are there outliers?" (Anomalies)
│   └── box_plots.png → points beyond whiskers are outliers
│
├── "How are variables related?" (Relationships)
│   ├── 2 numeric vars → correlation_heatmap.png
│   ├── 2-6 numeric vars → pairplot.png (scatter matrix)
│   ├── Numeric vs Categorical → violin_plot.png
│   └── All numeric → correlation_heatmap.png
│
└── "Can I predict X from Y?" (Predictive)
    └── correlation_heatmap.png → |r| > 0.5 suggests predictive power
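The tree above can be read as a lookup from the user's goal to chart files. A minimal sketch, assuming hypothetical goal labels ("overview", "quality", and so on) and a hypothetical helper `choose_charts`; neither is part of analyze_csv.py, and only the file names come from the tree:

```python
from typing import List


def choose_charts(goal: str, numeric_cols: int = 0, has_categorical: bool = False,
                  has_datetime: bool = False, has_missing: bool = False) -> List[str]:
    """Return chart file names for a user goal, following the decision tree."""
    if goal == "overview":
        return ["overview_dashboard.png"]
    if goal == "quality":
        # Only show the missing-values chart when there is a problem to show.
        return ["missing_values.png"] if has_missing else []
    if goal == "distribution":
        charts = []
        if numeric_cols > 0:
            charts.append("numeric_distributions.png")
        if has_categorical:
            charts.append("categorical_distributions.png")
        if has_datetime:
            charts.append("time_series.png")
        return charts
    if goal == "outliers":
        return ["box_plots.png"]
    if goal == "relationships":
        charts = []
        if numeric_cols >= 2:
            charts.append("correlation_heatmap.png")
        if 2 <= numeric_cols <= 6:
            charts.append("pairplot.png")
        if numeric_cols >= 1 and has_categorical:
            charts.append("violin_plot.png")
        return charts
    if goal == "predictive":
        return ["correlation_heatmap.png"]
    raise ValueError(f"unknown goal: {goal!r}")
```

For example, `choose_charts("relationships", numeric_cols=3, has_categorical=True)` yields the heatmap, the pairplot, and the violin plot, in that order.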

How to Interpret Results (For Claude)

Quality Score Interpretation

Score	Grade	What to Tell User
90-100	A	"Your data is excellent quality - ready for analysis"
80-89	B	"Good quality data with minor issues worth noting"
70-79	C	"Moderate quality - address missing values before critical analysis"
60-69	D	"Significant quality issues - recommend data cleaning first"
<60	F	"Critical issues - data needs substantial cleaning"
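The score bands above map directly to a grade and a message. A minimal sketch, assuming a hypothetical helper `grade_quality` (not a function exposed by analyze_csv.py) and a score on a 0-100 scale:

```python
from typing import Tuple

# (lower bound, grade, message) taken from the table above, highest band first.
_BANDS = [
    (90, "A", "Your data is excellent quality - ready for analysis"),
    (80, "B", "Good quality data with minor issues worth noting"),
    (70, "C", "Moderate quality - address missing values before critical analysis"),
    (60, "D", "Significant quality issues - recommend data cleaning first"),
]


def grade_quality(score: float) -> Tuple[str, str]:
    """Map a 0-100 quality score to (grade, message)."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    for floor, grade, message in _BANDS:
        if score >= floor:
            return grade, message
    return "F", "Critical issues - data needs substantial cleaning"
```

Fractional scores such as 89.5 fall into the band whose lower bound they reach, so 89.5 is a B.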

Correlation Interpretation

|r| Value	Strength	What to Say
0.9 - 1.0	Very Strong	"X and Y are very strongly related - almost deterministic"
0.7 - 0.9	Strong	"X and Y have a strong relationship - X could help predict Y"
0.5 - 0.7	Moderate	"X and Y are moderately correlated - some predictive value"
0.3 - 0.5	Weak	"X and Y have a weak relationship - limited predictive power"
0.0 - 0.3	Negligible	"X and Y appear unrelated"
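The bands in the table share their endpoints (0.9 appears in two rows), so any implementation has to pick a side. A minimal sketch, assuming each lower bound is inclusive and using a hypothetical helper `correlation_strength`:

```python
def correlation_strength(r: float) -> str:
    """Classify a correlation coefficient by its absolute value."""
    magnitude = abs(r)
    if magnitude > 1:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")
    # Assumption: lower bounds are inclusive, so |r| = 0.9 counts as Very Strong.
    if magnitude >= 0.9:
        return "Very Strong"
    if magnitude >= 0.7:
        return "Strong"
    if magnitude >= 0.5:
        return "Moderate"
    if magnitude >= 0.3:
        return "Weak"
    return "Negligible"
```

The sign of r is ignored for strength; report the direction (positive or negative) separately when describing the relationship.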

Skewness Interpretation

Skewness	Distribution Shape	Recommendation
< -1	Heavy left tail	"Most values are high, with some very low outliers"
-1 to -0.5	Mild left skew	"Slightly more low outliers than high"
-0.5 to 0.5	Symmetric	"Nicely balanced distribution - good for most analyses"
0.5 to 1	Mild right skew	"Slightly more high outliers than low"
> 1	Heavy right tail	"Most values are low, with some very high outliers. Consider log transform for modeling."
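To make the thresholds concrete, the sketch below computes the moment-based skewness (third central moment divided by the 1.5 power of the second) and classifies it. Whether analyze_csv.py uses this estimator or a bias-adjusted one is an assumption; `skewness` and `describe_skew` are hypothetical helper names.

```python
from typing import Sequence


def skewness(values: Sequence[float]) -> float:
    """Moment-based sample skewness: m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    if m2 == 0:
        return 0.0  # constant column: treat as symmetric
    return m3 / m2 ** 1.5


def describe_skew(skew: float) -> str:
    """Map a skewness value to the distribution shape from the table above."""
    if skew < -1:
        return "Heavy left tail"
    if skew < -0.5:
        return "Mild left skew"
    if skew <= 0.5:
        return "Symmetric"
    if skew <= 1:
        return "Mild right skew"
    return "Heavy right tail"
```

A column such as `[1, 1, 1, 1, 10]` has skewness 1.5 under this definition and is therefore reported as a heavy right tail.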

Outlier Assessment

When reporting outliers:

Few outliers (<1%): "A few extreme values that may warrant investigation"
Moderate outliers (1-5%): "Notable outliers - check if they're errors or genuine extremes"
Many outliers (>5%): "High outlier rate suggests either data issues or a non-normal distribution"
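The percentages above need an outlier definition. The sketch below assumes the common 1.5 x IQR rule (the same rule box-plot whiskers conventionally use); the document does not state which rule analyze_csv.py applies, and `outlier_rate` and `assess_outliers` are hypothetical helpers. The "none" level for a zero rate is also an addition for completeness.

```python
import statistics
from typing import Sequence


def outlier_rate(values: Sequence[float]) -> float:
    """Fraction of values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    n_outliers = sum(1 for x in values if x < low or x > high)
    return n_outliers / len(values)


def assess_outliers(rate: float) -> str:
    """Map an outlier rate (0.0-1.0) to the levels listed above."""
    if rate == 0:
        return "none"
    if rate < 0.01:
        return "few"
    if rate <= 0.05:
        return "moderate"
    return "many"
```

With 99 evenly spaced values plus one extreme value, only the extreme value falls outside the fences, giving a rate of 1%.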

Insight Generation Framework

After running analysis, provide insights in this order:

1. Data Overview (Always)

"Your dataset has [rows] records and [cols] columns:
- [n] numeric columns: [list top 3]
- [n] categorical columns: [list top 3]
- Data quality score: [score]/100 ([grade])"
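The template above can be filled mechanically once the summary values are known. A minimal sketch, assuming a hypothetical helper `format_overview`; its parameter names are illustrative and do not reflect the actual fields produced by analyze_csv.py:

```python
from typing import List


def format_overview(rows: int, cols: int, numeric: List[str],
                    categorical: List[str], score: int, grade: str) -> str:
    """Fill the data-overview template, listing at most three columns per type."""
    return (
        f"Your dataset has {rows} records and {cols} columns:\n"
        f"- {len(numeric)} numeric columns: {', '.join(numeric[:3])}\n"
        f"- {len(categorical)} categorical columns: {', '.join(categorical[:3])}\n"
        f"- Data quality score: {score}/100 ({grade})"
    )
```

Calling it with four numeric column names prints the count as 4 but lists only the first three, matching the "list top 3" placeholder.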

2. Key Findings (Pick most relevant)

If quality issues exist:

"I noticed some data quality concerns:
- [X]% missing values in [column] - [recommend: drop/impute/investigate]
- [N] duplicate rows detected - [recommend: keep first/remove all/investigate]"

If strong correlations found:

"Interesting relationships I found:
- [col1] and [col2] are strongly correlated (r=[value]) - [interpretation]
- This suggests [actionable insight]"

If outliers detected:

"I detected outliers in [columns]:
- [column]: [n] values beyond normal range ([min outlier] to [max outlier])
- These could be [data errors / genuine extremes / worth investigating]"

If skewed distributions:

"[Column] has a [right/left]-skewed distribution:
- Most values cluster around [median]
- But there are extreme values up to [max]
- For modeling, consider [log transform / robust methods]"

3. Recommendations (Based on findings)

Finding	Recommendation
Missing >20% in column	"Consider dropping this column or investigating why it's missing"
Missing <5% scattered	"Safe to impute with median (numeric) or mode (categorical)"
High correlation (|r| > 0.9)	"These columns may be redundant - consider keeping only one"
Many outliers	"Use robust statistics (median instead of mean) or investigate data collection"
Highly skewed	"Consider a log transform (for right-skewed, positive values) before linear modeling"
Low quality score	"Prioritize data cleaning before analysis"
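The table can be applied as a set of threshold checks. A minimal sketch with a hypothetical helper `recommend`; the cut-offs for "many outliers" (>5%), "highly skewed" (|skewness| > 1), and "low quality score" (<70) are assumptions borrowed from the interpretation tables earlier in this guide, since this table does not define them:

```python
from typing import Dict, List


def recommend(missing_pct: Dict[str, float], max_abs_corr: float = 0.0,
              outlier_rate: float = 0.0, max_abs_skew: float = 0.0,
              quality_score: float = 100.0) -> List[str]:
    """Turn findings into recommendations, following the table above."""
    recs = []
    for col, pct in missing_pct.items():
        if pct > 20:
            recs.append(f"{col}: consider dropping this column or investigating why it's missing")
        elif 0 < pct < 5:
            recs.append(f"{col}: safe to impute with median (numeric) or mode (categorical)")
        # The table is silent on 5-20% missing, so nothing is suggested for that range.
    if max_abs_corr > 0.9:
        recs.append("Highly correlated columns may be redundant - consider keeping only one")
    if outlier_rate > 0.05:  # assumed threshold for "many outliers"
        recs.append("Use robust statistics (median instead of mean) or investigate data collection")
    if max_abs_skew > 1:  # assumed threshold for "highly skewed"
        recs.append("Consider a log transform before linear modeling")
    if quality_score < 70:  # assumed threshold for "low quality score"
        recs.append("Prioritize data cleaning before analysis")
    return recs
```

Passing `missing_pct` as a mapping from column name to percentage keeps the per-column advice separate from the dataset-level checks.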

Multi-Chart Dashboard Requests

When user asks for a "dashboard" or "comprehensive view":

bash

# Generate all visualizations
python3 analyze_csv.py data.csv --format html --max-charts 10

Then present charts in this order:

1. overview_dashboard.png - "Here's your data at a glance"
2. correlation_heatmap.png - "Key relationships between variables"
3. numeric_distributions.png - "How your numeric data is distributed"
4. box_plots.png - "Outlier analysis"
5. categorical_distributions.png - "Category breakdowns" (if applicable)
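Not every run produces every chart, so the presentation order is best applied as a filter over whatever files exist. A minimal sketch, assuming a hypothetical helper `order_charts` that receives the generated file names:

```python
from typing import Iterable, List

# Dashboard presentation order for a "dashboard" or "comprehensive view" request.
PRESENTATION_ORDER = [
    "overview_dashboard.png",
    "correlation_heatmap.png",
    "numeric_distributions.png",
    "box_plots.png",
    "categorical_distributions.png",
]


def order_charts(available: Iterable[str]) -> List[str]:
    """Return the available charts in presentation order, skipping missing ones."""
    present = set(available)
    return [name for name in PRESENTATION_ORDER if name in present]
```

Charts outside the five-item list (for example time_series.png) are left out by this sketch and can be offered afterwards if relevant.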

Command Reference

Basic Analysis

bash

python3 analyze_csv.py data.csv

Full Report with All Charts

bash

python3 analyze_csv.py data.csv --format markdown --max-charts 10

Quick Analysis (No Charts)

bash

python3 analyze_csv.py data.csv --no-charts

Large Files (>100MB)

bash

python3 analyze_csv.py huge.csv --sample 50000

Specific Date Columns

bash

python3 analyze_csv.py data.csv --date-columns created_at updated_at

JSON for Programmatic Use

bash

python3 analyze_csv.py data.csv --format json --no-charts
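For programmatic use, the command can be wrapped in a small Python helper. This is a sketch under a loud assumption: that `--format json` prints a single JSON document to stdout. The guide does not say where the JSON goes, so check the script's actual behaviour first. `build_command` and `run_analysis` are hypothetical names; the flags are the ones listed in this command reference.

```python
import json
import subprocess
from typing import List, Optional


def build_command(csv_path: str, script: str = "analyze_csv.py",
                  sample: Optional[int] = None) -> List[str]:
    """Build the argument list for a JSON, chart-free run."""
    cmd = ["python3", script, csv_path, "--format", "json", "--no-charts"]
    if sample is not None:
        cmd += ["--sample", str(sample)]
    return cmd


def run_analysis(csv_path: str, **kwargs) -> dict:
    """Run the analyzer and parse its output.

    ASSUMPTION: the report is written to stdout as one JSON document.
    """
    result = subprocess.run(build_command(csv_path, **kwargs),
                            capture_output=True, text=True, check=True)
    return json.loads(result.stdout)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.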

Custom Output Location

bash

python3 analyze_csv.py data.csv --output-dir /path/to/project/.tmp/analysis

Chart Descriptions (For Explaining to Users)

Chart	When to Show	How to Describe
overview_dashboard.png	Always for first look	"Here's a bird's eye view of your data"
missing_values.png	If missing data exists	"This shows where your data has gaps"
numeric_distributions.png	When exploring distributions	"This shows how your numeric values are spread out"
box_plots.png	When checking for outliers	"The dots beyond the whiskers are potential outliers"
correlation_heatmap.png	When exploring relationships	"Darker colors = stronger relationships"
categorical_distributions.png	For category analysis	"This shows the breakdown of your categories"
time_series.png	For temporal data	"Here's how your data changes over time"
pairplot.png	For multivariate exploration	"Each cell shows how two variables relate"
violin_plot.png	Comparing groups	"This shows how distributions differ across groups"

Common User Questions → Actions

User Says	Action
"Analyze this CSV"	Run full analysis, show overview + key insights
"Is my data clean?"	Focus on quality_score, missing values, duplicates
"Find patterns"	Show correlation_heatmap, highlight strong correlations
"Are there outliers?"	Show box_plots, list outlier counts per column
"Compare X across Y"	Generate violin_plot for numeric X vs categorical Y
"Show me trends"	Generate time_series if datetime column exists
"Create a dashboard"	Generate all charts, present organized summary
"What should I clean?"	List columns with missing >5%, duplicates, outliers

Output Locations

Charts are saved to:

Default: ~/.claude/skills/csv-analyzer/scripts/.tmp/csv_analysis/

Custom: use --output-dir /path/to/project/.tmp/analysis

Always copy charts to user's project .tmp for visibility:

bash

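# Create the destination first; cp fails if the directory does not exist
mkdir -p /path/to/project/.tmp/csv_analysis
cp ~/.claude/skills/csv-analyzer/scripts/.tmp/csv_analysis/*.png /path/to/project/.tmp/csv_analysis/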
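The same copy step can be done from Python when the surrounding workflow is already scripted. A minimal sketch with a hypothetical helper `copy_charts`; it creates the destination directory if needed and copies only PNG files:

```python
import shutil
from pathlib import Path
from typing import List, Union


def copy_charts(src_dir: Union[str, Path], dest_dir: Union[str, Path]) -> List[str]:
    """Copy every PNG from src_dir to dest_dir and return the copied file names."""
    src = Path(src_dir).expanduser()
    dest = Path(dest_dir).expanduser()
    dest.mkdir(parents=True, exist_ok=True)  # make sure the target exists
    copied = []
    for png in sorted(src.glob("*.png")):
        shutil.copy2(png, dest / png.name)
        copied.append(png.name)
    return copied
```

For example, `copy_charts("~/.claude/skills/csv-analyzer/scripts/.tmp/csv_analysis", "/path/to/project/.tmp/csv_analysis")` mirrors the shell command above.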

Cost

Free - runs entirely locally using pandas, matplotlib, seaborn, scipy.

Dependencies

bash

pip install pandas matplotlib seaborn scipy numpy
