csv-analyzer


CSV Analyzer

Overview

Comprehensive CSV data analysis and visualization engine. Run the script, then use this guide to interpret results and provide insights to users.

Quick Start

bash
cd ~/.claude/skills/csv-analyzer/scripts
export $(grep -v '^#' /path/to/project/.env | xargs 2>/dev/null)
python3 analyze_csv.py /path/to/data.csv

Chart Selection Decision Tree

IMPORTANT: Choose charts based on what the user needs to understand:
What is the user trying to understand?
├── "What does my data look like?" (Overview)
│   └── Run with defaults → overview_dashboard.png
├── "Is my data clean?" (Quality)
│   ├── Check: quality_score, missing_values, duplicates
│   └── Show: missing_values.png if problems exist
├── "What's the distribution?" (Single Variable)
│   ├── Numeric → numeric_distributions.png (histogram + KDE)
│   ├── Categorical → categorical_distributions.png (bar chart)
│   └── Time-based → time_series.png
├── "Are there outliers?" (Anomalies)
│   └── box_plots.png → points beyond whiskers are outliers
├── "How are variables related?" (Relationships)
│   ├── 2 numeric vars → correlation_heatmap.png
│   ├── 3-6 numeric vars → pairplot.png (scatter matrix)
│   ├── Numeric vs Categorical → violin_plot.png
│   └── All numeric → correlation_heatmap.png
└── "Can I predict X from Y?" (Predictive)
    └── correlation_heatmap.png → |r| > 0.5 suggests predictive power
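
The single-variable branch of the tree can be sketched in Python. This is only an illustration: how `analyze_csv.py` actually detects column types is not stated in this guide, so `infer_kind` and `distribution_chart` are hypothetical helpers, and the rule "all values parse as float, else all parse as ISO dates, else categorical" is an assumption.

```python
from datetime import datetime

# Chart file names come from the decision tree above.
CHART_FOR_KIND = {
    "numeric": "numeric_distributions.png",
    "categorical": "categorical_distributions.png",
    "time": "time_series.png",
}


def _all_parse(values, parser):
    """Return True if every value is accepted by the parser."""
    try:
        for value in values:
            parser(value)
    except ValueError:
        return False
    return True


def infer_kind(values):
    """Classify raw CSV strings as numeric, time, or categorical (assumed rule)."""
    non_empty = [v.strip() for v in values if v and v.strip()]
    if not non_empty:
        return "categorical"
    if _all_parse(non_empty, float):
        return "numeric"
    if _all_parse(non_empty, datetime.fromisoformat):
        return "time"
    return "categorical"


def distribution_chart(values):
    """Pick the distribution chart for one column."""
    return CHART_FOR_KIND[infer_kind(values)]
```

For example, `distribution_chart(["2024-01-05", "2024-02-01"])` selects the time-series chart, while a column of color names falls through to the bar chart.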

How to Interpret Results (For Claude)

Quality Score Interpretation

| Score | Grade | What to Tell User |
| --- | --- | --- |
| 90-100 | A | "Your data is excellent quality - ready for analysis" |
| 80-89 | B | "Good quality data with minor issues worth noting" |
| 70-79 | C | "Moderate quality - address missing values before critical analysis" |
| 60-69 | D | "Significant quality issues - recommend data cleaning first" |
| <60 | F | "Critical issues - data needs substantial cleaning" |
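
The score bands above can be expressed as a small lookup. This is a minimal sketch: `grade_quality` is a hypothetical helper (not part of `analyze_csv.py`), and treating fractional scores such as 89.5 as belonging to the lower band is an assumption.

```python
# (minimum score, grade, message) taken from the quality score table.
QUALITY_BANDS = [
    (90, "A", "Your data is excellent quality - ready for analysis"),
    (80, "B", "Good quality data with minor issues worth noting"),
    (70, "C", "Moderate quality - address missing values before critical analysis"),
    (60, "D", "Significant quality issues - recommend data cleaning first"),
]


def grade_quality(score):
    """Map a 0-100 quality score to (grade, message)."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    for floor, grade, message in QUALITY_BANDS:
        if score >= floor:
            return grade, message
    return "F", "Critical issues - data needs substantial cleaning"
```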

Correlation Interpretation

| \|r\| Value | Strength | What to Say |
| --- | --- | --- |
| 0.9 - 1.0 | Very Strong | "X and Y are very strongly related - almost deterministic" |
| 0.7 - 0.9 | Strong | "X and Y have a strong relationship - X could help predict Y" |
| 0.5 - 0.7 | Moderate | "X and Y are moderately correlated - some predictive value" |
| 0.3 - 0.5 | Weak | "X and Y have a weak relationship - limited predictive power" |
| 0.0 - 0.3 | Negligible | "X and Y appear unrelated" |
Sign matters:
  • Positive: "As X increases, Y tends to increase"
  • Negative: "As X increases, Y tends to decrease"
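
A short sketch of how the strength bands and the sign rule combine. `describe_correlation` is a hypothetical helper. The table's bands share their endpoints (0.9 appears in two rows), so this sketch assumes a boundary value belongs to the stronger band.

```python
# (minimum |r|, label) from the correlation table; boundaries go to the stronger band (assumption).
STRENGTH_BANDS = [
    (0.9, "Very Strong"),
    (0.7, "Strong"),
    (0.5, "Moderate"),
    (0.3, "Weak"),
]


def describe_correlation(r, x="X", y="Y"):
    """Return (strength label, direction sentence) for a correlation coefficient."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must be between -1 and 1")
    magnitude = abs(r)
    strength = "Negligible"
    for floor, label in STRENGTH_BANDS:
        if magnitude >= floor:
            strength = label
            break
    if r > 0:
        direction = f"As {x} increases, {y} tends to increase"
    elif r < 0:
        direction = f"As {x} increases, {y} tends to decrease"
    else:
        direction = None  # r == 0 has no direction to report
    return strength, direction
```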

Skewness Interpretation

| Skewness | Distribution Shape | Recommendation |
| --- | --- | --- |
| < -1 | Heavy left tail | "Most values are high, with some very low outliers" |
| -1 to -0.5 | Mild left skew | "Slightly more low outliers than high" |
| -0.5 to 0.5 | Symmetric | "Nicely balanced distribution - good for most analyses" |
| 0.5 to 1 | Mild right skew | "Slightly more high outliers than low" |
| > 1 | Heavy right tail | "Most values are low, with some very high outliers. Consider log transform for modeling." |
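
To make the bands concrete, the sketch below computes a skewness value and classifies it. Which estimator `analyze_csv.py` uses is not stated here. This example assumes the plain moment-based (population) skewness, which can differ slightly from bias-corrected estimators on small samples. Both function names are hypothetical.

```python
def skewness(values):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        raise ValueError("skewness is undefined for constant data")
    return m3 / m2 ** 1.5


def classify_skewness(skew):
    """Map a skewness value to the shape labels in the table above."""
    if skew < -1:
        return "Heavy left tail"
    if skew < -0.5:
        return "Mild left skew"
    if skew <= 0.5:
        return "Symmetric"
    if skew <= 1:
        return "Mild right skew"
    return "Heavy right tail"
```

A column such as `[1, 1, 1, 2, 2, 3, 10]` has most of its mass at low values and one large value, so it lands in the heavy right tail band.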

Outlier Assessment

When reporting outliers:
  • Few outliers (<1%): "A few extreme values that may warrant investigation"
  • Moderate outliers (1-5%): "Notable outliers - check if they're errors or genuine extremes"
  • Many outliers (>5%): "High outlier rate suggests either data issues or a non-normal distribution"
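
The outlier rate behind these messages can be sketched as follows. The detection rule is an assumption: this example uses the common box-plot convention (values more than 1.5 × IQR beyond the quartiles), which matches "points beyond whiskers" but is not confirmed to be what `analyze_csv.py` does. `outlier_report` is a hypothetical helper.

```python
import statistics


def outlier_report(values):
    """Return (outlier count, outlier percentage, message) using the 1.5 * IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    count = sum(1 for v in values if v < low or v > high)
    pct = 100 * count / len(values)
    if count == 0:
        message = None  # nothing to report
    elif pct < 1:
        message = "A few extreme values that may warrant investigation"
    elif pct <= 5:
        message = "Notable outliers - check if they're errors or genuine extremes"
    else:
        message = ("High outlier rate suggests either data issues "
                   "or a non-normal distribution")
    return count, pct, message
```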

Insight Generation Framework

After running analysis, provide insights in this order:

1. Data Overview (Always)

"Your dataset has [rows] records and [cols] columns:
- [n] numeric columns: [list top 3]
- [n] categorical columns: [list top 3]
- Data quality score: [score]/100 ([grade])"
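
The overview template can be filled in mechanically once the numbers are known. In this sketch, `format_overview` and its parameters are hypothetical. How the values are obtained from the script's output is left open.

```python
def format_overview(rows, cols, numeric_cols, categorical_cols, score, grade):
    """Fill in the Data Overview template, listing at most three columns per type."""

    def top3(names):
        return ", ".join(names[:3])

    return "\n".join([
        f"Your dataset has {rows} records and {cols} columns:",
        f"- {len(numeric_cols)} numeric columns: {top3(numeric_cols)}",
        f"- {len(categorical_cols)} categorical columns: {top3(categorical_cols)}",
        f"- Data quality score: {score}/100 ({grade})",
    ])
```

Calling it with four numeric column names still lists only the first three, as the template asks.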

2. Key Findings (Pick most relevant)

If quality issues exist:
"I noticed some data quality concerns:
- [X]% missing values in [column] - [recommend: drop/impute/investigate]
- [N] duplicate rows detected - [recommend: keep first/remove all/investigate]"

If strong correlations found:
"Interesting relationships I found:
- [col1] and [col2] are strongly correlated (r=[value]) - [interpretation]
- This suggests [actionable insight]"

If outliers detected:
"I detected outliers in [columns]:
- [column]: [n] values beyond normal range ([min outlier] to [max outlier])
- These could be [data errors / genuine extremes / worth investigating]"

If skewed distributions:
"[Column] has a [right/left]-skewed distribution:
- Most values cluster around [median]
- But there are extreme values up to [max]
- For modeling, consider [log transform / robust methods]"

3. Recommendations (Based on findings)

| Finding | Recommendation |
| --- | --- |
| Missing >20% in column | "Consider dropping this column or investigating why it's missing" |
| Missing <5% scattered | "Safe to impute with median (numeric) or mode (categorical)" |
| High correlation (>0.9) | "These columns may be redundant - consider keeping only one" |
| Many outliers | "Use robust statistics (median instead of mean) or investigate data collection" |
| Highly skewed | "Apply log transform before linear modeling" |
| Low quality score | "Prioritize data cleaning before analysis" |
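
The table can be read as a set of independent rules. The sketch below encodes them, with several labeled assumptions: "many outliers" is taken as more than 5% (from the outlier assessment), "highly skewed" as skewness above 1 (the band where this guide suggests a log transform), and "low quality score" as below 70, a threshold this guide does not define. `recommend` is a hypothetical helper.

```python
def recommend(missing_pct=0.0, max_abs_corr=0.0, outlier_pct=0.0,
              skewness=0.0, quality_score=100.0):
    """Return the recommendations from the table that apply to the given findings."""
    recs = []
    if missing_pct > 20:
        recs.append("Consider dropping this column or investigating why it's missing")
    elif 0 < missing_pct < 5:
        recs.append("Safe to impute with median (numeric) or mode (categorical)")
    if max_abs_corr > 0.9:
        recs.append("These columns may be redundant - consider keeping only one")
    if outlier_pct > 5:  # assumed meaning of "many outliers"
        recs.append("Use robust statistics (median instead of mean) "
                    "or investigate data collection")
    if skewness > 1:  # assumed meaning of "highly skewed" (heavy right tail)
        recs.append("Apply log transform before linear modeling")
    if quality_score < 70:  # assumed threshold; not defined in this guide
        recs.append("Prioritize data cleaning before analysis")
    return recs
```

Missing rates between 5% and 20% produce no recommendation here because the table does not cover that range.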

Multi-Chart Dashboard Requests

When the user asks for a "dashboard" or "comprehensive view":
bash
# Generate all visualizations
python3 analyze_csv.py data.csv --format html --max-charts 10

Then present charts in this order:
1. **overview_dashboard.png** - "Here's your data at a glance"
2. **correlation_heatmap.png** - "Key relationships between variables"
3. **numeric_distributions.png** - "How your numeric data is distributed"
4. **box_plots.png** - "Outlier analysis"
5. **categorical_distributions.png** - "Category breakdowns" (if applicable)
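
The presentation order can be applied to whatever charts were actually produced. This is a sketch with a hypothetical helper, `order_for_dashboard`. Placing charts that are not in the list (for example time_series.png) at the end in alphabetical order is an assumption, since this guide does not rank them.

```python
# Order taken from the numbered list above.
PRESENTATION_ORDER = [
    "overview_dashboard.png",
    "correlation_heatmap.png",
    "numeric_distributions.png",
    "box_plots.png",
    "categorical_distributions.png",
]


def order_for_dashboard(generated):
    """Sort generated chart file names into the recommended presentation order."""
    available = set(generated)
    ordered = [name for name in PRESENTATION_ORDER if name in available]
    extras = sorted(available - set(PRESENTATION_ORDER))  # unranked charts go last
    return ordered + extras
```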

Command Reference

Basic Analysis

bash
python3 analyze_csv.py data.csv

Full Report with All Charts

bash
python3 analyze_csv.py data.csv --format markdown --max-charts 10

Quick Analysis (No Charts)

bash
python3 analyze_csv.py data.csv --no-charts

Large Files (>100MB)

bash
python3 analyze_csv.py huge.csv --sample 50000

Specific Date Columns

bash
python3 analyze_csv.py data.csv --date-columns created_at updated_at

JSON for Programmatic Use

bash
python3 analyze_csv.py data.csv --format json --no-charts
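
Once the JSON report is available as text, it can be read with the standard library. The report's schema is not documented in this guide. The sketch below assumes that `quality_score`, `missing_values`, and `duplicates` (the names used in the decision tree) appear as top-level keys, which may not match the real output. `quality_fields` is a hypothetical helper.

```python
import json

# Field names taken from the decision tree; their position in the JSON is an assumption.
QUALITY_KEYS = ("quality_score", "missing_values", "duplicates")


def quality_fields(report_json):
    """Extract the quality-related fields from a JSON report string.

    Keys that are absent come back as None rather than raising.
    """
    report = json.loads(report_json)
    return {key: report.get(key) for key in QUALITY_KEYS}
```

Using `.get` keeps the helper tolerant of a report that nests or omits these fields.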

Custom Output Location

bash
python3 analyze_csv.py data.csv --output-dir /path/to/project/.tmp/analysis
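
When the command is assembled from code rather than typed, the flags in this reference can be combined as an argument list. `build_command` is a hypothetical wrapper. It uses only the flags shown above and does not validate their values or run the script.

```python
def build_command(csv_path, fmt=None, max_charts=None, no_charts=False,
                  sample=None, date_columns=(), output_dir=None):
    """Build an argv list for analyze_csv.py from the documented flags."""
    cmd = ["python3", "analyze_csv.py", str(csv_path)]
    if fmt is not None:
        cmd += ["--format", fmt]
    if max_charts is not None:
        cmd += ["--max-charts", str(max_charts)]
    if no_charts:
        cmd.append("--no-charts")
    if sample is not None:
        cmd += ["--sample", str(sample)]
    if date_columns:
        cmd += ["--date-columns", *date_columns]
    if output_dir is not None:
        cmd += ["--output-dir", str(output_dir)]
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the scripts directory, which avoids shell quoting issues with paths that contain spaces.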

Chart Descriptions (For Explaining to Users)

| Chart | When to Show | How to Describe |
| --- | --- | --- |
| overview_dashboard.png | Always for first look | "Here's a bird's eye view of your data" |
| missing_values.png | If missing data exists | "This shows where your data has gaps" |
| numeric_distributions.png | When exploring distributions | "This shows how your numeric values are spread out" |
| box_plots.png | When checking for outliers | "The dots beyond the whiskers are potential outliers" |
| correlation_heatmap.png | When exploring relationships | "Darker colors = stronger relationships" |
| categorical_distributions.png | For category analysis | "This shows the breakdown of your categories" |
| time_series.png | For temporal data | "Here's how your data changes over time" |
| pairplot.png | For multivariate exploration | "Each cell shows how two variables relate" |
| violin_plot.png | Comparing groups | "This shows how distributions differ across groups" |

Common User Questions → Actions

| User Says | Action |
| --- | --- |
| "Analyze this CSV" | Run full analysis, show overview + key insights |
| "Is my data clean?" | Focus on quality_score, missing values, duplicates |
| "Find patterns" | Show correlation_heatmap, highlight strong correlations |
| "Are there outliers?" | Show box_plots, list outlier counts per column |
| "Compare X across Y" | Generate violin_plot for numeric X vs categorical Y |
| "Show me trends" | Generate time_series if datetime column exists |
| "Create a dashboard" | Generate all charts, present organized summary |
| "What should I clean?" | List columns with missing >5%, duplicates, outliers |

Output Locations

Charts are saved to:
  • Default: ~/.claude/skills/csv-analyzer/scripts/.tmp/csv_analysis/
  • Custom: use --output-dir /path/to/project/.tmp/analysis
Always copy charts to the user's project .tmp directory for visibility:
bash
# Create the target directory first; cp fails if it does not exist
mkdir -p /path/to/project/.tmp/csv_analysis
cp ~/.claude/skills/csv-analyzer/scripts/.tmp/csv_analysis/*.png /path/to/project/.tmp/csv_analysis/
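
The same copy step can be done from Python, which also creates the destination directory if it is missing. `copy_charts` is a hypothetical helper. The paths passed to it are whatever source and project directories apply.

```python
import shutil
from pathlib import Path


def copy_charts(src_dir, dest_dir):
    """Copy every .png from src_dir into dest_dir and return the new paths."""
    src = Path(src_dir).expanduser()
    dest = Path(dest_dir).expanduser()
    dest.mkdir(parents=True, exist_ok=True)  # cp alone would fail without this
    copied = []
    for png in sorted(src.glob("*.png")):
        copied.append(Path(shutil.copy2(png, dest)))
    return copied
```

For example, `copy_charts("~/.claude/skills/csv-analyzer/scripts/.tmp/csv_analysis", "/path/to/project/.tmp/csv_analysis")` mirrors the shell command above.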

Cost

Free - runs entirely locally using pandas, matplotlib, seaborn, scipy.

Dependencies

bash
pip install pandas matplotlib seaborn scipy numpy