csv-analyzer
CSV Analyzer
Overview
Comprehensive CSV data analysis and visualization engine. Run the script, then use this guide to interpret results and provide insights to users.
Quick Start
```bash
cd ~/.claude/skills/csv-analyzer/scripts
export $(grep -v '^#' /path/to/project/.env | xargs 2>/dev/null)
python3 analyze_csv.py /path/to/data.csv
```
Chart Selection Decision Tree
IMPORTANT: Choose charts based on what the user needs to understand:
```
What is the user trying to understand?
│
├── "What does my data look like?" (Overview)
│   └── Run with defaults → overview_dashboard.png
│
├── "Is my data clean?" (Quality)
│   └── Check: quality_score, missing_values, duplicates
│       └── Show: missing_values.png if problems exist
│
├── "What's the distribution?" (Single Variable)
│   ├── Numeric → numeric_distributions.png (histogram + KDE)
│   ├── Categorical → categorical_distributions.png (bar chart)
│   └── Time-based → time_series.png
│
├── "Are there outliers?" (Anomalies)
│   └── box_plots.png → points beyond whiskers are outliers
│
├── "How are variables related?" (Relationships)
│   ├── 2 numeric vars → correlation_heatmap.png
│   ├── 2-6 numeric vars → pairplot.png (scatter matrix)
│   ├── Numeric vs Categorical → violin_plot.png
│   └── All numeric → correlation_heatmap.png
│
└── "Can I predict X from Y?" (Predictive)
    └── correlation_heatmap.png → |r| > 0.5 suggests predictive power
```
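The tree above can also be read as a plain lookup from the user's intent to a chart filename. The sketch below is only an illustration: the names `CHART_FOR` and `chart_for`, and the intent/detail keys, are hypothetical and are not part of `analyze_csv.py`. Only the chart filenames come from the tree.

```python
# Hypothetical lookup table transcribed from the decision tree above.
# The keys are illustrative labels; only the filenames come from the guide.
CHART_FOR = {
    "overview": {None: "overview_dashboard.png"},
    "quality": {None: "missing_values.png"},  # show only if problems exist
    "distribution": {
        "numeric": "numeric_distributions.png",
        "categorical": "categorical_distributions.png",
        "time": "time_series.png",
    },
    "outliers": {None: "box_plots.png"},
    "relationships": {
        "two_numeric": "correlation_heatmap.png",
        "two_to_six_numeric": "pairplot.png",
        "numeric_vs_categorical": "violin_plot.png",
        "all_numeric": "correlation_heatmap.png",
    },
    "predictive": {None: "correlation_heatmap.png"},
}


def chart_for(intent, detail=None):
    """Return the chart filename for an intent, or None if the tree has no match."""
    return CHART_FOR.get(intent, {}).get(detail)


print(chart_for("distribution", "time"))
print(chart_for("relationships", "numeric_vs_categorical"))
```

Returning `None` for an unmatched intent keeps the caller responsible for falling back to the default overview.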
How to Interpret Results (For Claude)
Quality Score Interpretation
| Score | Grade | What to Tell User |
|---|---|---|
| 90-100 | A | "Your data is excellent quality - ready for analysis" |
| 80-89 | B | "Good quality data with minor issues worth noting" |
| 70-79 | C | "Moderate quality - address missing values before critical analysis" |
| 60-69 | D | "Significant quality issues - recommend data cleaning first" |
| <60 | F | "Critical issues - data needs substantial cleaning" |
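As a minimal sketch, the table above can be applied mechanically. The helper name `grade_quality` is hypothetical. Because the table lists whole-number bands, this sketch assumes a fractional score such as 89.5 falls into the lower band (B).

```python
def grade_quality(score):
    """Map a 0-100 quality score to (grade, message) using the table above.

    Assumption: each band starts at its lower bound, so 89.5 is graded B.
    """
    if not 0 <= score <= 100:
        raise ValueError("quality score must be between 0 and 100")
    bands = [
        (90, "A", "Your data is excellent quality - ready for analysis"),
        (80, "B", "Good quality data with minor issues worth noting"),
        (70, "C", "Moderate quality - address missing values before critical analysis"),
        (60, "D", "Significant quality issues - recommend data cleaning first"),
    ]
    for floor, grade, message in bands:
        if score >= floor:
            return grade, message
    return "F", "Critical issues - data needs substantial cleaning"


grade, message = grade_quality(84)
print(f"{grade}: {message}")
```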
Correlation Interpretation
| \|r\| Value | Strength | What to Say |
|---|---|---|
| 0.9 - 1.0 | Very Strong | "X and Y are very strongly related - almost deterministic" |
| 0.7 - 0.9 | Strong | "X and Y have a strong relationship - X could help predict Y" |
| 0.5 - 0.7 | Moderate | "X and Y are moderately correlated - some predictive value" |
| 0.3 - 0.5 | Weak | "X and Y have a weak relationship - limited predictive power" |
| 0.0 - 0.3 | Negligible | "X and Y appear unrelated" |
Sign matters:
- Positive: "As X increases, Y tends to increase"
- Negative: "As X increases, Y tends to decrease"
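The strength bands and the sign rule can be combined in one small helper. This is a sketch under stated assumptions: `describe_correlation` is a hypothetical name, and since the table's ranges share their endpoints (0.9, 0.7, ...), the sketch assigns each boundary value to the stronger band.

```python
def describe_correlation(r, x="X", y="Y"):
    """Return (strength, direction) for a correlation coefficient r in [-1, 1].

    Assumption: boundary values such as 0.7 belong to the stronger band.
    """
    if not -1 <= r <= 1:
        raise ValueError("r must be between -1 and 1")
    magnitude = abs(r)
    if magnitude >= 0.9:
        strength = "Very Strong"
    elif magnitude >= 0.7:
        strength = "Strong"
    elif magnitude >= 0.5:
        strength = "Moderate"
    elif magnitude >= 0.3:
        strength = "Weak"
    else:
        strength = "Negligible"

    if r > 0:
        direction = f"As {x} increases, {y} tends to increase"
    elif r < 0:
        direction = f"As {x} increases, {y} tends to decrease"
    else:
        direction = f"{x} and {y} show no linear trend"
    return strength, direction


print(describe_correlation(-0.82, x="price", y="demand"))
```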
Skewness Interpretation
| Skewness | Distribution Shape | Recommendation |
|---|---|---|
| < -1 | Heavy left tail | "Most values are high, with some very low outliers" |
| -1 to -0.5 | Mild left skew | "Slightly more low outliers than high" |
| -0.5 to 0.5 | Symmetric | "Nicely balanced distribution - good for most analyses" |
| 0.5 to 1 | Mild right skew | "Slightly more high outliers than low" |
| > 1 | Heavy right tail | "Most values are low, with some very high outliers. Consider log transform for modeling." |
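To make the skewness bands concrete, here is a standard-library sketch. It computes the population (biased) Fisher-Pearson coefficient, m3 / m2^1.5. This guide does not say which estimator `analyze_csv.py` uses, so the value may differ slightly from the script's output. Both function names are hypothetical.

```python
def sample_skewness(values):
    """Biased Fisher-Pearson skewness: m3 / m2**1.5 (an assumed estimator)."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least 3 values")
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        return 0.0  # constant column: treat as symmetric
    return m3 / m2 ** 1.5


def classify_skewness(skew):
    """Map a skewness value to the distribution shape from the table above.

    Assumption: the shared endpoints -0.5 and 0.5 count as symmetric.
    """
    if skew < -1:
        return "Heavy left tail"
    if skew < -0.5:
        return "Mild left skew"
    if skew <= 0.5:
        return "Symmetric"
    if skew <= 1:
        return "Mild right skew"
    return "Heavy right tail"


skew = sample_skewness([1, 1, 1, 1, 10])
print(round(skew, 3), classify_skewness(skew))
```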
Outlier Assessment
When reporting outliers:
- Few outliers (<1%): "A few extreme values that may warrant investigation"
- Moderate outliers (1-5%): "Notable outliers - check if they're errors or genuine extremes"
- Many outliers (>5%): "High outlier rate suggests either data issues or a non-normal distribution"
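The outlier rate itself can be estimated with the usual box-plot rule. The sketch below assumes outliers are values more than 1.5 × IQR beyond the quartiles, matching the "points beyond whiskers" convention. The script's actual detection method is not documented here, and the function names are hypothetical. Returning `None` when no outliers are found is an addition of this sketch, since the list above has no wording for that case.

```python
import statistics


def outlier_rate(values, k=1.5):
    """Fraction of values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < low or v > high]
    return len(outliers) / len(values)


def describe_outliers(rate):
    """Map an outlier rate (0-1) to the wording from the list above."""
    percent = rate * 100
    if percent == 0:
        return None  # nothing to report
    if percent < 1:
        return "A few extreme values that may warrant investigation"
    if percent <= 5:
        return "Notable outliers - check if they're errors or genuine extremes"
    return "High outlier rate suggests either data issues or a non-normal distribution"


data = list(range(1, 100)) + [1000]
print(describe_outliers(outlier_rate(data)))
```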
Insight Generation Framework
After running analysis, provide insights in this order:
1. Data Overview (Always)
"Your dataset has [rows] records and [cols] columns:
- [n] numeric columns: [list top 3]
- [n] categorical columns: [list top 3]
- Data quality score: [score]/100 ([grade])"
2. Key Findings (Pick most relevant)
If quality issues exist:
"I noticed some data quality concerns:
- [X]% missing values in [column] - [recommend: drop/impute/investigate]
- [N] duplicate rows detected - [recommend: keep first/remove all/investigate]"
If strong correlations found:
"Interesting relationships I found:
- [col1] and [col2] are strongly correlated (r=[value]) - [interpretation]
- This suggests [actionable insight]"
If outliers detected:
"I detected outliers in [columns]:
- [column]: [n] values beyond normal range ([min outlier] to [max outlier])
- These could be [data errors / genuine extremes / worth investigating]"
If skewed distributions:
"[Column] has a [right/left]-skewed distribution:
- Most values cluster around [median]
- But there are extreme values up to [max]
- For modeling, consider [log transform / robust methods]"
3. Recommendations (Based on findings)
| Finding | Recommendation |
|---|---|
| Missing >20% in column | "Consider dropping this column or investigating why it's missing" |
| Missing <5% scattered | "Safe to impute with median (numeric) or mode (categorical)" |
| High correlation (>0.9) | "These columns may be redundant - consider keeping only one" |
| Many outliers | "Use robust statistics (median instead of mean) or investigate data collection" |
| Highly skewed | "Apply log transform before linear modeling" |
| Low quality score | "Prioritize data cleaning before analysis" |
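The table can be turned into a rule list, sketched below. Several thresholds are assumptions, because the table itself gives no numbers for them: "many outliers" is taken as more than 5% (from the outlier section), "highly skewed" as |skewness| > 1 (from the skewness table), and "low quality score" as below 70. The function name `recommend` is hypothetical. The table gives no advice for a column with 5-20% missing values, so the sketch returns nothing for that range.

```python
def recommend(missing_pct=0.0, max_abs_corr=0.0, outlier_pct=0.0,
              skewness=0.0, quality_score=100.0):
    """Return the recommendations from the table that apply to one column.

    Thresholds for outliers, skewness and quality score are assumptions.
    """
    recs = []
    if missing_pct > 20:
        recs.append("Consider dropping this column or investigating why it's missing")
    elif 0 < missing_pct < 5:
        recs.append("Safe to impute with median (numeric) or mode (categorical)")
    if max_abs_corr > 0.9:
        recs.append("These columns may be redundant - consider keeping only one")
    if outlier_pct > 5:  # assumed cut-off for "many outliers"
        recs.append("Use robust statistics (median instead of mean) or investigate data collection")
    if abs(skewness) > 1:  # assumed cut-off for "highly skewed"
        recs.append("Apply log transform before linear modeling")
    if quality_score < 70:  # assumed cut-off for "low quality score"
        recs.append("Prioritize data cleaning before analysis")
    return recs


for line in recommend(missing_pct=3, max_abs_corr=0.95):
    print("-", line)
```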
Multi-Chart Dashboard Requests
When the user asks for a "dashboard" or "comprehensive view":
```bash
# Generate all visualizations
python3 analyze_csv.py data.csv --format html --max-charts 10
```
Then present charts in this order:
1. **overview_dashboard.png** - "Here's your data at a glance"
2. **correlation_heatmap.png** - "Key relationships between variables"
3. **numeric_distributions.png** - "How your numeric data is distributed"
4. **box_plots.png** - "Outlier analysis"
5. **categorical_distributions.png** - "Category breakdowns" (if applicable)
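Since some charts are only generated when applicable, the presentation order can be applied to whatever files actually exist. A minimal sketch follows. The helper name `charts_to_present` is hypothetical, and it assumes the charts are written as PNG files with exactly the names listed above.

```python
from pathlib import Path

# Presentation order taken from the numbered list above.
PRESENTATION_ORDER = [
    "overview_dashboard.png",
    "correlation_heatmap.png",
    "numeric_distributions.png",
    "box_plots.png",
    "categorical_distributions.png",
]


def charts_to_present(output_dir):
    """Return the existing charts in output_dir, in presentation order."""
    directory = Path(output_dir).expanduser()
    return [
        directory / name
        for name in PRESENTATION_ORDER
        if (directory / name).is_file()
    ]


for chart in charts_to_present("~/.claude/skills/csv-analyzer/scripts/.tmp/csv_analysis"):
    print(chart.name)
```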
Command Reference
Basic Analysis
```bash
python3 analyze_csv.py data.csv
```
Full Report with All Charts
```bash
python3 analyze_csv.py data.csv --format markdown --max-charts 10
```
Quick Analysis (No Charts)
```bash
python3 analyze_csv.py data.csv --no-charts
```
Large Files (>100MB)
```bash
python3 analyze_csv.py huge.csv --sample 50000
```
Specific Date Columns
```bash
python3 analyze_csv.py data.csv --date-columns created_at updated_at
```
JSON for Programmatic Use
```bash
python3 analyze_csv.py data.csv --format json --no-charts
```
Custom Output Location
```bash
python3 analyze_csv.py data.csv --output-dir /path/to/project/.tmp/analysis
```
Chart Descriptions (For Explaining to Users)
| Chart | When to Show | How to Describe |
|---|---|---|
| overview_dashboard.png | Always for first look | "Here's a bird's eye view of your data" |
| missing_values.png | If missing data exists | "This shows where your data has gaps" |
| numeric_distributions.png | When exploring distributions | "This shows how your numeric values are spread out" |
| box_plots.png | When checking for outliers | "The dots beyond the whiskers are potential outliers" |
| correlation_heatmap.png | When exploring relationships | "Darker colors = stronger relationships" |
| categorical_distributions.png | For category analysis | "This shows the breakdown of your categories" |
| time_series.png | For temporal data | "Here's how your data changes over time" |
| pairplot.png | For multivariate exploration | "Each cell shows how two variables relate" |
| violin_plot.png | Comparing groups | "This shows how distributions differ across groups" |
Common User Questions → Actions
| User Says | Action |
|---|---|
| "Analyze this CSV" | Run full analysis, show overview + key insights |
| "Is my data clean?" | Focus on quality_score, missing values, duplicates |
| "Find patterns" | Show correlation_heatmap, highlight strong correlations |
| "Are there outliers?" | Show box_plots, list outlier counts per column |
| "Compare X across Y" | Generate violin_plot for numeric X vs categorical Y |
| "Show me trends" | Generate time_series if datetime column exists |
| "Create a dashboard" | Generate all charts, present organized summary |
| "What should I clean?" | List columns with missing >5%, duplicates, outliers |
Output Locations
Charts are saved to:
- Default: ~/.claude/skills/csv-analyzer/scripts/.tmp/csv_analysis/
- Custom: Use --output-dir /path/to/project/.tmp/analysis
Always copy charts to the user's project .tmp for visibility:
```bash
cp ~/.claude/skills/csv-analyzer/scripts/.tmp/csv_analysis/*.png /path/to/project/.tmp/csv_analysis/
```
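The `cp` command fails if the destination directory does not exist yet. A Python sketch of the same copy step that creates the directory first is shown below. The helper name `copy_charts` is hypothetical and the paths are the placeholders used above.

```python
import shutil
from pathlib import Path


def copy_charts(source_dir, dest_dir):
    """Copy every *.png from source_dir to dest_dir, creating dest_dir if needed."""
    source = Path(source_dir).expanduser()
    dest = Path(dest_dir).expanduser()
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for chart in sorted(source.glob("*.png")):
        # copy2 also preserves file timestamps where the platform allows it
        copied.append(Path(shutil.copy2(chart, dest / chart.name)))
    return copied
```

A call would look like `copy_charts("~/.claude/skills/csv-analyzer/scripts/.tmp/csv_analysis", "/path/to/project/.tmp/csv_analysis")`.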
Cost
Free - runs entirely locally using pandas, matplotlib, seaborn, scipy.
Dependencies
```bash
pip install pandas matplotlib seaborn scipy numpy
```
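Before running the script, it can help to check which of these packages are importable in the current environment. The sketch below uses only the standard library. The helper name `missing_packages` is hypothetical and is not part of `analyze_csv.py`.

```python
import importlib.util

# Package names taken from the pip command above.
REQUIRED = ["pandas", "matplotlib", "seaborn", "scipy", "numpy"]


def missing_packages(names=REQUIRED):
    """Return the top-level packages that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages()
if missing:
    print("Missing packages:", " ".join(missing))
else:
    print("All dependencies are available")
```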