# Data Analysis Skill

## Overview

This skill analyzes user-uploaded Excel/CSV files using DuckDB — an in-process analytical SQL engine. It supports schema inspection, SQL-based querying, statistical summaries, and result export, all through a single Python script.

## Core Capabilities

- Inspect Excel/CSV file structure (sheets, columns, types, row counts)
- Execute arbitrary SQL queries against uploaded data
- Generate statistical summaries (mean, median, stddev, percentiles, nulls)
- Support multi-sheet Excel workbooks (each sheet becomes a table)
- Export query results to CSV, JSON, or Markdown
- Handle large files efficiently with DuckDB's columnar engine

## Workflow

### Step 1: Understand Requirements

When a user uploads data files and requests analysis, identify:

- **File location**: path(s) to the uploaded Excel/CSV files under `/mnt/user-data/uploads/`
- **Analysis goal**: what insights the user wants (summary, filtering, aggregation, comparison, etc.)
- **Output format**: how results should be presented (table, CSV export, JSON, etc.)
- You do not need to check the folders under `/mnt/user-data`

### Step 2: Inspect File Structure

First, inspect the uploaded file to understand its schema:

```bash
python /mnt/skills/public/data-analysis/scripts/analyze.py \
  --files /mnt/user-data/uploads/data.xlsx \
  --action inspect
```

This returns:

- Sheet names (for Excel) or filename (for CSV)
- Column names, data types, and non-null counts
- Row count per sheet/file
- Sample data (first 5 rows)
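
As a point of reference, the metadata that `inspect` reports for a CSV can be approximated with the Python standard library alone. This is a simplified sketch, not the script's actual code (the real script uses DuckDB and also handles Excel sheets and type inference):

```python
import csv
import io

def inspect_csv(text: str, sample_rows: int = 5) -> dict:
    """Report columns, row count, and a small sample: a toy
    version of the `inspect` action for a single CSV."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)          # first line holds the column names
    rows = list(reader)            # remaining lines are data rows
    return {
        "columns": header,
        "row_count": len(rows),
        "sample": rows[:sample_rows],
    }

info = inspect_csv("category,amount\nA,10\nB,20\nA,5\n")
print(info["columns"])    # ['category', 'amount']
print(info["row_count"])  # 3
```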

### Step 3: Perform Analysis

Based on the schema, construct SQL queries to answer the user's questions.

#### Run SQL Query

```bash
python /mnt/skills/public/data-analysis/scripts/analyze.py \
  --files /mnt/user-data/uploads/data.xlsx \
  --action query \
  --sql "SELECT category, COUNT(*) as count, AVG(amount) as avg_amount FROM Sheet1 GROUP BY category ORDER BY count DESC"
```

#### Generate Statistical Summary

```bash
python /mnt/skills/public/data-analysis/scripts/analyze.py \
  --files /mnt/user-data/uploads/data.xlsx \
  --action summary \
  --table Sheet1
```

This returns for each numeric column: count, mean, std, min, 25%, 50%, 75%, max, null_count. For string columns: count, unique, top value, frequency, null_count.
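
For intuition, the numeric side of this summary maps onto standard descriptive statistics. A stdlib-only sketch for a single column (the script computes these internally; this is not its actual code, and percentiles are omitted here):

```python
import statistics

def numeric_summary(values: list[float]) -> dict:
    """Count, mean, std, min, median, max for one numeric column:
    a subset of what the `summary` action reports."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),   # sample standard deviation
        "min": min(values),
        "50%": statistics.median(values),
        "max": max(values),
    }

s = numeric_summary([10, 20, 30, 40])
print(s)
```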

#### Export Results

```bash
python /mnt/skills/public/data-analysis/scripts/analyze.py \
  --files /mnt/user-data/uploads/data.xlsx \
  --action query \
  --sql "SELECT * FROM Sheet1 WHERE amount > 1000" \
  --output-file /mnt/user-data/outputs/filtered-results.csv
```

Supported output formats (auto-detected from extension):

- `.csv`: Comma-separated values
- `.json`: JSON array of records
- `.md`: Markdown table
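
The extension-based auto-detection can be pictured as a simple suffix lookup. A sketch of the idea (not the script's actual implementation):

```python
from pathlib import Path

# Map a file extension to an export format label.
FORMATS = {".csv": "csv", ".json": "json", ".md": "markdown"}

def detect_format(output_file: str) -> str:
    """Pick the export format from the output path's extension."""
    suffix = Path(output_file).suffix.lower()
    if suffix not in FORMATS:
        raise ValueError(f"unsupported output extension: {suffix}")
    return FORMATS[suffix]

print(detect_format("/mnt/user-data/outputs/filtered-results.csv"))  # csv
```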

## Parameters

| Parameter | Required | Description |
| --- | --- | --- |
| `--files` | Yes | Space-separated paths to Excel/CSV files |
| `--action` | Yes | One of: `inspect`, `query`, `summary` |
| `--sql` | For `query` | SQL query to execute |
| `--table` | For `summary` | Table/sheet name to summarize |
| `--output-file` | No | Path to export results (CSV/JSON/MD) |

> [!NOTE]
> Do NOT read the Python file, just call it with the parameters.
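
When calling the script from code rather than a shell, the argument list can be assembled programmatically. A minimal sketch mirroring the parameter table (the `build_command` helper is hypothetical, not part of the skill):

```python
SCRIPT = "/mnt/skills/public/data-analysis/scripts/analyze.py"

def build_command(files, action, sql=None, table=None, output_file=None):
    """Assemble the argv list for analyze.py, enforcing that
    --sql accompanies query and --table accompanies summary."""
    cmd = ["python", SCRIPT, "--files", *files, "--action", action]
    if action == "query":
        if sql is None:
            raise ValueError("--sql is required for the query action")
        cmd += ["--sql", sql]
    if action == "summary":
        if table is None:
            raise ValueError("--table is required for the summary action")
        cmd += ["--table", table]
    if output_file:
        cmd += ["--output-file", output_file]
    return cmd

cmd = build_command(["/mnt/user-data/uploads/data.xlsx"], "inspect")
print(" ".join(cmd))
```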

## Table Naming Rules

- **Excel files**: Each sheet becomes a table named after the sheet (e.g., `Sheet1`, `Sales`, `Revenue`)
- **CSV files**: Table name is the filename without extension (e.g., `data.csv` → `data`)
- **Multiple files**: All tables from all files are available in the same query context, enabling cross-file joins
- **Special characters**: Sheet/file names with spaces or special characters are auto-sanitized (spaces → underscores). Use double quotes for names that start with numbers or contain special characters, e.g., `"2024_Sales"`
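
The sanitization rule can be sketched as follows. This illustrates the stated behavior (special characters to underscores, quoting for names that start with a digit), not the script's exact implementation:

```python
import re

def table_name(raw: str) -> str:
    """Sanitize a sheet/file name into a table identifier:
    spaces and other non-word characters become underscores."""
    return re.sub(r"\W", "_", raw)

def quoted_if_needed(name: str) -> str:
    """Double-quote identifiers that start with a digit so they
    can be referenced in SQL (e.g. a sheet named '2024 Sales')."""
    if name[0].isdigit():
        return f'"{name}"'
    return name

print(quoted_if_needed(table_name("2024 Sales")))  # "2024_Sales"
```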

## Analysis Patterns

### Basic Exploration

```sql
-- Row count
SELECT COUNT(*) FROM Sheet1

-- Distinct values in a column
SELECT DISTINCT category FROM Sheet1

-- Value distribution
SELECT category, COUNT(*) as cnt FROM Sheet1 GROUP BY category ORDER BY cnt DESC

-- Date range
SELECT MIN(date_col), MAX(date_col) FROM Sheet1
```

### Aggregation & Grouping

```sql
-- Revenue by category and month
SELECT category, DATE_TRUNC('month', order_date) as month,
       SUM(revenue) as total_revenue
FROM Sales
GROUP BY category, month
ORDER BY month, total_revenue DESC

-- Top 10 customers by spend
SELECT customer_name, SUM(amount) as total_spend
FROM Orders GROUP BY customer_name
ORDER BY total_spend DESC LIMIT 10
```

### Cross-file Joins

```sql
-- Join sales with customer info from different files
SELECT s.order_id, s.amount, c.customer_name, c.region
FROM sales s
JOIN customers c ON s.customer_id = c.id
WHERE s.amount > 500
```
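
Conceptually, this join builds a lookup on the join key and probes it row by row (a hash join). A toy Python equivalent, with hard-coded rows standing in for the two files:

```python
# Toy rows standing in for the two uploaded files.
sales = [
    {"order_id": 1, "amount": 800, "customer_id": 10},
    {"order_id": 2, "amount": 300, "customer_id": 11},
]
customers = [
    {"id": 10, "customer_name": "Acme", "region": "West"},
    {"id": 11, "customer_name": "Bolt", "region": "East"},
]

# Build a lookup on the join key, then probe it for each sales row.
by_id = {c["id"]: c for c in customers}
joined = [
    {**s, **by_id[s["customer_id"]]}
    for s in sales
    if s["amount"] > 500 and s["customer_id"] in by_id
]
print(joined[0]["customer_name"])  # Acme
```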

### Window Functions

```sql
-- Running total and rank
SELECT order_date, amount,
       SUM(amount) OVER (ORDER BY order_date) as running_total,
       RANK() OVER (ORDER BY amount DESC) as amount_rank
FROM Sales
```
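
The running-total half of this query is just a cumulative sum over date order; `itertools.accumulate` expresses the same idea:

```python
from itertools import accumulate

# Amounts already sorted by order_date, as OVER (ORDER BY order_date) assumes.
amounts = [100, 250, 50, 400]
running_total = list(accumulate(amounts))
print(running_total)  # [100, 350, 400, 800]
```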

### Pivot-style Analysis

```sql
-- Pivot: monthly revenue by category
SELECT category,
       SUM(CASE WHEN MONTH(date) = 1 THEN revenue END) as Jan,
       SUM(CASE WHEN MONTH(date) = 2 THEN revenue END) as Feb,
       SUM(CASE WHEN MONTH(date) = 3 THEN revenue END) as Mar
FROM Sales
GROUP BY category
```

## Complete Example

User uploads `sales_2024.xlsx` (with sheets: `Orders`, `Products`, `Customers`) and asks: "Analyze my sales data — show top products by revenue and monthly trends."

### Step 1: Inspect the file

```bash
python /mnt/skills/public/data-analysis/scripts/analyze.py \
  --files /mnt/user-data/uploads/sales_2024.xlsx \
  --action inspect
```

### Step 2: Top products by revenue

```bash
python /mnt/skills/public/data-analysis/scripts/analyze.py \
  --files /mnt/user-data/uploads/sales_2024.xlsx \
  --action query \
  --sql "SELECT p.product_name, SUM(o.quantity * o.unit_price) as total_revenue, SUM(o.quantity) as total_units FROM Orders o JOIN Products p ON o.product_id = p.id GROUP BY p.product_name ORDER BY total_revenue DESC LIMIT 10"
```

### Step 3: Monthly revenue trends

```bash
python /mnt/skills/public/data-analysis/scripts/analyze.py \
  --files /mnt/user-data/uploads/sales_2024.xlsx \
  --action query \
  --sql "SELECT DATE_TRUNC('month', order_date) as month, SUM(quantity * unit_price) as revenue FROM Orders GROUP BY month ORDER BY month" \
  --output-file /mnt/user-data/outputs/monthly-trends.csv
```

### Step 4: Statistical summary

```bash
python /mnt/skills/public/data-analysis/scripts/analyze.py \
  --files /mnt/user-data/uploads/sales_2024.xlsx \
  --action summary \
  --table Orders
```

Present results to the user with clear explanations of findings, trends, and actionable insights.

## Multi-file Example

User uploads `orders.csv` and `customers.xlsx` and asks: "Which region has the highest average order value?"

```bash
python /mnt/skills/public/data-analysis/scripts/analyze.py \
  --files /mnt/user-data/uploads/orders.csv /mnt/user-data/uploads/customers.xlsx \
  --action query \
  --sql "SELECT c.region, AVG(o.amount) as avg_order_value, COUNT(*) as order_count FROM orders o JOIN Customers c ON o.customer_id = c.id GROUP BY c.region ORDER BY avg_order_value DESC"
```

## Output Handling

After analysis:

- Present query results directly in conversation as formatted tables
- For large results, export to file and share via the `present_files` tool
- Always explain findings in plain language with key takeaways
- Suggest follow-up analyses when patterns are interesting
- Offer to export results if the user wants to keep them
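
Presenting results as a formatted table amounts to turning headers and rows into a Markdown table. A minimal formatter (a sketch, not the script's renderer):

```python
def to_markdown(headers, rows):
    """Render headers and rows as a simple Markdown table."""
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",  # separator row
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(v) for v in row) + " |")
    return "\n".join(lines)

table = to_markdown(["category", "count"], [["A", 12], ["B", 7]])
print(table)
```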

## Caching

The script automatically caches loaded data to avoid re-parsing files on every call:

- On first load, files are parsed and stored in a persistent DuckDB database under `/mnt/user-data/workspace/.data-analysis-cache/`
- The cache key is a SHA256 hash of all input file contents — if files change, a new cache is created
- Subsequent calls with the same files will use the cached database directly (near-instant startup)
- Cache is transparent — no extra parameters needed

This is especially useful when running multiple queries against the same data files (inspect → query → summary).
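
The cache key described above, a SHA256 over all input file contents, can be sketched like this (hashing in-memory bytes here instead of real files; a real implementation would also delimit between files):

```python
import hashlib

def cache_key(file_contents: list[bytes]) -> str:
    """Hash all input files' bytes into one key; any change to
    any file yields a different key, so a fresh cache is built."""
    h = hashlib.sha256()
    for content in file_contents:
        h.update(content)
    return h.hexdigest()

key = cache_key([b"col1,col2\n1,2\n"])
print(key[:12])
```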

## Notes

- DuckDB supports full SQL including window functions, CTEs, subqueries, and advanced aggregations
- Excel date columns are automatically parsed; use DuckDB date functions (`DATE_TRUNC`, `EXTRACT`, etc.)
- For very large files (100MB+), DuckDB handles them efficiently without loading everything into memory
- Column names with spaces are accessible using double quotes: `"Column Name"`