data-analysis
# Data Analysis Skill
## Overview
This skill analyzes user-uploaded Excel/CSV files using DuckDB — an in-process analytical SQL engine. It supports schema inspection, SQL-based querying, statistical summaries, and result export, all through a single Python script.
## Core Capabilities
- Inspect Excel/CSV file structure (sheets, columns, types, row counts)
- Execute arbitrary SQL queries against uploaded data
- Generate statistical summaries (mean, median, stddev, percentiles, nulls)
- Support multi-sheet Excel workbooks (each sheet becomes a table)
- Export query results to CSV, JSON, or Markdown
- Handle large files efficiently with DuckDB's columnar engine
## Workflow
### Step 1: Understand Requirements
When a user uploads data files and requests analysis, identify:

- File location: Path(s) to uploaded Excel/CSV files under `/mnt/user-data/uploads/`
- Analysis goal: What insights the user wants (summary, filtering, aggregation, comparison, etc.)
- Output format: How results should be presented (table, CSV export, JSON, etc.)
- You don't need to check the folder under `/mnt/user-data`
### Step 2: Inspect File Structure
First, inspect the uploaded file to understand its schema:

```bash
python /mnt/skills/public/data-analysis/scripts/analyze.py \
  --files /mnt/user-data/uploads/data.xlsx \
  --action inspect
```

This returns:

- Sheet names (for Excel) or filename (for CSV)
- Column names, data types, and non-null counts
- Row count per sheet/file
- Sample data (first 5 rows)
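Conceptually, "inspect" amounts to reading column names, inferring a type per column, counting rows, and taking a sample. A stdlib-only sketch of that idea for CSV text (illustrative only — `inspect_csv` and its crude type inference are not the script's actual implementation):

```python
import csv
import io


def inspect_csv(text: str, sample_rows: int = 5) -> dict:
    """Return a rough schema for CSV text: columns, inferred types, row count, sample."""
    rows = list(csv.DictReader(io.StringIO(text)))
    columns = list(rows[0].keys()) if rows else []

    def infer_type(values):
        # Crude inference: BIGINT if all values parse as int, else DOUBLE, else VARCHAR.
        for caster, name in ((int, "BIGINT"), (float, "DOUBLE")):
            try:
                for v in values:
                    caster(v)
                return name
            except ValueError:
                continue
        return "VARCHAR"

    return {
        "columns": {c: infer_type([r[c] for r in rows]) for c in columns},
        "row_count": len(rows),
        "sample": rows[:sample_rows],
    }


schema = inspect_csv("id,amount,city\n1,9.5,Oslo\n2,3.0,Bergen\n")
print(schema["columns"])    # {'id': 'BIGINT', 'amount': 'DOUBLE', 'city': 'VARCHAR'}
print(schema["row_count"])  # 2
```

In practice DuckDB performs this inference itself when it loads the file; the sketch only shows what kind of information the inspect action surfaces.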
### Step 3: Perform Analysis
Based on the schema, construct SQL queries to answer the user's questions.
#### Run SQL Query
```bash
python /mnt/skills/public/data-analysis/scripts/analyze.py \
  --files /mnt/user-data/uploads/data.xlsx \
  --action query \
  --sql "SELECT category, COUNT(*) as count, AVG(amount) as avg_amount FROM Sheet1 GROUP BY category ORDER BY count DESC"
```

#### Generate Statistical Summary
```bash
python /mnt/skills/public/data-analysis/scripts/analyze.py \
  --files /mnt/user-data/uploads/data.xlsx \
  --action summary \
  --table Sheet1
```

This returns, for each numeric column: count, mean, std, min, 25%, 50%, 75%, max, null_count.
For string columns: count, unique, top value, frequency, null_count.
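The numeric-column summary corresponds to standard descriptive statistics. A stdlib sketch of the same quantities for a single column (illustrative; the real script computes these through DuckDB, and its quantile interpolation method may differ):

```python
import statistics


def summarize(values):
    """Describe a numeric column; None entries count as nulls."""
    nums = sorted(v for v in values if v is not None)
    # Inclusive quantiles over the observed data: 25%, 50%, 75%.
    q = statistics.quantiles(nums, n=4, method="inclusive")
    return {
        "count": len(nums),
        "mean": statistics.mean(nums),
        "std": statistics.stdev(nums),
        "min": nums[0],
        "25%": q[0],
        "50%": q[1],
        "75%": q[2],
        "max": nums[-1],
        "null_count": len(values) - len(nums),
    }


s = summarize([10, 20, 30, 40, None])
print(s["mean"], s["50%"], s["null_count"])  # 25 25.0 1
```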
#### Export Results
```bash
python /mnt/skills/public/data-analysis/scripts/analyze.py \
  --files /mnt/user-data/uploads/data.xlsx \
  --action query \
  --sql "SELECT * FROM Sheet1 WHERE amount > 1000" \
  --output-file /mnt/user-data/outputs/filtered-results.csv
```

Supported output formats (auto-detected from extension):

- `.csv`: Comma-separated values
- `.json`: JSON array of records
- `.md`: Markdown table
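Extension-based export can be sketched with the stdlib alone. This is an illustrative helper, not the script's API; `rows_to_text` and its behavior are assumptions:

```python
import csv
import io
import json
from pathlib import Path


def rows_to_text(rows: list[dict], output_file: str) -> str:
    """Render rows as CSV, JSON, or a Markdown table based on the file extension."""
    ext = Path(output_file).suffix.lower()
    headers = list(rows[0].keys()) if rows else []
    if ext == ".csv":
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=headers)
        writer.writeheader()
        writer.writerows(rows)
        return buf.getvalue()
    if ext == ".json":
        return json.dumps(rows, ensure_ascii=False, indent=2)
    if ext == ".md":
        lines = ["| " + " | ".join(headers) + " |",
                 "|" + "---|" * len(headers)]
        lines += ["| " + " | ".join(str(r[h]) for h in headers) + " |" for r in rows]
        return "\n".join(lines)
    raise ValueError(f"unsupported extension: {ext}")


rows = [{"category": "a", "cnt": 2}, {"category": "b", "cnt": 1}]
print(rows_to_text(rows, "out.md"))
```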
## Parameters
| Parameter | Required | Description |
|---|---|---|
| `--files` | Yes | Space-separated paths to Excel/CSV files |
| `--action` | Yes | One of: `inspect`, `query`, `summary` |
| `--sql` | For `query` | SQL query to execute |
| `--table` | For `summary` | Table/sheet name to summarize |
| `--output-file` | No | Path to export results (CSV/JSON/MD) |

> [!NOTE]
> Do NOT read the Python file, just call it with the parameters.
## Table Naming Rules
- Excel files: Each sheet becomes a table named after the sheet (e.g., `Sheet1`, `Sales`, `Revenue`)
- CSV files: Table name is the filename without extension (e.g., `data.csv` → `data`)
- Multiple files: All tables from all files are available in the same query context, enabling cross-file joins
- Special characters: Sheet/file names with spaces or special characters are auto-sanitized (spaces → underscores). Use double quotes for names that start with numbers or contain special characters, e.g., `"2024_Sales"`
## Analysis Patterns
### Basic Exploration
```sql
-- Row count
SELECT COUNT(*) FROM Sheet1

-- Distinct values in a column
SELECT DISTINCT category FROM Sheet1

-- Value distribution
SELECT category, COUNT(*) as cnt FROM Sheet1 GROUP BY category ORDER BY cnt DESC

-- Date range
SELECT MIN(date_col), MAX(date_col) FROM Sheet1
```

### Aggregation & Grouping
```sql
-- Revenue by category and month
SELECT category, DATE_TRUNC('month', order_date) as month,
       SUM(revenue) as total_revenue
FROM Sales
GROUP BY category, month
ORDER BY month, total_revenue DESC

-- Top 10 customers by spend
SELECT customer_name, SUM(amount) as total_spend
FROM Orders GROUP BY customer_name
ORDER BY total_spend DESC LIMIT 10
```

### Cross-file Joins
```sql
-- Join sales with customer info from different files
SELECT s.order_id, s.amount, c.customer_name, c.region
FROM sales s
JOIN customers c ON s.customer_id = c.id
WHERE s.amount > 500
```

### Window Functions
```sql
-- Running total and rank
SELECT order_date, amount,
       SUM(amount) OVER (ORDER BY order_date) as running_total,
       RANK() OVER (ORDER BY amount DESC) as amount_rank
FROM Sales
```

### Pivot-style Analysis
```sql
-- Pivot: monthly revenue by category
SELECT category,
       SUM(CASE WHEN MONTH(date) = 1 THEN revenue END) as Jan,
       SUM(CASE WHEN MONTH(date) = 2 THEN revenue END) as Feb,
       SUM(CASE WHEN MONTH(date) = 3 THEN revenue END) as Mar
FROM Sales
GROUP BY category
```

## Complete Example
User uploads `sales_2024.xlsx` (with sheets: `Orders`, `Products`, `Customers`) and asks: "Analyze my sales data — show top products by revenue and monthly trends."

### Step 1: Inspect the file
```bash
python /mnt/skills/public/data-analysis/scripts/analyze.py \
  --files /mnt/user-data/uploads/sales_2024.xlsx \
  --action inspect
```

### Step 2: Top products by revenue
```bash
python /mnt/skills/public/data-analysis/scripts/analyze.py \
  --files /mnt/user-data/uploads/sales_2024.xlsx \
  --action query \
  --sql "SELECT p.product_name, SUM(o.quantity * o.unit_price) as total_revenue, SUM(o.quantity) as total_units FROM Orders o JOIN Products p ON o.product_id = p.id GROUP BY p.product_name ORDER BY total_revenue DESC LIMIT 10"
```

### Step 3: Monthly revenue trends
```bash
python /mnt/skills/public/data-analysis/scripts/analyze.py \
  --files /mnt/user-data/uploads/sales_2024.xlsx \
  --action query \
  --sql "SELECT DATE_TRUNC('month', order_date) as month, SUM(quantity * unit_price) as revenue FROM Orders GROUP BY month ORDER BY month" \
  --output-file /mnt/user-data/outputs/monthly-trends.csv
```

### Step 4: Statistical summary
```bash
python /mnt/skills/public/data-analysis/scripts/analyze.py \
  --files /mnt/user-data/uploads/sales_2024.xlsx \
  --action summary \
  --table Orders
```

Present results to the user with clear explanations of findings, trends, and actionable insights.
## Multi-file Example
User uploads `orders.csv` and `customers.xlsx` and asks: "Which region has the highest average order value?"

```bash
python /mnt/skills/public/data-analysis/scripts/analyze.py \
  --files /mnt/user-data/uploads/orders.csv /mnt/user-data/uploads/customers.xlsx \
  --action query \
  --sql "SELECT c.region, AVG(o.amount) as avg_order_value, COUNT(*) as order_count FROM orders o JOIN Customers c ON o.customer_id = c.id GROUP BY c.region ORDER BY avg_order_value DESC"
```

## Output Handling
After analysis:

- Present query results directly in conversation as formatted tables
- For large results, export to file and share via the `present_files` tool
- Always explain findings in plain language with key takeaways
- Suggest follow-up analyses when patterns are interesting
- Offer to export results if the user wants to keep them
## Caching
The script automatically caches loaded data to avoid re-parsing files on every call:

- On first load, files are parsed and stored in a persistent DuckDB database under `/mnt/user-data/workspace/.data-analysis-cache/`
- The cache key is a SHA256 hash of all input file contents — if files change, a new cache is created
- Subsequent calls with the same files will use the cached database directly (near-instant startup)
- Cache is transparent — no extra parameters needed

This is especially useful when running multiple queries against the same data files (inspect → query → summary).
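The cache-key idea (a SHA256 hash over all input file contents) can be sketched as follows. Details such as file ordering and separators are assumptions; the source only states that the key is a hash of the file contents.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def cache_key(paths: list[str]) -> str:
    """Derive a cache key from the byte content of every input file.

    Files are hashed in sorted path order so the key does not depend on
    the order the paths are passed in (assumed detail).
    """
    h = hashlib.sha256()
    for p in sorted(paths):
        h.update(Path(p).read_bytes())
    return h.hexdigest()


# Demonstrate: same contents give the same key; changed contents give a new key.
d = tempfile.mkdtemp()
a, b = os.path.join(d, "a.csv"), os.path.join(d, "b.csv")
Path(a).write_text("x,y\n1,2\n")
Path(b).write_text("x,y\n3,4\n")
k1 = cache_key([a, b])
k2 = cache_key([b, a])     # order-independent
Path(b).write_text("x,y\n3,5\n")
k3 = cache_key([a, b])     # content changed -> different key
print(k1 == k2, k1 != k3)  # True True
```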
## Notes
- DuckDB supports full SQL including window functions, CTEs, subqueries, and advanced aggregations
- Excel date columns are automatically parsed; use DuckDB date functions (`DATE_TRUNC`, `EXTRACT`, etc.)
- For very large files (100MB+), DuckDB handles them efficiently without loading everything into memory
- Column names with spaces are accessible using double quotes: `"Column Name"`