/explore-data - Profile and Explore a Dataset

If you see unfamiliar placeholders or need to check which tools are connected, see CONNECTORS.md.
Generate a comprehensive data profile for a table or uploaded file. Understand its shape, quality, and patterns before diving into analysis.

Usage

/explore-data <table_name or file>

Workflow

1. Access the Data

If a data warehouse MCP server is connected:
  1. Resolve the table name (handle schema prefixes, suggest matches if ambiguous)
  2. Query table metadata: column names, types, descriptions if available
  3. Run profiling queries against the live data
If a file is provided (CSV, Excel, Parquet, JSON):
  1. Read the file and load into a working dataset
  2. Infer column types from the data
If neither:
  1. Ask the user to provide a table name (with their warehouse connected) or upload a file
  2. If they describe a table schema, provide guidance on what profiling queries to run

2. Understand Structure

Before analyzing any data, understand its structure:
Table-level questions:
  • How many rows and columns?
  • What is the grain (one row per what)?
  • What is the primary key? Is it unique?
  • When was the data last updated?
  • How far back does the data go?
Column classification — categorize each column as one of:
  • Identifier: Unique keys, foreign keys, entity IDs
  • Dimension: Categorical attributes for grouping/filtering (status, type, region, category)
  • Metric: Quantitative values for measurement (revenue, count, duration, score)
  • Temporal: Dates and timestamps (created_at, updated_at, event_date)
  • Text: Free-form text fields (description, notes, name)
  • Boolean: True/false flags
  • Structural: JSON, arrays, nested structures
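The classification above can be approximated from column names and sampled values. The sketch below is a stdlib-only illustration; the name suffixes and the 0.1 cardinality cutoff are assumptions, not part of the command:

```python
def classify_column(name, values):
    """Guess a column's role from its name and a sample of values.

    Heuristic sketch only -- a real profiler should also consult declared
    types and table metadata. Thresholds here are illustrative assumptions.
    """
    lowered = name.lower()
    non_null = [v for v in values if v is not None]
    distinct = set(non_null)

    if lowered.endswith("_id") or lowered == "id":
        return "identifier"
    if lowered.endswith(("_at", "_date")) or "date" in lowered:
        return "temporal"
    # Note: 0/1 integer columns also pass this check, since 0 == False in Python.
    if distinct and distinct <= {True, False}:
        return "boolean"
    if non_null and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in non_null
    ):
        return "metric"
    # Low-cardinality strings look like dimensions; high-cardinality like free text.
    if non_null and len(distinct) / len(non_null) < 0.1:
        return "dimension"
    return "text"
```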

3. Generate Data Profile

Run the following profiling checks:
Table-level metrics:
  • Total row count
  • Column count and types breakdown
  • Approximate table size (if available from metadata)
  • Date range coverage (min/max of date columns)
All columns:
  • Null count and null rate
  • Distinct count and cardinality ratio (distinct / total)
  • Most common values (top 5-10 with frequencies)
  • Least common values (bottom 5 to spot anomalies)
Numeric columns (metrics):
  • min, max, mean, median (p50)
  • standard deviation
  • percentiles: p1, p5, p25, p75, p95, p99
  • zero count
  • negative count (if unexpected)
String columns (dimensions, text):
  • min length, max length, avg length
  • empty string count
  • pattern analysis (do values follow a format?)
  • case consistency (all upper, all lower, mixed?)
  • leading/trailing whitespace count
Date/timestamp columns:
  • min date, max date
  • null dates
  • future dates (if unexpected)
  • distribution by month/week
  • gaps in time series
Boolean columns:
  • true count, false count, null count
  • true rate
Present the profile as a clean summary table, grouped by column type (dimensions, metrics, dates, IDs).
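For file-based data, the numeric checks above can be computed with the standard library alone. A minimal sketch, assuming values are already parsed and nulls are represented as None:

```python
import statistics

def profile_numeric(values):
    """Profile one numeric column: nulls, cardinality, and summary stats.

    `values` is a list that may contain None for nulls. Mirrors the numeric
    checklist above; a sketch, not a full profiler.
    """
    total = len(values)
    non_null = [v for v in values if v is not None]
    nulls = total - len(non_null)
    # quantiles() with n=100 returns the 99 cut points p1..p99.
    qs = statistics.quantiles(non_null, n=100, method="inclusive")
    return {
        "null_rate": nulls / total if total else 0.0,
        "distinct": len(set(non_null)),
        "min": min(non_null),
        "max": max(non_null),
        "mean": statistics.mean(non_null),
        "median": statistics.median(non_null),
        "stdev": statistics.stdev(non_null) if len(non_null) > 1 else 0.0,
        "p5": qs[4],
        "p95": qs[94],
        "zero_count": sum(1 for v in non_null if v == 0),
        "negative_count": sum(1 for v in non_null if v < 0),
    }
```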

4. Identify Data Quality Issues

Apply the quality assessment framework below. Flag potential problems:
  • High null rates: Columns with >5% nulls (warn), >20% nulls (alert)
  • Low cardinality surprises: Columns that should be high-cardinality but aren't (e.g., a "user_id" with only 50 distinct values)
  • High cardinality surprises: Columns that should be categorical but have too many distinct values
  • Suspicious values: Negative amounts where only positive expected, future dates in historical data, obviously placeholder values (e.g., "N/A", "TBD", "test", "999999")
  • Duplicate detection: Check if there's a natural key and whether it has duplicates
  • Distribution skew: Extremely skewed numeric distributions that could affect averages
  • Encoding issues: Mixed case in categorical fields, trailing whitespace, inconsistent formats
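The null-rate thresholds above (>5% warn, >20% alert) and the placeholder check translate directly into code. A sketch; the placeholder set is taken from the examples above:

```python
PLACEHOLDERS = {"n/a", "tbd", "test", "999999", "xxx"}

def flag_column(name, values):
    """Return heuristic (severity, message) quality flags for one column."""
    flags = []
    total = len(values)
    non_null = [v for v in values if v is not None]
    null_rate = (total - len(non_null)) / total if total else 0.0
    if null_rate > 0.20:
        flags.append(("alert", f"{name}: null rate {null_rate:.0%} exceeds 20%"))
    elif null_rate > 0.05:
        flags.append(("warn", f"{name}: null rate {null_rate:.0%} exceeds 5%"))
    # Placeholder-looking values are worth a look, not automatically wrong.
    hits = sum(1 for v in non_null if str(v).strip().lower() in PLACEHOLDERS)
    if hits:
        flags.append(("warn", f"{name}: {hits} placeholder-looking values"))
    return flags
```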

5. Discover Relationships and Patterns

After profiling individual columns:
  • Foreign key candidates: ID columns that might link to other tables
  • Hierarchies: Columns that form natural drill-down paths (country > state > city)
  • Correlations: Numeric columns that move together
  • Derived columns: Columns that appear to be computed from others
  • Redundant columns: Columns with identical or near-identical information

6. Suggest Interesting Dimensions and Metrics

Based on the column profile, recommend:
  • Best dimension columns for slicing data (categorical columns with reasonable cardinality, 3-50 values)
  • Key metric columns for measurement (numeric columns with meaningful distributions)
  • Time columns suitable for trend analysis
  • Natural groupings or hierarchies apparent in the data
  • Potential join keys linking to other tables (ID columns, foreign keys)

7. Recommend Follow-Up Analyses

Suggest 3-5 specific analyses the user could run next:
  • "Trend analysis on [metric] by [time_column] grouped by [dimension]"
  • "Distribution deep-dive on [skewed_column] to understand outliers"
  • "Data quality investigation on [problematic_column]"
  • "Correlation analysis between [metric_a] and [metric_b]"
  • "Cohort analysis using [date_column] and [status_column]"

Output Format

Data Profile: [table_name]

Overview

  • Rows: 2,340,891
  • Columns: 23 (8 dimensions, 6 metrics, 4 dates, 5 IDs)
  • Date range: 2021-03-15 to 2024-01-22

Column Details

[summary table]

Data Quality Issues

[flagged issues with severity]

Recommended Explorations

[numbered list of suggested follow-up analyses]

---

Quality Assessment Framework

Completeness Score

Rate each column:
  • Complete (>99% non-null): Green
  • Mostly complete (95-99%): Yellow -- investigate the nulls
  • Incomplete (80-95%): Orange -- understand why and whether it matters
  • Sparse (<80%): Red -- may not be usable without imputation
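The bands above map directly to a rating function. A minimal sketch; the color names are just labels:

```python
def completeness_rating(non_null_rate):
    """Map a column's non-null rate to the traffic-light bands above."""
    if non_null_rate > 0.99:
        return "green"    # complete
    if non_null_rate >= 0.95:
        return "yellow"   # mostly complete -- investigate the nulls
    if non_null_rate >= 0.80:
        return "orange"   # incomplete -- understand why and whether it matters
    return "red"          # sparse -- may not be usable without imputation
```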

Consistency Checks

Look for:
  • Value format inconsistency: Same concept represented differently ("USA", "US", "United States", "us")
  • Type inconsistency: Numbers stored as strings, dates in various formats
  • Referential integrity: Foreign keys that don't match any parent record
  • Business rule violations: Negative quantities, end dates before start dates, percentages > 100
  • Cross-column consistency: Status = "completed" but completed_at is null
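A cross-column rule like the last bullet can be checked row by row. A sketch; the `status` / `completed_at` field names are hypothetical:

```python
def completed_but_no_timestamp(rows):
    """Count rows violating: status == 'completed' implies completed_at is set.

    `rows` is a list of dicts. The field names are illustrative, not part
    of any real schema.
    """
    return sum(
        1 for r in rows
        if r.get("status") == "completed" and r.get("completed_at") is None
    )
```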

Accuracy Indicators

Red flags that suggest accuracy issues:
  • Placeholder values: 0, -1, 999999, "N/A", "TBD", "test", "xxx"
  • Default values: Suspiciously high frequency of a single value
  • Stale data: updated_at shows no recent changes in an active system
  • Impossible values: Ages > 150, dates in the far future, negative durations
  • Round number bias: All values ending in 0 or 5 (suggests estimation, not measurement)

Timeliness Assessment

  • When was the table last updated?
  • What is the expected update frequency?
  • Is there a lag between event time and load time?
  • Are there gaps in the time series?

Pattern Discovery Techniques

Distribution Analysis

For numeric columns, characterize the distribution:
  • Normal: Mean and median are close, bell-shaped
  • Skewed right: Long tail of high values (common for revenue, session duration)
  • Skewed left: Long tail of low values (less common)
  • Bimodal: Two peaks (suggests two distinct populations)
  • Power law: Few very large values, many small ones (common for user activity)
  • Uniform: Roughly equal frequency across range (often synthetic or random)
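One cheap signal for the first two shapes is the gap between mean and median: a mean well above the median points to a right tail, and vice versa. A sketch with an illustrative tolerance; it cannot detect bimodal or power-law shapes on its own:

```python
import statistics

def rough_shape(values, tol=0.1):
    """Crude distribution label from the mean/median gap.

    `tol` is the relative gap treated as 'close'; 0.1 is an assumption
    for the sketch, not a standard threshold.
    """
    mean = statistics.mean(values)
    median = statistics.median(values)
    scale = abs(median) or 1.0  # avoid dividing by zero when median == 0
    if (mean - median) / scale > tol:
        return "skewed right"
    if (median - mean) / scale > tol:
        return "skewed left"
    return "roughly symmetric"
```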

Temporal Patterns

For time series data, look for:
  • Trend: Sustained upward or downward movement
  • Seasonality: Repeating patterns (weekly, monthly, quarterly, annual)
  • Day-of-week effects: Weekday vs. weekend differences
  • Holiday effects: Drops or spikes around known holidays
  • Change points: Sudden shifts in level or trend
  • Anomalies: Individual data points that break the pattern
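Day-of-week effects, for example, fall out of a simple tally. A sketch assuming the dates are already parsed into `datetime.date` objects:

```python
from collections import Counter
from datetime import date

def day_of_week_counts(dates):
    """Tally events by weekday name to expose weekday/weekend differences."""
    return Counter(d.strftime("%A") for d in dates)
```

Comparing the resulting counts across weekdays (and against known holidays) surfaces the first three effects listed above.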

Segmentation Discovery

Identify natural segments by:
  • Finding categorical columns with 3-20 distinct values
  • Comparing metric distributions across segment values
  • Looking for segments with significantly different behavior
  • Testing whether segments are homogeneous or contain sub-segments
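Comparing a metric across segment values reduces to a group-by. A minimal sketch over rows as dicts; the key names are supplied by the caller:

```python
from collections import defaultdict
import statistics

def metric_by_segment(rows, segment_key, metric_key):
    """Mean of a metric per segment value, to spot divergent segments.

    Rows with a null metric are skipped. Means alone hide shape
    differences -- compare full distributions for anything surprising.
    """
    groups = defaultdict(list)
    for r in rows:
        if r.get(metric_key) is not None:
            groups[r[segment_key]].append(r[metric_key])
    return {seg: statistics.mean(vals) for seg, vals in groups.items()}
```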

Correlation Exploration

Between numeric columns:
  • Compute correlation matrix for all metric pairs
  • Flag strong correlations (|r| > 0.7) for investigation
  • Note: Correlation does not imply causation -- flag this explicitly
  • Check for non-linear relationships (e.g., quadratic, logarithmic)

Schema Understanding and Documentation

Schema Documentation Template

When documenting a dataset for team use:

Table: [schema.table_name]

Description: [What this table represents]
Grain: [One row per...]
Primary Key: [column(s)]
Row Count: [approximate, with date]
Update Frequency: [real-time / hourly / daily / weekly]
Owner: [team or person responsible]

Key Columns

| Column | Type | Description | Example Values | Notes |
|---|---|---|---|---|
| user_id | STRING | Unique user identifier | "usr_abc123" | FK to users.id |
| event_type | STRING | Type of event | "click", "view", "purchase" | 15 distinct values |
| revenue | DECIMAL | Transaction revenue in USD | 29.99, 149.00 | Null for non-purchase events |
| created_at | TIMESTAMP | When the event occurred | 2024-01-15 14:23:01 | Partitioned on this column |

Relationships

  • Joins to users on user_id
  • Joins to products on product_id
  • Parent of event_details (1:many on event_id)

Known Issues

  • [List any known data quality issues]
  • [Note any gotchas for analysts]

Common Query Patterns

  • [Typical use cases for this table]

Schema Exploration Queries

When connected to a data warehouse, use these patterns to discover schema:
```sql
-- List all tables in a schema (PostgreSQL)
SELECT table_name, table_type
FROM information_schema.tables
WHERE table_schema = 'public'
ORDER BY table_name;

-- Column details (PostgreSQL)
SELECT column_name, data_type, is_nullable, column_default
FROM information_schema.columns
WHERE table_name = 'my_table'
ORDER BY ordinal_position;

-- Table sizes (PostgreSQL)
SELECT relname, pg_size_pretty(pg_total_relation_size(relid))
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC;

-- Row counts for all tables (general pattern)
-- Run per-table: SELECT COUNT(*) FROM table_name
```

Lineage and Dependencies

When exploring an unfamiliar data environment:
  1. Start with the "output" tables (what reports or dashboards consume)
  2. Trace upstream: What tables feed into them?
  3. Identify raw/staging/mart layers
  4. Map the transformation chain from raw data to analytical tables
  5. Note where data is enriched, filtered, or aggregated

Tips

  • For very large tables (100M+ rows), profiling queries use sampling by default -- mention if you need exact counts
  • If exploring a new dataset for the first time, this command gives you the lay of the land before writing specific queries
  • The quality flags are heuristic -- not every flag is a real problem, but each is worth a quick look