/explore-data - Profile and Explore a Dataset

If you see unfamiliar placeholders or need to check which tools are connected, see CONNECTORS.md.
Generate a comprehensive data profile for a table or uploaded file. Understand its shape, quality, and patterns before diving into analysis.

Usage

/explore-data <table_name or file>

Workflow

1. Access the Data

If a data warehouse MCP server is connected:
  1. Resolve the table name (handle schema prefixes, suggest matches if ambiguous)
  2. Query table metadata: column names, types, descriptions if available
  3. Run profiling queries against the live data
If a file is provided (CSV, Excel, Parquet, JSON):
  1. Read the file and load into a working dataset
  2. Infer column types from the data
If neither:
  1. Ask the user to provide a table name (with their warehouse connected) or upload a file
  2. If they describe a table schema, provide guidance on what profiling queries to run

2. Understand Structure

Before analyzing any data, understand its structure:
Table-level questions:
  • How many rows and columns?
  • What is the grain (one row per what)?
  • What is the primary key? Is it unique?
  • When was the data last updated?
  • How far back does the data go?
Column classification — categorize each column as one of:
  • Identifier: Unique keys, foreign keys, entity IDs
  • Dimension: Categorical attributes for grouping/filtering (status, type, region, category)
  • Metric: Quantitative values for measurement (revenue, count, duration, score)
  • Temporal: Dates and timestamps (created_at, updated_at, event_date)
  • Text: Free-form text fields (description, notes, name)
  • Boolean: True/false flags
  • Structural: JSON, arrays, nested structures
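The classification above can be approximated from column names and sampled values. The sketch below is a stdlib-only illustration; the name suffixes and the 0.1 cardinality cutoff are assumptions, not part of the command:

```python
def classify_column(name, values):
    """Guess a column's role from its name and a sample of values.

    Heuristic sketch only -- a real profiler should also consult declared
    types and table metadata. Thresholds here are illustrative assumptions.
    """
    lowered = name.lower()
    non_null = [v for v in values if v is not None]
    distinct = set(non_null)

    if lowered.endswith("_id") or lowered == "id":
        return "identifier"
    if lowered.endswith(("_at", "_date")) or "date" in lowered:
        return "temporal"
    # Note: 0/1 integer columns also pass this check, since 0 == False in Python.
    if distinct and distinct <= {True, False}:
        return "boolean"
    if non_null and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in non_null
    ):
        return "metric"
    # Low-cardinality strings look like dimensions; high-cardinality like free text.
    if non_null and len(distinct) / len(non_null) < 0.1:
        return "dimension"
    return "text"
```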

3. Generate Data Profile

Run the following profiling checks:
Table-level metrics:
  • Total row count
  • Column count and types breakdown
  • Approximate table size (if available from metadata)
  • Date range coverage (min/max of date columns)
All columns:
  • Null count and null rate
  • Distinct count and cardinality ratio (distinct / total)
  • Most common values (top 5-10 with frequencies)
  • Least common values (bottom 5 to spot anomalies)
Numeric columns (metrics):
  • min, max, mean, median (p50)
  • standard deviation
  • percentiles: p1, p5, p25, p75, p95, p99
  • zero count
  • negative count (if unexpected)
String columns (dimensions, text):
  • min length, max length, avg length
  • empty string count
  • pattern analysis (do values follow a format?)
  • case consistency (all upper, all lower, mixed?)
  • leading/trailing whitespace count
Date/timestamp columns:
  • min date, max date
  • null dates
  • future dates (if unexpected)
  • distribution by month/week
  • gaps in time series
Boolean columns:
  • true count, false count, null count
  • true rate
Present the profile as a clean summary table, grouped by column type (dimensions, metrics, dates, IDs).
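For file-based data, the numeric checks above can be computed with the standard library alone. A minimal sketch, assuming values are already parsed and nulls are represented as None:

```python
import statistics

def profile_numeric(values):
    """Profile one numeric column: nulls, cardinality, and summary stats.

    `values` is a list that may contain None for nulls. Mirrors the numeric
    checklist above; a sketch, not a full profiler.
    """
    total = len(values)
    non_null = [v for v in values if v is not None]
    nulls = total - len(non_null)
    # quantiles() with n=100 returns the 99 cut points p1..p99.
    qs = statistics.quantiles(non_null, n=100, method="inclusive")
    return {
        "null_rate": nulls / total if total else 0.0,
        "distinct": len(set(non_null)),
        "min": min(non_null),
        "max": max(non_null),
        "mean": statistics.mean(non_null),
        "median": statistics.median(non_null),
        "stdev": statistics.stdev(non_null) if len(non_null) > 1 else 0.0,
        "p5": qs[4],
        "p95": qs[94],
        "zero_count": sum(1 for v in non_null if v == 0),
        "negative_count": sum(1 for v in non_null if v < 0),
    }
```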

4. Identify Data Quality Issues

Apply the quality assessment framework below. Flag potential problems:
  • High null rates: Columns with >5% nulls (warn), >20% nulls (alert)
  • Low cardinality surprises: Columns that should be high-cardinality but aren't (e.g., a "user_id" with only 50 distinct values)
  • High cardinality surprises: Columns that should be categorical but have too many distinct values
  • Suspicious values: Negative amounts where only positive expected, future dates in historical data, obviously placeholder values (e.g., "N/A", "TBD", "test", "999999")
  • Duplicate detection: Check if there's a natural key and whether it has duplicates
  • Distribution skew: Extremely skewed numeric distributions that could affect averages
  • Encoding issues: Mixed case in categorical fields, trailing whitespace, inconsistent formats
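The null-rate thresholds above (>5% warn, >20% alert) and the placeholder check translate directly into code. A sketch; the placeholder set is taken from the examples above:

```python
PLACEHOLDERS = {"n/a", "tbd", "test", "999999", "xxx"}

def flag_column(name, values):
    """Return heuristic (severity, message) quality flags for one column."""
    flags = []
    total = len(values)
    non_null = [v for v in values if v is not None]
    null_rate = (total - len(non_null)) / total if total else 0.0
    if null_rate > 0.20:
        flags.append(("alert", f"{name}: null rate {null_rate:.0%} exceeds 20%"))
    elif null_rate > 0.05:
        flags.append(("warn", f"{name}: null rate {null_rate:.0%} exceeds 5%"))
    # Placeholder-looking values are worth a look, not automatically wrong.
    hits = sum(1 for v in non_null if str(v).strip().lower() in PLACEHOLDERS)
    if hits:
        flags.append(("warn", f"{name}: {hits} placeholder-looking values"))
    return flags
```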

5. Discover Relationships and Patterns

After profiling individual columns:
  • Foreign key candidates: ID columns that might link to other tables
  • Hierarchies: Columns that form natural drill-down paths (country > state > city)
  • Correlations: Numeric columns that move together
  • Derived columns: Columns that appear to be computed from others
  • Redundant columns: Columns with identical or near-identical information

6. Suggest Interesting Dimensions and Metrics

Based on the column profile, recommend:
  • Best dimension columns for slicing data (categorical columns with reasonable cardinality, 3-50 values)
  • Key metric columns for measurement (numeric columns with meaningful distributions)
  • Time columns suitable for trend analysis
  • Natural groupings or hierarchies apparent in the data
  • Potential join keys linking to other tables (ID columns, foreign keys)

7. Recommend Follow-Up Analyses

Suggest 3-5 specific analyses the user could run next:
  • "Trend analysis on [metric] by [time_column] grouped by [dimension]"
  • "Distribution deep-dive on [skewed_column] to understand outliers"
  • "Data quality investigation on [problematic_column]"
  • "Correlation analysis between [metric_a] and [metric_b]"
  • "Cohort analysis using [date_column] and [status_column]"

Output Format

Data Profile: [table_name]

Overview

  • Rows: 2,340,891
  • Columns: 23 (8 dimensions, 6 metrics, 4 dates, 5 IDs)
  • Date range: 2021-03-15 to 2024-01-22

Column Details

[summary table]

Data Quality Issues

[flagged issues with severity]

Recommended Explorations

[numbered list of suggested follow-up analyses]

---

Quality Assessment Framework

Completeness Score

Rate each column:
  • Complete (>99% non-null): Green
  • Mostly complete (95-99%): Yellow -- investigate the nulls
  • Incomplete (80-95%): Orange -- understand why and whether it matters
  • Sparse (<80%): Red -- may not be usable without imputation
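The bands above map directly to a rating function. A minimal sketch; the color names are just labels:

```python
def completeness_rating(non_null_rate):
    """Map a column's non-null rate to the traffic-light bands above."""
    if non_null_rate > 0.99:
        return "green"    # complete
    if non_null_rate >= 0.95:
        return "yellow"   # mostly complete -- investigate the nulls
    if non_null_rate >= 0.80:
        return "orange"   # incomplete -- understand why and whether it matters
    return "red"          # sparse -- may not be usable without imputation
```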

Consistency Checks

Look for:
  • Value format inconsistency: Same concept represented differently ("USA", "US", "United States", "us")
  • Type inconsistency: Numbers stored as strings, dates in various formats
  • Referential integrity: Foreign keys that don't match any parent record
  • Business rule violations: Negative quantities, end dates before start dates, percentages > 100
  • Cross-column consistency: Status = "completed" but completed_at is null
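A cross-column rule like the last bullet can be checked row by row. A sketch; the `status` / `completed_at` field names are hypothetical:

```python
def completed_but_no_timestamp(rows):
    """Count rows violating: status == 'completed' implies completed_at is set.

    `rows` is a list of dicts. The field names are illustrative, not part
    of any real schema.
    """
    return sum(
        1 for r in rows
        if r.get("status") == "completed" and r.get("completed_at") is None
    )
```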

Accuracy Indicators

Red flags that suggest accuracy issues:
  • Placeholder values: 0, -1, 999999, "N/A", "TBD", "test", "xxx"
  • Default values: Suspiciously high frequency of a single value
  • Stale data: updated_at shows no recent changes in an active system
  • Impossible values: Ages > 150, dates in the far future, negative durations
  • Round number bias: All values ending in 0 or 5 (suggests estimation, not measurement)

Timeliness Assessment

  • When was the table last updated?
  • What is the expected update frequency?
  • Is there a lag between event time and load time?
  • Are there gaps in the time series?

Pattern Discovery Techniques

Distribution Analysis

For numeric columns, characterize the distribution:
  • Normal: Mean and median are close, bell-shaped
  • Skewed right: Long tail of high values (common for revenue, session duration)
  • Skewed left: Long tail of low values (less common)
  • Bimodal: Two peaks (suggests two distinct populations)
  • Power law: Few very large values, many small ones (common for user activity)
  • Uniform: Roughly equal frequency across range (often synthetic or random)
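One cheap signal for the first two shapes is the gap between mean and median: a mean well above the median points to a right tail, and vice versa. A sketch with an illustrative tolerance; it cannot detect bimodal or power-law shapes on its own:

```python
import statistics

def rough_shape(values, tol=0.1):
    """Crude distribution label from the mean/median gap.

    `tol` is the relative gap treated as 'close'; 0.1 is an assumption
    for the sketch, not a standard threshold.
    """
    mean = statistics.mean(values)
    median = statistics.median(values)
    scale = abs(median) or 1.0  # avoid dividing by zero when median == 0
    if (mean - median) / scale > tol:
        return "skewed right"
    if (median - mean) / scale > tol:
        return "skewed left"
    return "roughly symmetric"
```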

Temporal Patterns

For time series data, look for:
  • Trend: Sustained upward or downward movement
  • Seasonality: Repeating patterns (weekly, monthly, quarterly, annual)
  • Day-of-week effects: Weekday vs. weekend differences
  • Holiday effects: Drops or spikes around known holidays
  • Change points: Sudden shifts in level or trend
  • Anomalies: Individual data points that break the pattern
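Day-of-week effects, for example, fall out of a simple tally. A sketch assuming the dates are already parsed into `datetime.date` objects:

```python
from collections import Counter
from datetime import date

def day_of_week_counts(dates):
    """Tally events by weekday name to expose weekday/weekend differences."""
    return Counter(d.strftime("%A") for d in dates)
```

Comparing the resulting counts across weekdays (and against known holidays) surfaces the first three effects listed above.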

Segmentation Discovery

Identify natural segments by:
  • Finding categorical columns with 3-20 distinct values
  • Comparing metric distributions across segment values
  • Looking for segments with significantly different behavior
  • Testing whether segments are homogeneous or contain sub-segments
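Comparing a metric across segment values reduces to a group-by. A minimal sketch over rows as dicts; the key names are supplied by the caller:

```python
from collections import defaultdict
import statistics

def metric_by_segment(rows, segment_key, metric_key):
    """Mean of a metric per segment value, to spot divergent segments.

    Rows with a null metric are skipped. Means alone hide shape
    differences -- compare full distributions for anything surprising.
    """
    groups = defaultdict(list)
    for r in rows:
        if r.get(metric_key) is not None:
            groups[r[segment_key]].append(r[metric_key])
    return {seg: statistics.mean(vals) for seg, vals in groups.items()}
```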

Correlation Exploration

Between numeric columns:
  • Compute correlation matrix for all metric pairs
  • Flag strong correlations (|r| > 0.7) for investigation
  • Note: Correlation does not imply causation -- flag this explicitly
  • Check for non-linear relationships (e.g., quadratic, logarithmic)

Schema Understanding and Documentation

Schema Documentation Template

When documenting a dataset for team use:

Table: [schema.table_name]

Description: [What this table represents]
Grain: [One row per...]
Primary Key: [column(s)]
Row Count: [approximate, with date]
Update Frequency: [real-time / hourly / daily / weekly]
Owner: [team or person responsible]

Key Columns

| Column | Type | Description | Example Values | Notes |
|---|---|---|---|---|
| user_id | STRING | Unique user identifier | "usr_abc123" | FK to users.id |
| event_type | STRING | Type of event | "click", "view", "purchase" | 15 distinct values |
| revenue | DECIMAL | Transaction revenue in USD | 29.99, 149.00 | Null for non-purchase events |
| created_at | TIMESTAMP | When the event occurred | 2024-01-15 14:23:01 | Partitioned on this column |

Relationships

  • Joins to users on user_id
  • Joins to products on product_id
  • Parent of event_details (1:many on event_id)

Known Issues

  • [List any known data quality issues]
  • [Note any gotchas for analysts]

Common Query Patterns

  • [Typical use cases for this table]

Schema Exploration Queries

When connected to a data warehouse, use these patterns to discover schema:
```sql
-- List all tables in a schema (PostgreSQL)
SELECT table_name, table_type
FROM information_schema.tables
WHERE table_schema = 'public'
ORDER BY table_name;

-- Column details (PostgreSQL)
SELECT column_name, data_type, is_nullable, column_default
FROM information_schema.columns
WHERE table_name = 'my_table'
ORDER BY ordinal_position;

-- Table sizes (PostgreSQL)
SELECT relname, pg_size_pretty(pg_total_relation_size(relid))
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC;

-- Row counts for all tables (general pattern)
-- Run per-table: SELECT COUNT(*) FROM table_name
```

Lineage and Dependencies

When exploring an unfamiliar data environment:
  1. Start with the "output" tables (what reports or dashboards consume)
  2. Trace upstream: What tables feed into them?
  3. Identify raw/staging/mart layers
  4. Map the transformation chain from raw data to analytical tables
  5. Note where data is enriched, filtered, or aggregated

Tips

  • For very large tables (100M+ rows), profiling queries use sampling by default -- mention if you need exact counts
  • If exploring a new dataset for the first time, this command gives you the lay of the land before writing specific queries
  • The quality flags are heuristic -- not every flag is a real problem, but each is worth a quick look