/explore-data - Profile and Explore a Dataset
If you see unfamiliar placeholders or need to check which tools are connected, see CONNECTORS.md.
Generate a comprehensive data profile for a table or uploaded file. Understand its shape, quality, and patterns before diving into analysis.
Usage
/explore-data <table_name or file>

Workflow
1. Access the Data
If a data warehouse MCP server is connected:
- Resolve the table name (handle schema prefixes, suggest matches if ambiguous)
- Query table metadata: column names, types, descriptions if available
- Run profiling queries against the live data
If a file is provided (CSV, Excel, Parquet, JSON):
- Read the file and load into a working dataset
- Infer column types from the data
If neither:
- Ask the user to provide a table name (with their warehouse connected) or upload a file
- If they describe a table schema, provide guidance on what profiling queries to run
2. Understand Structure
Before analyzing any data, understand its structure:
Table-level questions:
- How many rows and columns?
- What is the grain (one row per what)?
- What is the primary key? Is it unique?
- When was the data last updated?
- How far back does the data go?
Column classification — categorize each column as one of:
- Identifier: Unique keys, foreign keys, entity IDs
- Dimension: Categorical attributes for grouping/filtering (status, type, region, category)
- Metric: Quantitative values for measurement (revenue, count, duration, score)
- Temporal: Dates and timestamps (created_at, updated_at, event_date)
- Text: Free-form text fields (description, notes, name)
- Boolean: True/false flags
- Structural: JSON, arrays, nested structures
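The classification above can be automated with simple name- and value-based heuristics. A minimal sketch (a hypothetical helper, not part of the command itself), assuming each column's sample values arrive as strings:

```python
import re

def classify_column(name: str, samples: list[str]) -> str:
    """Heuristically classify a column as identifier, temporal, boolean,
    metric, dimension, or text, based on its name and sample values."""
    lowered = name.lower()
    non_null = [s for s in samples if s not in ("", None)]
    if lowered.endswith("_id") or lowered == "id":
        return "identifier"
    if any(k in lowered for k in ("_at", "date", "time")):
        return "temporal"
    if non_null and all(s in ("true", "false", "0", "1") for s in non_null):
        return "boolean"
    if non_null and all(re.fullmatch(r"-?\d+(\.\d+)?", s) for s in non_null):
        return "metric"
    # A low distinct-to-total ratio suggests a categorical dimension
    if len(set(non_null)) <= max(1, len(non_null) // 2):
        return "dimension"
    return "text"
```

Name-based rules fire first because a numeric-looking `user_id` should still be treated as an identifier, not a metric.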
3. Generate Data Profile
Run the following profiling checks:
Table-level metrics:
- Total row count
- Column count and types breakdown
- Approximate table size (if available from metadata)
- Date range coverage (min/max of date columns)
All columns:
- Null count and null rate
- Distinct count and cardinality ratio (distinct / total)
- Most common values (top 5-10 with frequencies)
- Least common values (bottom 5 to spot anomalies)
Numeric columns (metrics):
- min, max, mean, median (p50)
- standard deviation
- percentiles: p1, p5, p25, p75, p95, p99
- zero count
- negative count (if unexpected)

String columns (dimensions, text):
- min length, max length, avg length
- empty string count
- pattern analysis (do values follow a format?)
- case consistency (all upper, all lower, mixed?)
- leading/trailing whitespace count

Date/timestamp columns:
- min date, max date
- null dates
- future dates (if unexpected)
- distribution by month/week
- gaps in time series

Boolean columns:
- true count, false count, null count
- true rate

Present the profile as a clean summary table, grouped by column type (dimensions, metrics, dates, IDs).
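For uploaded files, the all-column checks (null rate, cardinality ratio, top values) can be sketched in plain Python. This is illustrative only; against a warehouse the same numbers come from SQL:

```python
from collections import Counter

def profile_column(values: list) -> dict:
    """Null rate, distinct-to-total cardinality ratio, and the most
    common values for a single column."""
    total = len(values)
    nulls = sum(1 for v in values if v in (None, ""))
    non_null = [v for v in values if v not in (None, "")]
    return {
        "null_rate": nulls / total if total else 0.0,
        "cardinality_ratio": len(set(non_null)) / total if total else 0.0,
        "top_values": Counter(non_null).most_common(5),
    }
```

For example, `profile_column(["a", "a", "b", None])` reports a 25% null rate, a cardinality ratio of 0.5, and `a` as the most common value.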
4. Identify Data Quality Issues
Apply the quality assessment framework below. Flag potential problems:
- High null rates: Columns with >5% nulls (warn), >20% nulls (alert)
- Low cardinality surprises: Columns that should be high-cardinality but aren't (e.g., a "user_id" with only 50 distinct values)
- High cardinality surprises: Columns that should be categorical but have too many distinct values
- Suspicious values: Negative amounts where only positive expected, future dates in historical data, obviously placeholder values (e.g., "N/A", "TBD", "test", "999999")
- Duplicate detection: Check if there's a natural key and whether it has duplicates
- Distribution skew: Extremely skewed numeric distributions that could affect averages
- Encoding issues: Mixed case in categorical fields, trailing whitespace, inconsistent formats
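The null-rate thresholds and the placeholder-value check above can be sketched as a small flagging function (the placeholder set is an example, not exhaustive):

```python
PLACEHOLDERS = {"n/a", "tbd", "test", "999999", "xxx"}

def flag_column(name: str, values: list) -> list[str]:
    """Return heuristic quality flags for one column: warn at >5% nulls,
    alert at >20%, and report known placeholder values."""
    flags = []
    total = len(values) or 1
    null_rate = sum(1 for v in values if v in (None, "")) / total
    if null_rate > 0.20:
        flags.append(f"{name}: ALERT null rate {null_rate:.0%}")
    elif null_rate > 0.05:
        flags.append(f"{name}: WARN null rate {null_rate:.0%}")
    hits = {str(v).lower() for v in values if v is not None} & PLACEHOLDERS
    if hits:
        flags.append(f"{name}: suspicious placeholder values {sorted(hits)}")
    return flags
```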
5. Discover Relationships and Patterns
After profiling individual columns:
- Foreign key candidates: ID columns that might link to other tables
- Hierarchies: Columns that form natural drill-down paths (country > state > city)
- Correlations: Numeric columns that move together
- Derived columns: Columns that appear to be computed from others
- Redundant columns: Columns with identical or near-identical information
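Redundant-column detection is the easiest of these to mechanize: compare columns row-for-row. A sketch (near-identical detection would relax the equality test):

```python
def redundant_pairs(columns: dict[str, list]) -> list[tuple[str, str]]:
    """Return pairs of column names whose values are identical row-for-row."""
    names = list(columns)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if columns[a] == columns[b]
    ]
```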
6. Suggest Interesting Dimensions and Metrics
Based on the column profile, recommend:
- Best dimension columns for slicing data (categorical columns with reasonable cardinality, 3-50 values)
- Key metric columns for measurement (numeric columns with meaningful distributions)
- Time columns suitable for trend analysis
- Natural groupings or hierarchies apparent in the data
- Potential join keys linking to other tables (ID columns, foreign keys)
7. Recommend Follow-Up Analyses
Suggest 3-5 specific analyses the user could run next:
- "Trend analysis on [metric] by [time_column] grouped by [dimension]"
- "Distribution deep-dive on [skewed_column] to understand outliers"
- "Data quality investigation on [problematic_column]"
- "Correlation analysis between [metric_a] and [metric_b]"
- "Cohort analysis using [date_column] and [status_column]"
Output Format
```
Data Profile: [table_name]

Overview
- Rows: 2,340,891
- Columns: 23 (8 dimensions, 6 metrics, 4 dates, 5 IDs)
- Date range: 2021-03-15 to 2024-01-22

Column Details
[summary table]

Data Quality Issues
[flagged issues with severity]

Recommended Explorations
[numbered list of suggested follow-up analyses]
```

---

Quality Assessment Framework
Completeness Score
Rate each column:
- Complete (>99% non-null): Green
- Mostly complete (95-99%): Yellow -- investigate the nulls
- Incomplete (80-95%): Orange -- understand why and whether it matters
- Sparse (<80%): Red -- may not be usable without imputation
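The bands above translate directly into a rating function. A minimal sketch:

```python
def completeness_rating(non_null_rate: float) -> str:
    """Map a column's non-null rate to the traffic-light bands above."""
    if non_null_rate > 0.99:
        return "green"   # complete
    if non_null_rate >= 0.95:
        return "yellow"  # mostly complete: investigate the nulls
    if non_null_rate >= 0.80:
        return "orange"  # incomplete: understand why and whether it matters
    return "red"         # sparse: may not be usable without imputation
```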
Consistency Checks
Look for:
- Value format inconsistency: Same concept represented differently ("USA", "US", "United States", "us")
- Type inconsistency: Numbers stored as strings, dates in various formats
- Referential integrity: Foreign keys that don't match any parent record
- Business rule violations: Negative quantities, end dates before start dates, percentages > 100
- Cross-column consistency: Status = "completed" but completed_at is null
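A cross-column check like the status/completed_at example can be written as a row-level rule (field names here are hypothetical and would come from the actual schema):

```python
def inconsistent_rows(rows: list[dict]) -> list[dict]:
    """Rows marked completed that lack a completion timestamp."""
    return [
        r for r in rows
        if r.get("status") == "completed" and not r.get("completed_at")
    ]
```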
Accuracy Indicators
Red flags that suggest accuracy issues:
- Placeholder values: 0, -1, 999999, "N/A", "TBD", "test", "xxx"
- Default values: Suspiciously high frequency of a single value
- Stale data: Updated_at shows no recent changes in an active system
- Impossible values: Ages > 150, dates in the far future, negative durations
- Round number bias: All values ending in 0 or 5 (suggests estimation, not measurement)
Timeliness Assessment
- When was the table last updated?
- What is the expected update frequency?
- Is there a lag between event time and load time?
- Are there gaps in the time series?
Pattern Discovery Techniques
Distribution Analysis
For numeric columns, characterize the distribution:
- Normal: Mean and median are close, bell-shaped
- Skewed right: Long tail of high values (common for revenue, session duration)
- Skewed left: Long tail of low values (less common)
- Bimodal: Two peaks (suggests two distinct populations)
- Power law: Few very large values, many small ones (common for user activity)
- Uniform: Roughly equal frequency across range (often synthetic or random)
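A rough first pass at the skew cases compares mean and median, as the definitions above suggest. This sketch is a heuristic, not a formal normality or skewness test:

```python
import statistics

def skew_direction(values: list[float]) -> str:
    """Classify a distribution as skewed right, skewed left, or roughly
    symmetric by comparing mean and median relative to the spread."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    spread = statistics.pstdev(values) or 1.0  # guard against constant columns
    if (mean - median) / spread > 0.2:
        return "skewed right"
    if (median - mean) / spread > 0.2:
        return "skewed left"
    return "roughly symmetric"
```

A long right tail (as in revenue data) pulls the mean well above the median, which is exactly what the first branch detects.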
Temporal Patterns
For time series data, look for:
- Trend: Sustained upward or downward movement
- Seasonality: Repeating patterns (weekly, monthly, quarterly, annual)
- Day-of-week effects: Weekday vs. weekend differences
- Holiday effects: Drops or spikes around known holidays
- Change points: Sudden shifts in level or trend
- Anomalies: Individual data points that break the pattern
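Of these, gap detection is the most mechanical. A sketch for a daily series (adjust the expected interval for weekly or hourly data):

```python
from datetime import date, timedelta

def find_gaps(days: list[date]) -> list[tuple[date, date]]:
    """Return (last_seen, next_seen) pairs where daily data is missing."""
    ordered = sorted(set(days))
    return [
        (a, b)
        for a, b in zip(ordered, ordered[1:])
        if (b - a) > timedelta(days=1)
    ]
```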
Segmentation Discovery
Identify natural segments by:
- Finding categorical columns with 3-20 distinct values
- Comparing metric distributions across segment values
- Looking for segments with significantly different behavior
- Testing whether segments are homogeneous or contain sub-segments
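Comparing a metric's distribution across segment values starts with a per-segment summary. A sketch (the `segment`/`metric` field names are placeholders for real columns):

```python
from collections import defaultdict
from statistics import mean

def metric_by_segment(rows: list[dict], segment: str, metric: str) -> dict:
    """Mean of `metric` for each value of `segment`."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[segment]].append(r[metric])
    return {k: mean(v) for k, v in groups.items()}
```

Large differences between segment means are a cue to dig into the full distributions, not a conclusion by themselves.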
Correlation Exploration
Between numeric columns:
- Compute correlation matrix for all metric pairs
- Flag strong correlations (|r| > 0.7) for investigation
- Note: Correlation does not imply causation -- flag this explicitly
- Check for non-linear relationships (e.g., quadratic, logarithmic)
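The matrix-and-flag step can be sketched with a plain Pearson computation over all metric pairs (this catches linear relationships only; the non-linear check above needs separate treatment):

```python
from itertools import combinations
from math import sqrt

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def strong_pairs(metrics: dict[str, list[float]], threshold: float = 0.7):
    """Flag metric pairs with |r| above the threshold.
    Correlation does not imply causation; these are leads, not findings."""
    return [
        (a, b, round(pearson(metrics[a], metrics[b]), 3))
        for a, b in combinations(metrics, 2)
        if abs(pearson(metrics[a], metrics[b])) > threshold
    ]
```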
Schema Understanding and Documentation
Schema Documentation Template
When documenting a dataset for team use:

```markdown
Table: [schema.table_name]

Description: [What this table represents]
Grain: [One row per...]
Primary Key: [column(s)]
Row Count: [approximate, with date]
Update Frequency: [real-time / hourly / daily / weekly]
Owner: [team or person responsible]

Key Columns

| Column | Type | Description | Example Values | Notes |
|---|---|---|---|---|
| user_id | STRING | Unique user identifier | "usr_abc123" | FK to users.id |
| event_type | STRING | Type of event | "click", "view", "purchase" | 15 distinct values |
| revenue | DECIMAL | Transaction revenue in USD | 29.99, 149.00 | Null for non-purchase events |
| created_at | TIMESTAMP | When the event occurred | 2024-01-15 14:23:01 | Partitioned on this column |

Relationships

- Joins to users on user_id
- Joins to products on product_id
- Parent of event_details (1:many on event_id)

Known Issues

- [List any known data quality issues]
- [Note any gotchas for analysts]

Common Query Patterns

- [Typical use cases for this table]
```

Schema Exploration Queries
When connected to a data warehouse, use these patterns to discover schema:
```sql
-- List all tables in a schema (PostgreSQL)
SELECT table_name, table_type
FROM information_schema.tables
WHERE table_schema = 'public'
ORDER BY table_name;

-- Column details (PostgreSQL)
SELECT column_name, data_type, is_nullable, column_default
FROM information_schema.columns
WHERE table_name = 'my_table'
ORDER BY ordinal_position;

-- Table sizes (PostgreSQL)
SELECT relname, pg_size_pretty(pg_total_relation_size(relid))
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC;

-- Row counts for all tables (general pattern)
-- Run per-table: SELECT COUNT(*) FROM table_name
```

Lineage and Dependencies
When exploring an unfamiliar data environment:
- Start with the "output" tables (what reports or dashboards consume)
- Trace upstream: What tables feed into them?
- Identify raw/staging/mart layers
- Map the transformation chain from raw data to analytical tables
- Note where data is enriched, filtered, or aggregated
Tips
- For very large tables (100M+ rows), profiling queries use sampling by default -- mention if you need exact counts
- If exploring a new dataset for the first time, this command gives you the lay of the land before writing specific queries
- The quality flags are heuristic -- not every flag is a real problem, but each is worth a quick look