
Data Exploration Skill

Systematic methodology for profiling datasets, assessing data quality, discovering patterns, and understanding schemas.

Data Profiling Methodology

Phase 1: Structural Understanding

Before analyzing any data, understand its structure:
Table-level questions:
  • How many rows and columns?
  • What is the grain (one row per what)?
  • What is the primary key? Is it unique?
  • When was the data last updated?
  • How far back does the data go?
Column classification: Categorize each column as one of:
  • Identifier: Unique keys, foreign keys, entity IDs
  • Dimension: Categorical attributes for grouping/filtering (status, type, region, category)
  • Metric: Quantitative values for measurement (revenue, count, duration, score)
  • Temporal: Dates and timestamps (created_at, updated_at, event_date)
  • Text: Free-form text fields (description, notes, name)
  • Boolean: True/false flags
  • Structural: JSON, arrays, nested structures
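The table-level questions above can be checked mechanically before any analysis. A minimal pure-Python sketch; the `events` rows and column names are hypothetical, invented only to illustrate the primary-key uniqueness check:

```python
from collections import Counter

def structural_profile(rows, key_columns):
    """Answer the table-level questions: row/column counts and
    whether the candidate primary key is actually unique."""
    columns = sorted({col for row in rows for col in row})
    keys = [tuple(row.get(c) for c in key_columns) for row in rows]
    duplicates = [k for k, n in Counter(keys).items() if n > 1]
    return {
        "row_count": len(rows),
        "column_count": len(columns),
        "columns": columns,
        "pk_is_unique": not duplicates,
        "duplicate_keys": duplicates,
    }

# Hypothetical sample data: event_id should be the primary key.
events = [
    {"event_id": 1, "user_id": "u1", "event_type": "click"},
    {"event_id": 2, "user_id": "u1", "event_type": "view"},
    {"event_id": 2, "user_id": "u2", "event_type": "purchase"},  # duplicate key
]
profile = structural_profile(events, key_columns=["event_id"])
```

Running this surfaces the duplicate `event_id = 2` immediately, answering "is the primary key unique?" with evidence rather than assumption.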

Phase 2: Column-Level Profiling

For each column, compute:
All columns:
  • Null count and null rate
  • Distinct count and cardinality ratio (distinct / total)
  • Most common values (top 5-10 with frequencies)
  • Least common values (bottom 5 to spot anomalies)
Numeric columns (metrics):
  • min, max, mean, median (p50)
  • standard deviation
  • percentiles: p1, p5, p25, p75, p95, p99
  • zero count
  • negative count (if unexpected)
String columns (dimensions, text):
  • min length, max length, avg length
  • empty string count
  • pattern analysis (do values follow a format?)
  • case consistency (all upper, all lower, mixed?)
  • leading/trailing whitespace count
Date/timestamp columns:
  • min date, max date
  • null dates
  • future dates (if unexpected)
  • distribution by month/week
  • gaps in time series
Boolean columns:
  • true count, false count, null count
  • true rate
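The numeric-column statistics above can be computed with the standard library alone. A sketch using `statistics.quantiles` (the sample values are made up):

```python
import statistics

def profile_numeric(values):
    """Compute the numeric-column statistics listed above.
    `values` may contain None, which is counted as null."""
    nulls = sum(v is None for v in values)
    xs = sorted(v for v in values if v is not None)
    q = statistics.quantiles(xs, n=100, method="inclusive")  # q[k-1] = p_k
    return {
        "null_rate": nulls / len(values),
        "distinct": len(set(xs)),
        "min": xs[0], "max": xs[-1],
        "mean": statistics.mean(xs),
        "median": statistics.median(xs),
        "stdev": statistics.stdev(xs),
        "p5": q[4], "p95": q[94],
        "zero_count": xs.count(0),
        "negative_count": sum(x < 0 for x in xs),
    }

stats = profile_numeric([0, 10, 20, None, 30, 40, -5])
```

The same shape extends naturally to the string, date, and boolean checklists: one small function per column class, each returning a flat dict that is easy to tabulate.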

Phase 3: Relationship Discovery

After profiling individual columns:
  • Foreign key candidates: ID columns that might link to other tables
  • Hierarchies: Columns that form natural drill-down paths (country > state > city)
  • Correlations: Numeric columns that move together
  • Derived columns: Columns that appear to be computed from others
  • Redundant columns: Columns with identical or near-identical information
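One cheap test for foreign-key candidates is containment: what share of a column's values exist in the suspected parent table's key set? A sketch with hypothetical `orders`/`users` data:

```python
def fk_candidate_score(child_values, parent_keys):
    """Share of non-null child values found in the parent key set.
    A score near 1.0 marks the column as a foreign-key candidate;
    anything below it quantifies orphaned references."""
    parent = set(parent_keys)
    vals = [v for v in child_values if v is not None]
    if not vals:
        return 0.0
    return sum(v in parent for v in vals) / len(vals)

# Hypothetical tables: does orders.user_id reference users.id?
users = ["u1", "u2", "u3"]
order_user_ids = ["u1", "u1", "u2", "u9", None]  # "u9" is an orphan
score = fk_candidate_score(order_user_ids, users)
```

A score of 0.75 here both suggests the relationship and flags the referential-integrity problem in one number.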

Quality Assessment Framework

Completeness Score

Rate each column:
  • Complete (>99% non-null): Green
  • Mostly complete (95-99%): Yellow -- investigate the nulls
  • Incomplete (80-95%): Orange -- understand why and whether it matters
  • Sparse (<80%): Red -- may not be usable without imputation
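The traffic-light scale translates directly into a threshold function, which makes the rating reproducible across analysts:

```python
def completeness_rating(non_null_rate):
    """Map a column's non-null rate onto the traffic-light scale above."""
    if non_null_rate > 0.99:
        return "green"   # complete
    if non_null_rate >= 0.95:
        return "yellow"  # mostly complete: investigate the nulls
    if non_null_rate >= 0.80:
        return "orange"  # incomplete: understand why and whether it matters
    return "red"         # sparse: may not be usable without imputation

ratings = {rate: completeness_rating(rate) for rate in (1.0, 0.97, 0.85, 0.5)}
```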

Consistency Checks

Look for:
  • Value format inconsistency: Same concept represented differently ("USA", "US", "United States", "us")
  • Type inconsistency: Numbers stored as strings, dates in various formats
  • Referential integrity: Foreign keys that don't match any parent record
  • Business rule violations: Negative quantities, end dates before start dates, percentages > 100
  • Cross-column consistency: Status = "completed" but completed_at is null
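Business-rule and cross-column checks are easiest to keep honest as a single row-level validator. A sketch; the column names (`quantity`, `start_date`, `status`, `completed_at`) are illustrative, not a fixed schema:

```python
from datetime import date

def consistency_violations(row):
    """Flag the business-rule and cross-column violations described above."""
    issues = []
    if row.get("quantity") is not None and row["quantity"] < 0:
        issues.append("negative quantity")
    if row.get("start_date") and row.get("end_date") \
            and row["end_date"] < row["start_date"]:
        issues.append("end date before start date")
    if row.get("status") == "completed" and row.get("completed_at") is None:
        issues.append("status completed but completed_at is null")
    return issues

bad_row = {
    "quantity": -2,
    "start_date": date(2024, 3, 1),
    "end_date": date(2024, 2, 1),
    "status": "completed",
    "completed_at": None,
}
issues = consistency_violations(bad_row)
```

Aggregating the issue strings across all rows gives a violation frequency table, which is usually more actionable than a single pass/fail flag.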

Accuracy Indicators

Red flags that suggest accuracy issues:
  • Placeholder values: 0, -1, 999999, "N/A", "TBD", "test", "xxx"
  • Default values: Suspiciously high frequency of a single value
  • Stale data: updated_at shows no recent changes in an active system
  • Impossible values: Ages > 150, dates in the far future, negative durations
  • Round number bias: All values ending in 0 or 5 (suggests estimation, not measurement)
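The placeholder and default-value flags can be scanned for in one pass. A sketch; the placeholder set comes from the list above, and the 50% frequency cutoff for "suspicious default" is an arbitrary illustration, not a standard threshold:

```python
from collections import Counter

# Common placeholder values, per the red-flag list above.
PLACEHOLDERS = {0, -1, 999999, "N/A", "TBD", "test", "xxx"}

def placeholder_report(values):
    """Count suspected placeholder values and flag any single value
    whose frequency looks suspiciously high (a possible default)."""
    counts = Counter(values)
    top_value, top_count = counts.most_common(1)[0]
    return {
        "placeholder_count": sum(n for v, n in counts.items() if v in PLACEHOLDERS),
        "suspicious_default": top_value if top_count / len(values) > 0.5 else None,
    }

report = placeholder_report(["N/A", "N/A", "N/A", "premium", "basic"])
```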

Timeliness Assessment

  • When was the table last updated?
  • What is the expected update frequency?
  • Is there a lag between event time and load time?
  • Are there gaps in the time series?
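The last question, gaps in the time series, is mechanical to answer once you know the expected cadence. A sketch assuming a daily series (the cadence is a per-dataset assumption):

```python
from datetime import date, timedelta

def find_gaps(dates, expected_step=timedelta(days=1)):
    """Return the missing dates in a series that should advance by
    `expected_step` (daily here, an assumption about this dataset)."""
    gaps = []
    ds = sorted(set(dates))
    for prev, cur in zip(ds, ds[1:]):
        d = prev + expected_step
        while d < cur:
            gaps.append(d)
            d += expected_step
    return gaps

gaps = find_gaps([date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 5)])
```

A non-empty result answers two questions at once: there are gaps, and their dates hint at whether the cause is a pipeline outage or genuinely missing source data.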

Pattern Discovery Techniques

Distribution Analysis

For numeric columns, characterize the distribution:
  • Normal: Mean and median are close, bell-shaped
  • Skewed right: Long tail of high values (common for revenue, session duration)
  • Skewed left: Long tail of low values (less common)
  • Bimodal: Two peaks (suggests two distinct populations)
  • Power law: Few very large values, many small ones (common for user activity)
  • Uniform: Roughly equal frequency across range (often synthetic or random)
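A first-pass label can come from the mean-versus-median comparison above. A sketch; the 10%-of-stdev threshold for "close" is an arbitrary illustration, and a real characterization would also inspect a histogram:

```python
import statistics

def characterize(values):
    """Rough distribution label from the mean/median comparison:
    mean > median suggests a right tail, mean < median a left tail."""
    mean, median = statistics.mean(values), statistics.median(values)
    spread = statistics.stdev(values) or 1.0  # avoid dividing by zero spread
    if abs(mean - median) < 0.1 * spread:
        return "roughly symmetric (possibly normal)"
    return "skewed right" if mean > median else "skewed left"

label = characterize([1, 1, 2, 2, 3, 3, 100])  # one long right tail
```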

Temporal Patterns

For time series data, look for:
  • Trend: Sustained upward or downward movement
  • Seasonality: Repeating patterns (weekly, monthly, quarterly, annual)
  • Day-of-week effects: Weekday vs. weekend differences
  • Holiday effects: Drops or spikes around known holidays
  • Change points: Sudden shifts in level or trend
  • Anomalies: Individual data points that break the pattern
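Day-of-week effects, the easiest of these to verify, fall out of a simple weekday grouping. A sketch over made-up `(date, value)` pairs:

```python
from collections import defaultdict
from datetime import date

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

def day_of_week_means(points):
    """Average a metric by weekday to surface day-of-week effects.
    `points` is an iterable of (date, value) pairs."""
    buckets = defaultdict(list)
    for d, v in points:
        buckets[DAYS[d.weekday()]].append(v)
    return {day: sum(vs) / len(vs) for day, vs in buckets.items()}

means = day_of_week_means([
    (date(2024, 1, 1), 100),  # a Monday
    (date(2024, 1, 6), 40),   # a Saturday
    (date(2024, 1, 8), 120),  # the next Monday
])
```

The same grouping idea extends to month-of-year for seasonality; trend and change-point detection need more machinery (e.g., rolling means) than fits in a sketch.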

Segmentation Discovery

Identify natural segments by:
  • Finding categorical columns with 3-20 distinct values
  • Comparing metric distributions across segment values
  • Looking for segments with significantly different behavior
  • Testing whether segments are homogeneous or contain sub-segments
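Comparing a metric across segment values is a group-and-summarize operation. A sketch with a hypothetical `plan`/`revenue` dataset:

```python
from collections import defaultdict
import statistics

def segment_metric(rows, segment_col, metric_col):
    """Summarize a metric per segment value; per the guidance above,
    a column is a segmentation candidate when it has 3-20 distinct values."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[segment_col]].append(row[metric_col])
    return {seg: {"n": len(vs), "mean": statistics.mean(vs)}
            for seg, vs in groups.items()}

rows = [  # hypothetical plan/revenue data
    {"plan": "free", "revenue": 0},
    {"plan": "free", "revenue": 0},
    {"plan": "pro", "revenue": 29},
    {"plan": "pro", "revenue": 31},
]
by_plan = segment_metric(rows, "plan", "revenue")
```

Large differences between the per-segment means (here, 0 vs. ~30) are the signal that a segment behaves distinctly and deserves its own drill-down.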

Correlation Exploration

Between numeric columns:
  • Compute correlation matrix for all metric pairs
  • Flag strong correlations (|r| > 0.7) for investigation
  • Note: Correlation does not imply causation -- flag this explicitly
  • Check for non-linear relationships (e.g., quadratic, logarithmic)

Schema Understanding and Documentation

Schema Documentation Template

When documenting a dataset for team use:

Table: [schema.table_name]

Description: [What this table represents]
Grain: [One row per...]
Primary Key: [column(s)]
Row Count: [approximate, with date]
Update Frequency: [real-time / hourly / daily / weekly]
Owner: [team or person responsible]

Key Columns

| Column | Type | Description | Example Values | Notes |
|---|---|---|---|---|
| user_id | STRING | Unique user identifier | "usr_abc123" | FK to users.id |
| event_type | STRING | Type of event | "click", "view", "purchase" | 15 distinct values |
| revenue | DECIMAL | Transaction revenue in USD | 29.99, 149.00 | Null for non-purchase events |
| created_at | TIMESTAMP | When the event occurred | 2024-01-15 14:23:01 | Partitioned on this column |

Relationships

  • Joins to users on user_id
  • Joins to products on product_id
  • Parent of event_details (1:many on event_id)

Known Issues

  • [List any known data quality issues]
  • [Note any gotchas for analysts]

Common Query Patterns

  • [Typical use cases for this table]

Schema Exploration Queries

When connected to a data warehouse, use these patterns to discover schema:
```sql
-- List all tables in a schema (PostgreSQL)
SELECT table_name, table_type
FROM information_schema.tables
WHERE table_schema = 'public'
ORDER BY table_name;

-- Column details (PostgreSQL)
SELECT column_name, data_type, is_nullable, column_default
FROM information_schema.columns
WHERE table_name = 'my_table'
ORDER BY ordinal_position;

-- Table sizes (PostgreSQL)
SELECT relname, pg_size_pretty(pg_total_relation_size(relid))
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC;

-- Row counts for all tables (general pattern)
-- Run per-table: SELECT COUNT(*) FROM table_name
```

Lineage and Dependencies

When exploring an unfamiliar data environment:
  1. Start with the "output" tables (what reports or dashboards consume)
  2. Trace upstream: What tables feed into them?
  3. Identify raw/staging/mart layers
  4. Map the transformation chain from raw data to analytical tables
  5. Note where data is enriched, filtered, or aggregated