
Data Exploration Skill

Systematic methodology for profiling datasets, assessing data quality, discovering patterns, and understanding schemas.

Data Profiling Methodology

Phase 1: Structural Understanding

Before analyzing any data, understand its structure:
Table-level questions:
  • How many rows and columns?
  • What is the grain (one row per what)?
  • What is the primary key? Is it unique?
  • When was the data last updated?
  • How far back does the data go?
Column classification: Categorize each column as one of:
  • Identifier: Unique keys, foreign keys, entity IDs
  • Dimension: Categorical attributes for grouping/filtering (status, type, region, category)
  • Metric: Quantitative values for measurement (revenue, count, duration, score)
  • Temporal: Dates and timestamps (created_at, updated_at, event_date)
  • Text: Free-form text fields (description, notes, name)
  • Boolean: True/false flags
  • Structural: JSON, arrays, nested structures
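The table-level questions above can be checked mechanically before any analysis. A minimal pure-Python sketch; the `events` rows and column names are hypothetical, invented only to illustrate the primary-key uniqueness check:

```python
from collections import Counter

def structural_profile(rows, key_columns):
    """Answer the table-level questions: row/column counts and
    whether the candidate primary key is actually unique."""
    columns = sorted({col for row in rows for col in row})
    keys = [tuple(row.get(c) for c in key_columns) for row in rows]
    duplicates = [k for k, n in Counter(keys).items() if n > 1]
    return {
        "row_count": len(rows),
        "column_count": len(columns),
        "columns": columns,
        "pk_is_unique": not duplicates,
        "duplicate_keys": duplicates,
    }

# Hypothetical sample data: event_id should be the primary key.
events = [
    {"event_id": 1, "user_id": "u1", "event_type": "click"},
    {"event_id": 2, "user_id": "u1", "event_type": "view"},
    {"event_id": 2, "user_id": "u2", "event_type": "purchase"},  # duplicate key
]
profile = structural_profile(events, key_columns=["event_id"])
```

Running this surfaces the duplicate `event_id = 2` immediately, answering "is the primary key unique?" with evidence rather than assumption.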

Phase 2: Column-Level Profiling

For each column, compute:
All columns:
  • Null count and null rate
  • Distinct count and cardinality ratio (distinct / total)
  • Most common values (top 5-10 with frequencies)
  • Least common values (bottom 5 to spot anomalies)
Numeric columns (metrics):
  • min, max, mean, median (p50)
  • standard deviation
  • percentiles: p1, p5, p25, p75, p95, p99
  • zero count
  • negative count (if unexpected)
String columns (dimensions, text):
  • min length, max length, avg length
  • empty string count
  • pattern analysis (do values follow a format?)
  • case consistency (all upper, all lower, mixed?)
  • leading/trailing whitespace count
Date/timestamp columns:
  • min date, max date
  • null dates
  • future dates (if unexpected)
  • distribution by month/week
  • gaps in time series
Boolean columns:
  • true count, false count, null count
  • true rate
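The numeric-column statistics above can be computed with the standard library alone. A sketch using `statistics.quantiles` (the sample values are made up):

```python
import statistics

def profile_numeric(values):
    """Compute the numeric-column statistics listed above.
    `values` may contain None, which is counted as null."""
    nulls = sum(v is None for v in values)
    xs = sorted(v for v in values if v is not None)
    q = statistics.quantiles(xs, n=100, method="inclusive")  # q[k-1] = p_k
    return {
        "null_rate": nulls / len(values),
        "distinct": len(set(xs)),
        "min": xs[0], "max": xs[-1],
        "mean": statistics.mean(xs),
        "median": statistics.median(xs),
        "stdev": statistics.stdev(xs),
        "p5": q[4], "p95": q[94],
        "zero_count": xs.count(0),
        "negative_count": sum(x < 0 for x in xs),
    }

stats = profile_numeric([0, 10, 20, None, 30, 40, -5])
```

The same shape extends naturally to the string, date, and boolean checklists: one small function per column class, each returning a flat dict that is easy to tabulate.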

Phase 3: Relationship Discovery

After profiling individual columns:
  • Foreign key candidates: ID columns that might link to other tables
  • Hierarchies: Columns that form natural drill-down paths (country > state > city)
  • Correlations: Numeric columns that move together
  • Derived columns: Columns that appear to be computed from others
  • Redundant columns: Columns with identical or near-identical information
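One cheap test for foreign-key candidates is containment: what share of a column's values exist in the suspected parent table's key set? A sketch with hypothetical `orders`/`users` data:

```python
def fk_candidate_score(child_values, parent_keys):
    """Share of non-null child values found in the parent key set.
    A score near 1.0 marks the column as a foreign-key candidate;
    anything below it quantifies orphaned references."""
    parent = set(parent_keys)
    vals = [v for v in child_values if v is not None]
    if not vals:
        return 0.0
    return sum(v in parent for v in vals) / len(vals)

# Hypothetical tables: does orders.user_id reference users.id?
users = ["u1", "u2", "u3"]
order_user_ids = ["u1", "u1", "u2", "u9", None]  # "u9" is an orphan
score = fk_candidate_score(order_user_ids, users)
```

A score of 0.75 here both suggests the relationship and flags the referential-integrity problem in one number.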

Quality Assessment Framework

Completeness Score

Rate each column:
  • Complete (>99% non-null): Green
  • Mostly complete (95-99%): Yellow -- investigate the nulls
  • Incomplete (80-95%): Orange -- understand why and whether it matters
  • Sparse (<80%): Red -- may not be usable without imputation
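The traffic-light scale translates directly into a threshold function, which makes the rating reproducible across analysts:

```python
def completeness_rating(non_null_rate):
    """Map a column's non-null rate onto the traffic-light scale above."""
    if non_null_rate > 0.99:
        return "green"   # complete
    if non_null_rate >= 0.95:
        return "yellow"  # mostly complete: investigate the nulls
    if non_null_rate >= 0.80:
        return "orange"  # incomplete: understand why and whether it matters
    return "red"         # sparse: may not be usable without imputation

ratings = {rate: completeness_rating(rate) for rate in (1.0, 0.97, 0.85, 0.5)}
```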

Consistency Checks

Look for:
  • Value format inconsistency: Same concept represented differently ("USA", "US", "United States", "us")
  • Type inconsistency: Numbers stored as strings, dates in various formats
  • Referential integrity: Foreign keys that don't match any parent record
  • Business rule violations: Negative quantities, end dates before start dates, percentages > 100
  • Cross-column consistency: Status = "completed" but completed_at is null
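Business-rule and cross-column checks are easiest to keep honest as a single row-level validator. A sketch; the column names (`quantity`, `start_date`, `status`, `completed_at`) are illustrative, not a fixed schema:

```python
from datetime import date

def consistency_violations(row):
    """Flag the business-rule and cross-column violations described above."""
    issues = []
    if row.get("quantity") is not None and row["quantity"] < 0:
        issues.append("negative quantity")
    if row.get("start_date") and row.get("end_date") \
            and row["end_date"] < row["start_date"]:
        issues.append("end date before start date")
    if row.get("status") == "completed" and row.get("completed_at") is None:
        issues.append("status completed but completed_at is null")
    return issues

bad_row = {
    "quantity": -2,
    "start_date": date(2024, 3, 1),
    "end_date": date(2024, 2, 1),
    "status": "completed",
    "completed_at": None,
}
issues = consistency_violations(bad_row)
```

Aggregating the issue strings across all rows gives a violation frequency table, which is usually more actionable than a single pass/fail flag.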

Accuracy Indicators

Red flags that suggest accuracy issues:
  • Placeholder values: 0, -1, 999999, "N/A", "TBD", "test", "xxx"
  • Default values: Suspiciously high frequency of a single value
  • Stale data: updated_at shows no recent changes in an active system
  • Impossible values: Ages > 150, dates in the far future, negative durations
  • Round number bias: All values ending in 0 or 5 (suggests estimation, not measurement)
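The placeholder and default-value flags can be scanned for in one pass. A sketch; the placeholder set comes from the list above, and the 50% frequency cutoff for "suspicious default" is an arbitrary illustration, not a standard threshold:

```python
from collections import Counter

# Common placeholder values, per the red-flag list above.
PLACEHOLDERS = {0, -1, 999999, "N/A", "TBD", "test", "xxx"}

def placeholder_report(values):
    """Count suspected placeholder values and flag any single value
    whose frequency looks suspiciously high (a possible default)."""
    counts = Counter(values)
    top_value, top_count = counts.most_common(1)[0]
    return {
        "placeholder_count": sum(n for v, n in counts.items() if v in PLACEHOLDERS),
        "suspicious_default": top_value if top_count / len(values) > 0.5 else None,
    }

report = placeholder_report(["N/A", "N/A", "N/A", "premium", "basic"])
```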

Timeliness Assessment

  • When was the table last updated?
  • What is the expected update frequency?
  • Is there a lag between event time and load time?
  • Are there gaps in the time series?
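The last question, gaps in the time series, is mechanical to answer once you know the expected cadence. A sketch assuming a daily series (the cadence is a per-dataset assumption):

```python
from datetime import date, timedelta

def find_gaps(dates, expected_step=timedelta(days=1)):
    """Return the missing dates in a series that should advance by
    `expected_step` (daily here, an assumption about this dataset)."""
    gaps = []
    ds = sorted(set(dates))
    for prev, cur in zip(ds, ds[1:]):
        d = prev + expected_step
        while d < cur:
            gaps.append(d)
            d += expected_step
    return gaps

gaps = find_gaps([date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 5)])
```

A non-empty result answers two questions at once: there are gaps, and their dates hint at whether the cause is a pipeline outage or genuinely missing source data.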

Pattern Discovery Techniques

Distribution Analysis

For numeric columns, characterize the distribution:
  • Normal: Mean and median are close, bell-shaped
  • Skewed right: Long tail of high values (common for revenue, session duration)
  • Skewed left: Long tail of low values (less common)
  • Bimodal: Two peaks (suggests two distinct populations)
  • Power law: Few very large values, many small ones (common for user activity)
  • Uniform: Roughly equal frequency across range (often synthetic or random)
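A first-pass label can come from the mean-versus-median comparison above. A sketch; the 10%-of-stdev threshold for "close" is an arbitrary illustration, and a real characterization would also inspect a histogram:

```python
import statistics

def characterize(values):
    """Rough distribution label from the mean/median comparison:
    mean > median suggests a right tail, mean < median a left tail."""
    mean, median = statistics.mean(values), statistics.median(values)
    spread = statistics.stdev(values) or 1.0  # avoid dividing by zero spread
    if abs(mean - median) < 0.1 * spread:
        return "roughly symmetric (possibly normal)"
    return "skewed right" if mean > median else "skewed left"

label = characterize([1, 1, 2, 2, 3, 3, 100])  # one long right tail
```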

Temporal Patterns

For time series data, look for:
  • Trend: Sustained upward or downward movement
  • Seasonality: Repeating patterns (weekly, monthly, quarterly, annual)
  • Day-of-week effects: Weekday vs. weekend differences
  • Holiday effects: Drops or spikes around known holidays
  • Change points: Sudden shifts in level or trend
  • Anomalies: Individual data points that break the pattern
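Day-of-week effects, the easiest of these to verify, fall out of a simple weekday grouping. A sketch over made-up `(date, value)` pairs:

```python
from collections import defaultdict
from datetime import date

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

def day_of_week_means(points):
    """Average a metric by weekday to surface day-of-week effects.
    `points` is an iterable of (date, value) pairs."""
    buckets = defaultdict(list)
    for d, v in points:
        buckets[DAYS[d.weekday()]].append(v)
    return {day: sum(vs) / len(vs) for day, vs in buckets.items()}

means = day_of_week_means([
    (date(2024, 1, 1), 100),  # a Monday
    (date(2024, 1, 6), 40),   # a Saturday
    (date(2024, 1, 8), 120),  # the next Monday
])
```

The same grouping idea extends to month-of-year for seasonality; trend and change-point detection need more machinery (e.g., rolling means) than fits in a sketch.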

Segmentation Discovery

Identify natural segments by:
  • Finding categorical columns with 3-20 distinct values
  • Comparing metric distributions across segment values
  • Looking for segments with significantly different behavior
  • Testing whether segments are homogeneous or contain sub-segments
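Comparing a metric across segment values is a group-and-summarize operation. A sketch with a hypothetical `plan`/`revenue` dataset:

```python
from collections import defaultdict
import statistics

def segment_metric(rows, segment_col, metric_col):
    """Summarize a metric per segment value; per the guidance above,
    a column is a segmentation candidate when it has 3-20 distinct values."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[segment_col]].append(row[metric_col])
    return {seg: {"n": len(vs), "mean": statistics.mean(vs)}
            for seg, vs in groups.items()}

rows = [  # hypothetical plan/revenue data
    {"plan": "free", "revenue": 0},
    {"plan": "free", "revenue": 0},
    {"plan": "pro", "revenue": 29},
    {"plan": "pro", "revenue": 31},
]
by_plan = segment_metric(rows, "plan", "revenue")
```

Large differences between the per-segment means (here, 0 vs. ~30) are the signal that a segment behaves distinctly and deserves its own drill-down.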

Correlation Exploration

Between numeric columns:
  • Compute correlation matrix for all metric pairs
  • Flag strong correlations (|r| > 0.7) for investigation
  • Note: Correlation does not imply causation -- flag this explicitly
  • Check for non-linear relationships (e.g., quadratic, logarithmic)

Schema Understanding and Documentation

Schema Documentation Template

When documenting a dataset for team use:

Table: [schema.table_name]

Description: [What this table represents]
Grain: [One row per...]
Primary Key: [column(s)]
Row Count: [approximate, with date]
Update Frequency: [real-time / hourly / daily / weekly]
Owner: [team or person responsible]

Key Columns

| Column | Type | Description | Example Values | Notes |
|---|---|---|---|---|
| user_id | STRING | Unique user identifier | "usr_abc123" | FK to users.id |
| event_type | STRING | Type of event | "click", "view", "purchase" | 15 distinct values |
| revenue | DECIMAL | Transaction revenue in USD | 29.99, 149.00 | Null for non-purchase events |
| created_at | TIMESTAMP | When the event occurred | 2024-01-15 14:23:01 | Partitioned on this column |

Relationships

  • Joins to users on user_id
  • Joins to products on product_id
  • Parent of event_details (1:many on event_id)

Known Issues

  • [List any known data quality issues]
  • [Note any gotchas for analysts]

Common Query Patterns

  • [Typical use cases for this table]

Schema Exploration Queries

When connected to a data warehouse, use these patterns to discover schema:
```sql
-- List all tables in a schema (PostgreSQL)
SELECT table_name, table_type
FROM information_schema.tables
WHERE table_schema = 'public'
ORDER BY table_name;

-- Column details (PostgreSQL)
SELECT column_name, data_type, is_nullable, column_default
FROM information_schema.columns
WHERE table_name = 'my_table'
ORDER BY ordinal_position;

-- Table sizes (PostgreSQL)
SELECT relname, pg_size_pretty(pg_total_relation_size(relid))
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC;

-- Row counts for all tables (general pattern)
-- Run per-table: SELECT COUNT(*) FROM table_name
```

Lineage and Dependencies

When exploring an unfamiliar data environment:
  1. Start with the "output" tables (what reports or dashboards consume)
  2. Trace upstream: What tables feed into them?
  3. Identify raw/staging/mart layers
  4. Map the transformation chain from raw data to analytical tables
  5. Note where data is enriched, filtered, or aggregated