Data Exploration Skill
Systematic methodology for profiling datasets, assessing data quality, discovering patterns, and understanding schemas.
Data Profiling Methodology
Phase 1: Structural Understanding
Before analyzing any data, understand its structure:
Table-level questions:
- How many rows and columns?
- What is the grain (one row per what)?
- What is the primary key? Is it unique?
- When was the data last updated?
- How far back does the data go?
Column classification:
Categorize each column as one of:
- Identifier: Unique keys, foreign keys, entity IDs
- Dimension: Categorical attributes for grouping/filtering (status, type, region, category)
- Metric: Quantitative values for measurement (revenue, count, duration, score)
- Temporal: Dates and timestamps (created_at, updated_at, event_date)
- Text: Free-form text fields (description, notes, name)
- Boolean: True/false flags
- Structural: JSON, arrays, nested structures
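The classification above can be partially automated with name and type heuristics. A minimal sketch, assuming in-memory sampled values (the `classify_column` name and the 0.5 cardinality threshold are illustrative choices, not part of the methodology):

```python
from datetime import datetime

def classify_column(name, values):
    """Heuristically classify a column from its name and sampled values."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return "unknown"
    if all(isinstance(v, bool) for v in non_null):
        return "boolean"
    if all(isinstance(v, datetime) for v in non_null) or name.endswith(("_at", "_date")):
        return "temporal"
    if name == "id" or name.endswith("_id"):
        return "identifier"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in non_null):
        return "metric"
    # Low-cardinality strings behave like dimensions; high-cardinality like free text
    distinct_ratio = len(set(non_null)) / len(non_null)
    return "dimension" if distinct_ratio < 0.5 else "text"

col_type = classify_column("status", ["open", "open", "closed", "open", "open"])
```

A real classifier would also consult the warehouse's declared types; this sketch only shows how the categories map to simple checks.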
Phase 2: Column-Level Profiling
For each column, compute:
All columns:
- Null count and null rate
- Distinct count and cardinality ratio (distinct / total)
- Most common values (top 5-10 with frequencies)
- Least common values (bottom 5 to spot anomalies)
Numeric columns (metrics):
- min, max, mean, median (p50)
- standard deviation
- percentiles: p1, p5, p25, p75, p95, p99
- zero count
- negative count (if unexpected)
String columns (dimensions, text):
- min length, max length, avg length
- empty string count
- pattern analysis (do values follow a format?)
- case consistency (all upper, all lower, mixed?)
- leading/trailing whitespace count
Date/timestamp columns:
- min date, max date
- null dates
- future dates (if unexpected)
- distribution by month/week
- gaps in time series
Boolean columns:
- true count, false count, null count
- true rate
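These per-column statistics fall out of a single pass over the values. A minimal sketch for a numeric column using only the standard library (the `profile_numeric` name and the nearest-rank percentile method are illustrative assumptions):

```python
import statistics

def profile_numeric(values):
    """Profile a numeric column: nulls, distincts, spread, and percentiles."""
    non_null = sorted(v for v in values if v is not None)
    n = len(values)

    def pct(p):
        # Nearest-rank percentile on the sorted non-null values
        idx = min(len(non_null) - 1, int(p / 100 * len(non_null)))
        return non_null[idx]

    return {
        "null_rate": (n - len(non_null)) / n,
        "distinct": len(set(non_null)),
        "min": non_null[0],
        "max": non_null[-1],
        "mean": statistics.mean(non_null),
        "median": statistics.median(non_null),
        "p95": pct(95),
        "zero_count": sum(1 for v in non_null if v == 0),
        "negative_count": sum(1 for v in non_null if v < 0),
    }

profile = profile_numeric([10, 0, 25, None, 40, -5, 25])
```

In a warehouse you would push these aggregations into SQL rather than pull rows out; the sketch just shows which statistics the profile collects.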
Phase 3: Relationship Discovery
After profiling individual columns:
- Foreign key candidates: ID columns that might link to other tables
- Hierarchies: Columns that form natural drill-down paths (country > state > city)
- Correlations: Numeric columns that move together
- Derived columns: Columns that appear to be computed from others
- Redundant columns: Columns with identical or near-identical information
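Foreign-key candidates can be screened mechanically: an ID-like column whose values are (almost) all contained in another table's key column is worth investigating. A sketch under that assumption (the `fk_candidate` name and the 0.95 threshold are illustrative):

```python
def fk_candidate(child_values, parent_keys, threshold=0.95):
    """Return the match rate of child values found in the parent key set,
    and whether it clears the candidate threshold."""
    child = [v for v in child_values if v is not None]
    if not child:
        return 0.0, False
    parent = set(parent_keys)
    rate = sum(1 for v in child if v in parent) / len(child)
    return rate, rate >= threshold

rate, ok = fk_candidate(["u1", "u2", "u2", "u9"], ["u1", "u2", "u3"])
```

A match rate just below 1.0 is itself a finding: it points at orphaned child rows, i.e. a referential-integrity issue for the next section.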
Quality Assessment Framework
Completeness Score
Rate each column:
- Complete (>99% non-null): Green
- Mostly complete (95-99%): Yellow -- investigate the nulls
- Incomplete (80-95%): Orange -- understand why and whether it matters
- Sparse (<80%): Red -- may not be usable without imputation
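The traffic-light scale above maps directly to a small function (a minimal sketch; the `completeness_rating` name is illustrative):

```python
def completeness_rating(null_count, total):
    """Map a column's non-null rate to the traffic-light scale."""
    non_null_rate = (total - null_count) / total
    if non_null_rate > 0.99:
        return "green"
    if non_null_rate >= 0.95:
        return "yellow"
    if non_null_rate >= 0.80:
        return "orange"
    return "red"
```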
Consistency Checks
Look for:
- Value format inconsistency: Same concept represented differently ("USA", "US", "United States", "us")
- Type inconsistency: Numbers stored as strings, dates in various formats
- Referential integrity: Foreign keys that don't match any parent record
- Business rule violations: Negative quantities, end dates before start dates, percentages > 100
- Cross-column consistency: Status = "completed" but completed_at is null
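Checks like these reduce to row-level predicates that return violation counts. A sketch over row dicts (the specific rule set and column names are illustrative assumptions):

```python
def check_consistency(rows):
    """Count violations of a few illustrative business rules per row dict."""
    violations = {"end_before_start": 0, "completed_missing_ts": 0, "negative_qty": 0}
    for row in rows:
        # Business rule: end dates must not precede start dates
        if row.get("end_date") and row.get("start_date") and row["end_date"] < row["start_date"]:
            violations["end_before_start"] += 1
        # Cross-column rule: completed rows need a completion timestamp
        if row.get("status") == "completed" and row.get("completed_at") is None:
            violations["completed_missing_ts"] += 1
        # Business rule: quantities must be non-negative
        if row.get("quantity", 0) < 0:
            violations["negative_qty"] += 1
    return violations

rows = [
    {"status": "completed", "completed_at": None, "quantity": 2,
     "start_date": "2024-01-05", "end_date": "2024-01-01"},
    {"status": "open", "completed_at": None, "quantity": -1},
]
report = check_consistency(rows)
```

Reporting counts rather than booleans matters: a handful of violations may be tolerable noise, while thousands indicate a broken pipeline.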
Accuracy Indicators
Red flags that suggest accuracy issues:
- Placeholder values: 0, -1, 999999, "N/A", "TBD", "test", "xxx"
- Default values: Suspiciously high frequency of a single value
- Stale data: updated_at shows no recent changes in an active system
- Impossible values: Ages > 150, dates in the far future, negative durations
- Round number bias: All values ending in 0 or 5 (suggests estimation, not measurement)
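The placeholder and suspicious-default checks can be scripted against a value sample. A sketch (the `PLACEHOLDERS` set is a starting point taken from the list above, and the 50% dominance threshold is an arbitrary assumption):

```python
from collections import Counter

# Illustrative placeholder set, seeded from the red flags listed above
PLACEHOLDERS = {0, -1, 999999, "N/A", "TBD", "test", "xxx"}

def accuracy_flags(values, default_rate_threshold=0.5):
    """Flag placeholder values and a suspiciously dominant single value."""
    non_null = [v for v in values if v is not None]
    placeholder_count = sum(1 for v in non_null if v in PLACEHOLDERS)
    top_value, top_count = Counter(non_null).most_common(1)[0]
    dominant = top_value if top_count / len(non_null) >= default_rate_threshold else None
    return {"placeholder_count": placeholder_count, "suspicious_default": dominant}

flags = accuracy_flags(["N/A", "CA", "CA", "CA", "NY", "TBD"])
```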
Timeliness Assessment
- When was the table last updated?
- What is the expected update frequency?
- Is there a lag between event time and load time?
- Are there gaps in the time series?
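The gap question can be answered mechanically once you have the distinct dates present. A minimal sketch that lists missing calendar days between the earliest and latest observation (the `find_date_gaps` name is illustrative):

```python
from datetime import date, timedelta

def find_date_gaps(dates):
    """Return calendar days missing between the min and max observed dates."""
    observed = set(dates)
    d, end = min(dates), max(dates)
    gaps = []
    while d <= end:
        if d not in observed:
            gaps.append(d)
        d += timedelta(days=1)
    return gaps

gaps = find_date_gaps([date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 5)])
```

For tables that legitimately skip weekends or holidays, the expected calendar must be adjusted before a gap counts as a problem.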
Pattern Discovery Techniques
Distribution Analysis
For numeric columns, characterize the distribution:
- Normal: Mean and median are close, bell-shaped
- Skewed right: Long tail of high values (common for revenue, session duration)
- Skewed left: Long tail of low values (less common)
- Bimodal: Two peaks (suggests two distinct populations)
- Power law: Few very large values, many small ones (common for user activity)
- Uniform: Roughly equal frequency across range (often synthetic or random)
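A first-pass skew check uses the gap between mean and median, scaled by spread. A rough sketch (the 0.2 cutoffs are arbitrary assumptions; a real analysis would also plot the histogram):

```python
import statistics

def characterize(values):
    """Roughly label a distribution by comparing mean and median."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return "constant"
    # Mean pulled above the median suggests a long right tail, and vice versa
    skew = (mean - median) / spread
    if skew > 0.2:
        return "skewed right"
    if skew < -0.2:
        return "skewed left"
    return "roughly symmetric"

label = characterize([1, 1, 2, 2, 3, 50])  # long right tail
```

Detecting bimodality or power laws needs more than summary statistics (e.g. a histogram or a log-log rank plot), so this sketch covers only the symmetric/skewed cases.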
Temporal Patterns
For time series data, look for:
- Trend: Sustained upward or downward movement
- Seasonality: Repeating patterns (weekly, monthly, quarterly, annual)
- Day-of-week effects: Weekday vs. weekend differences
- Holiday effects: Drops or spikes around known holidays
- Change points: Sudden shifts in level or trend
- Anomalies: Individual data points that break the pattern
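Day-of-week effects drop out of a simple grouping over (date, metric) pairs. A minimal sketch (the `weekday_profile` name is illustrative):

```python
from collections import defaultdict
from datetime import date

def weekday_profile(daily_values):
    """Average a daily metric by weekday (0=Monday .. 6=Sunday)."""
    buckets = defaultdict(list)
    for d, value in daily_values:
        buckets[d.weekday()].append(value)
    return {wd: sum(vs) / len(vs) for wd, vs in sorted(buckets.items())}

profile = weekday_profile([
    (date(2024, 1, 1), 100),  # Monday
    (date(2024, 1, 6), 40),   # Saturday
    (date(2024, 1, 8), 120),  # Monday
])
```

The same grouping idea extends to month-of-year for seasonality; trend and change-point detection need dedicated methods beyond this sketch.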
Segmentation Discovery
Identify natural segments by:
- Finding categorical columns with 3-20 distinct values
- Comparing metric distributions across segment values
- Looking for segments with significantly different behavior
- Testing whether segments are homogeneous or contain sub-segments
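Comparing metric distributions across segment values is a group-by with summary statistics per group. A sketch over row dicts (column names are illustrative):

```python
from collections import defaultdict
import statistics

def segment_summary(rows, segment_col, metric_col):
    """Summarize a metric per segment to spot differing behavior."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[segment_col]].append(row[metric_col])
    return {
        seg: {"n": len(vals), "mean": statistics.mean(vals), "median": statistics.median(vals)}
        for seg, vals in groups.items()
    }

rows = [
    {"plan": "free", "revenue": 0}, {"plan": "free", "revenue": 5},
    {"plan": "pro", "revenue": 50}, {"plan": "pro", "revenue": 70},
]
summary = segment_summary(rows, "plan", "revenue")
```

Segments with very different means or medians are candidates for separate analysis; tiny `n` per group warns that the difference may be noise.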
Correlation Exploration
Between numeric columns:
- Compute correlation matrix for all metric pairs
- Flag strong correlations (|r| > 0.7) for investigation
- Note: Correlation does not imply causation -- flag this explicitly
- Check for non-linear relationships (e.g., quadratic, logarithmic)
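The pairwise scan with the |r| > 0.7 flag can be sketched directly (Pearson's r is computed by hand here to stay standard-library only; `strong_pairs` is an illustrative name):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient for two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def strong_pairs(columns, threshold=0.7):
    """Flag metric pairs with |r| above the threshold for investigation."""
    names = list(columns)
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) > threshold:
                flagged.append((a, b, round(r, 3)))
    return flagged

flagged = strong_pairs({"x": [1, 2, 3, 4], "y": [2, 4, 6, 8], "z": [4, 1, 3, 2]})
```

Because Pearson's r only captures linear association, a near-zero r does not rule out the quadratic or logarithmic relationships mentioned above; a scatter plot is still the backstop.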
Schema Understanding and Documentation
Schema Documentation Template
When documenting a dataset for team use:
```markdown
Table: [schema.table_name]
Description: [What this table represents]
Grain: [One row per...]
Primary Key: [column(s)]
Row Count: [approximate, with date]
Update Frequency: [real-time / hourly / daily / weekly]
Owner: [team or person responsible]

Key Columns
| Column | Type | Description | Example Values | Notes |
|---|---|---|---|---|
| user_id | STRING | Unique user identifier | "usr_abc123" | FK to users.id |
| event_type | STRING | Type of event | "click", "view", "purchase" | 15 distinct values |
| revenue | DECIMAL | Transaction revenue in USD | 29.99, 149.00 | Null for non-purchase events |
| created_at | TIMESTAMP | When the event occurred | 2024-01-15 14:23:01 | Partitioned on this column |

Relationships
- Joins to users on user_id
- Joins to products on product_id
- Parent of event_details (1:many on event_id)

Known Issues
- [List any known data quality issues]
- [Note any gotchas for analysts]

Common Query Patterns
- [Typical use cases for this table]
```
Schema Exploration Queries
When connected to a data warehouse, use these patterns to discover schema:
```sql
-- List all tables in a schema (PostgreSQL)
SELECT table_name, table_type
FROM information_schema.tables
WHERE table_schema = 'public'
ORDER BY table_name;

-- Column details (PostgreSQL)
SELECT column_name, data_type, is_nullable, column_default
FROM information_schema.columns
WHERE table_name = 'my_table'
ORDER BY ordinal_position;

-- Table sizes (PostgreSQL)
SELECT relname, pg_size_pretty(pg_total_relation_size(relid))
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC;

-- Row counts for all tables (general pattern)
-- Run per-table: SELECT COUNT(*) FROM table_name
```
Lineage and Dependencies
When exploring an unfamiliar data environment:
- Start with the "output" tables (what reports or dashboards consume)
- Trace upstream: What tables feed into them?
- Identify raw/staging/mart layers
- Map the transformation chain from raw data to analytical tables
- Note where data is enriched, filtered, or aggregated