profiling-tables
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
ChineseData Profile
数据剖析
Generate a comprehensive profile of a table that a new team member could use to understand the data.
生成一份全面的表剖析报告,供新团队成员了解数据。
Step 1: Basic Metadata
步骤1:基础元数据
Query column metadata:
sql
SELECT COLUMN_NAME, DATA_TYPE, COMMENT
FROM <database>.INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = '<schema>' AND TABLE_NAME = '<table>'
ORDER BY ORDINAL_POSITIONIf the table name isn't fully qualified, search INFORMATION_SCHEMA.TABLES to locate it first.
查询列元数据:
sql
SELECT COLUMN_NAME, DATA_TYPE, COMMENT
FROM <database>.INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = '<schema>' AND TABLE_NAME = '<table>'
ORDER BY ORDINAL_POSITION如果表名不是全限定名,请先搜索INFORMATION_SCHEMA.TABLES来定位它。
Step 2: Size and Shape
步骤2:数据规模与形态
Run via :
run_sqlsql
SELECT
COUNT(*) as total_rows,
COUNT(*) / 1000000.0 as millions_of_rows
FROM <table>通过执行:
run_sqlsql
SELECT
COUNT(*) as total_rows,
COUNT(*) / 1000000.0 as millions_of_rows
FROM <table>Step 3: Column-Level Statistics
步骤3:列级统计信息
For each column, gather appropriate statistics based on data type:
针对每一列,根据数据类型收集相应的统计信息:
Numeric Columns
数值型列
sql
SELECT
MIN(column_name) as min_val,
MAX(column_name) as max_val,
AVG(column_name) as avg_val,
STDDEV(column_name) as std_dev,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY column_name) as median,
SUM(CASE WHEN column_name IS NULL THEN 1 ELSE 0 END) as null_count,
COUNT(DISTINCT column_name) as distinct_count
FROM <table>sql
SELECT
MIN(column_name) as min_val,
MAX(column_name) as max_val,
AVG(column_name) as avg_val,
STDDEV(column_name) as std_dev,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY column_name) as median,
SUM(CASE WHEN column_name IS NULL THEN 1 ELSE 0 END) as null_count,
COUNT(DISTINCT column_name) as distinct_count
FROM <table>String Columns
字符串型列
sql
SELECT
MIN(LEN(column_name)) as min_length,
MAX(LEN(column_name)) as max_length,
AVG(LEN(column_name)) as avg_length,
SUM(CASE WHEN column_name IS NULL OR column_name = '' THEN 1 ELSE 0 END) as empty_count,
COUNT(DISTINCT column_name) as distinct_count
FROM <table>sql
SELECT
MIN(LEN(column_name)) as min_length,
MAX(LEN(column_name)) as max_length,
AVG(LEN(column_name)) as avg_length,
SUM(CASE WHEN column_name IS NULL OR column_name = '' THEN 1 ELSE 0 END) as empty_count,
COUNT(DISTINCT column_name) as distinct_count
FROM <table>Date/Timestamp Columns
日期/时间戳列
sql
SELECT
MIN(column_name) as earliest,
MAX(column_name) as latest,
DATEDIFF('day', MIN(column_name), MAX(column_name)) as date_range_days,
SUM(CASE WHEN column_name IS NULL THEN 1 ELSE 0 END) as null_count
FROM <table>sql
SELECT
MIN(column_name) as earliest,
MAX(column_name) as latest,
DATEDIFF('day', MIN(column_name), MAX(column_name)) as date_range_days,
SUM(CASE WHEN column_name IS NULL THEN 1 ELSE 0 END) as null_count
FROM <table>Step 4: Cardinality Analysis
步骤4:基数分析
For columns that look like categorical/dimension keys:
sql
SELECT
column_name,
COUNT(*) as frequency,
ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER(), 2) as percentage
FROM <table>
GROUP BY column_name
ORDER BY frequency DESC
LIMIT 20This reveals:
- High-cardinality columns (likely IDs or unique values)
- Low-cardinality columns (likely categories or status fields)
- Skewed distributions (one value dominates)
针对看起来是分类/维度键的列:
sql
SELECT
column_name,
COUNT(*) as frequency,
ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER(), 2) as percentage
FROM <table>
GROUP BY column_name
ORDER BY frequency DESC
LIMIT 20这可以揭示:
- 高基数列(可能是ID或唯一值)
- 低基数列(可能是分类或状态字段)
- 倾斜分布(某个值占主导)
Step 5: Sample Data
步骤5:样本数据
Get representative rows:
sql
SELECT *
FROM <table>
LIMIT 10If the table is large and you want variety, sample from different time periods or categories.
获取具有代表性的行:
sql
SELECT *
FROM <table>
LIMIT 10如果表很大且希望获取多样的样本,可以从不同时间段或分类中抽样。
Step 6: Data Quality Assessment
步骤6:数据质量评估
Summarize quality across dimensions:
从多个维度总结数据质量:
Completeness
完整性
- Which columns have NULLs? What percentage?
- Are NULLs expected or problematic?
- 哪些列存在NULL值?占比多少?
- NULL值是预期的还是有问题的?
Uniqueness
唯一性
- Does the apparent primary key have duplicates?
- Are there unexpected duplicate rows?
- 表面上的主键是否存在重复值?
- 是否存在意外的重复行?
Freshness
新鲜度
- When was data last updated? (MAX of timestamp columns)
- Is the update frequency as expected?
- 数据最后更新时间是什么时候?(时间戳列的MAX值)
- 更新频率是否符合预期?
Validity
有效性
- Are there values outside expected ranges?
- Are there invalid formats (dates, emails, etc.)?
- Are there orphaned foreign keys?
- 是否存在超出预期范围的值?
- 是否存在无效格式(日期、邮箱等)?
- 是否存在孤立的外键?
Consistency
一致性
- Do related columns make sense together?
- Are there logical contradictions?
- 相关列的组合是否合理?
- 是否存在逻辑矛盾?
Step 7: Output Summary
步骤7:输出总结报告
Provide a structured profile:
提供结构化的剖析报告:
Overview
概述
2-3 sentences describing what this table contains, who uses it, and how fresh it is.
用2-3句话描述该表包含的内容、使用者以及数据的新鲜度。
Schema
模式
| Column | Type | Nulls% | Distinct | Description |
|---|---|---|---|---|
| ... | ... | ... | ... | ... |
| 列名 | 类型 | NULL值占比 | 唯一值数量 | 描述 |
|---|---|---|---|---|
| ... | ... | ... | ... | ... |
Key Statistics
关键统计信息
- Row count: X
- Date range: Y to Z
- Last updated: timestamp
- 行数:X
- 日期范围:Y至Z
- 最后更新时间:时间戳
Data Quality Score
数据质量评分
- Completeness: X/10
- Uniqueness: X/10
- Freshness: X/10
- Overall: X/10
- 完整性:X/10
- 唯一性:X/10
- 新鲜度:X/10
- 总体评分:X/10
Potential Issues
潜在问题
List any data quality concerns discovered.
列出发现的所有数据质量问题。
Recommended Queries
推荐查询
3-5 useful queries for common questions about this data.
提供3-5个针对该数据常见问题的实用查询语句。