profiling-tables

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

Data Profile

数据剖析

Generate a comprehensive profile of a table that a new team member could use to understand the data.
生成一份全面的表剖析报告,供新团队成员了解数据。

Step 1: Basic Metadata

步骤1:基础元数据

Query column metadata:
sql
SELECT COLUMN_NAME, DATA_TYPE, COMMENT
FROM <database>.INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = '<schema>' AND TABLE_NAME = '<table>'
ORDER BY ORDINAL_POSITION
If the table name isn't fully qualified, search INFORMATION_SCHEMA.TABLES to locate it first.
查询列元数据:
sql
SELECT COLUMN_NAME, DATA_TYPE, COMMENT
FROM <database>.INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = '<schema>' AND TABLE_NAME = '<table>'
ORDER BY ORDINAL_POSITION
如果表名不是全限定名,请先搜索INFORMATION_SCHEMA.TABLES来定位它。

Step 2: Size and Shape

步骤2:数据规模与形态

Run via
run_sql
:
sql
SELECT
    COUNT(*) as total_rows,
    COUNT(*) / 1000000.0 as millions_of_rows
FROM <table>
通过
run_sql
执行:
sql
SELECT
    COUNT(*) as total_rows,
    COUNT(*) / 1000000.0 as millions_of_rows
FROM <table>

Step 3: Column-Level Statistics

步骤3:列级统计信息

For each column, gather appropriate statistics based on data type:
针对每一列,根据数据类型收集相应的统计信息:

Numeric Columns

数值型列

sql
SELECT
    MIN(column_name) as min_val,
    MAX(column_name) as max_val,
    AVG(column_name) as avg_val,
    STDDEV(column_name) as std_dev,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY column_name) as median,
    SUM(CASE WHEN column_name IS NULL THEN 1 ELSE 0 END) as null_count,
    COUNT(DISTINCT column_name) as distinct_count
FROM <table>
sql
SELECT
    MIN(column_name) as min_val,
    MAX(column_name) as max_val,
    AVG(column_name) as avg_val,
    STDDEV(column_name) as std_dev,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY column_name) as median,
    SUM(CASE WHEN column_name IS NULL THEN 1 ELSE 0 END) as null_count,
    COUNT(DISTINCT column_name) as distinct_count
FROM <table>

String Columns

字符串型列

sql
SELECT
    MIN(LEN(column_name)) as min_length,
    MAX(LEN(column_name)) as max_length,
    AVG(LEN(column_name)) as avg_length,
    SUM(CASE WHEN column_name IS NULL OR column_name = '' THEN 1 ELSE 0 END) as empty_count,
    COUNT(DISTINCT column_name) as distinct_count
FROM <table>
sql
SELECT
    MIN(LEN(column_name)) as min_length,
    MAX(LEN(column_name)) as max_length,
    AVG(LEN(column_name)) as avg_length,
    SUM(CASE WHEN column_name IS NULL OR column_name = '' THEN 1 ELSE 0 END) as empty_count,
    COUNT(DISTINCT column_name) as distinct_count
FROM <table>

Date/Timestamp Columns

日期/时间戳列

sql
SELECT
    MIN(column_name) as earliest,
    MAX(column_name) as latest,
    DATEDIFF('day', MIN(column_name), MAX(column_name)) as date_range_days,
    SUM(CASE WHEN column_name IS NULL THEN 1 ELSE 0 END) as null_count
FROM <table>
sql
SELECT
    MIN(column_name) as earliest,
    MAX(column_name) as latest,
    DATEDIFF('day', MIN(column_name), MAX(column_name)) as date_range_days,
    SUM(CASE WHEN column_name IS NULL THEN 1 ELSE 0 END) as null_count
FROM <table>

Step 4: Cardinality Analysis

步骤4:基数分析

For columns that look like categorical/dimension keys:
sql
SELECT
    column_name,
    COUNT(*) as frequency,
    ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER(), 2) as percentage
FROM <table>
GROUP BY column_name
ORDER BY frequency DESC
LIMIT 20
This reveals:
  • High-cardinality columns (likely IDs or unique values)
  • Low-cardinality columns (likely categories or status fields)
  • Skewed distributions (one value dominates)
针对看起来是分类/维度键的列:
sql
SELECT
    column_name,
    COUNT(*) as frequency,
    ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER(), 2) as percentage
FROM <table>
GROUP BY column_name
ORDER BY frequency DESC
LIMIT 20
这可以揭示:
  • 高基数列(可能是ID或唯一值)
  • 低基数列(可能是分类或状态字段)
  • 倾斜分布(某个值占主导)

Step 5: Sample Data

步骤5:样本数据

Get representative rows:
sql
SELECT *
FROM <table>
LIMIT 10
If the table is large and you want variety, sample from different time periods or categories.
获取具有代表性的行:
sql
SELECT *
FROM <table>
LIMIT 10
如果表很大且希望获取多样的样本,可以从不同时间段或分类中抽样。

Step 6: Data Quality Assessment

步骤6:数据质量评估

Summarize quality across dimensions:
从多个维度总结数据质量:

Completeness

完整性

  • Which columns have NULLs? What percentage?
  • Are NULLs expected or problematic?
  • 哪些列存在NULL值?占比多少?
  • NULL值是预期的还是有问题的?

Uniqueness

唯一性

  • Does the apparent primary key have duplicates?
  • Are there unexpected duplicate rows?
  • 表面上的主键是否存在重复值?
  • 是否存在意外的重复行?

Freshness

新鲜度

  • When was data last updated? (MAX of timestamp columns)
  • Is the update frequency as expected?
  • 数据最后更新时间是什么时候?(时间戳列的MAX值)
  • 更新频率是否符合预期?

Validity

有效性

  • Are there values outside expected ranges?
  • Are there invalid formats (dates, emails, etc.)?
  • Are there orphaned foreign keys?
  • 是否存在超出预期范围的值?
  • 是否存在无效格式(日期、邮箱等)?
  • 是否存在孤立的外键?

Consistency

一致性

  • Do related columns make sense together?
  • Are there logical contradictions?
  • 相关列的组合是否合理?
  • 是否存在逻辑矛盾?

Step 7: Output Summary

步骤7:输出总结报告

Provide a structured profile:
提供结构化的剖析报告:

Overview

概述

2-3 sentences describing what this table contains, who uses it, and how fresh it is.
用2-3句话描述该表包含的内容、使用者以及数据的新鲜度。

Schema

模式

ColumnTypeNulls%DistinctDescription
...............
列名类型NULL值占比唯一值数量描述
...............

Key Statistics

关键统计信息

  • Row count: X
  • Date range: Y to Z
  • Last updated: timestamp
  • 行数:X
  • 日期范围:Y至Z
  • 最后更新时间:时间戳

Data Quality Score

数据质量评分

  • Completeness: X/10
  • Uniqueness: X/10
  • Freshness: X/10
  • Overall: X/10
  • 完整性:X/10
  • 唯一性:X/10
  • 新鲜度:X/10
  • 总体评分:X/10

Potential Issues

潜在问题

List any data quality concerns discovered.
列出发现的所有数据质量问题。

Recommended Queries

推荐查询

3-5 useful queries for common questions about this data.
提供3-5个针对该数据常见问题的实用查询语句。