data-profiler

Data Profiler Skill

Overview

This skill provides comprehensive data profiling capabilities at Basic and Intermediate levels. It analyzes datasets to extract schema information, statistical summaries, data quality metrics, distributions, uniqueness characteristics, and detected patterns.

Profiling Levels

Basic Level

  • Row count and column count
  • Column names and data types
  • Null/missing value counts and percentages
  • Min, max, mean, median for numeric columns
  • String length statistics for text columns

Intermediate Level

  • Distributions: Value frequencies, histograms, percentiles
  • Uniqueness: Distinct counts, cardinality, duplicate detection
  • Pattern Detection: Date/time formats, regex patterns, data format identification
  • Quality Metrics: Completeness scores, consistency indicators

Supported Formats

  • CSV (Comma-Separated Values)
  • JSON (JSON Lines and standard JSON)
  • Parquet (Apache Parquet columnar format)
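
For example, each of these formats can be read into a pandas DataFrame before profiling. The helper below is a minimal, illustrative sketch (the load_dataset name is hypothetical, and Parquet support assumes a pyarrow or fastparquet engine is installed):
python
from pathlib import Path

import pandas as pd

def load_dataset(path: str) -> pd.DataFrame:
    """Load a CSV, JSON/JSON Lines, or Parquet file into a DataFrame (illustrative sketch)."""
    suffix = Path(path).suffix.lower()
    if suffix == '.csv':
        return pd.read_csv(path)
    if suffix in ('.json', '.jsonl'):
        # JSON Lines needs lines=True; standard JSON documents do not
        return pd.read_json(path, lines=(suffix == '.jsonl'))
    if suffix == '.parquet':
        return pd.read_parquet(path)
    raise ValueError(f"Unsupported format: {suffix}")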

When to Use

  • User requests data profiling or data analysis
  • Need to understand dataset structure and quality
  • Discovering schema for unknown datasets
  • Assessing data quality before ingestion
  • Identifying data issues (nulls, duplicates, anomalies)
  • Creating dataset specifications or documentation

Core Profiling Functions

1. Schema Detection

python
import pandas as pd

def detect_schema(df: pd.DataFrame) -> dict:
    """
    Detect schema information from DataFrame.

    Returns:
        Dictionary with column names, types, nullable status
    """
    schema = {
        'columns': []
    }

    for col in df.columns:
        col_info = {
            'name': col,
            'type': str(df[col].dtype),
            'nullable': bool(df[col].isnull().any()),
            'null_count': int(df[col].isnull().sum()),
            'null_percentage': float(df[col].isnull().sum() / len(df) * 100) if len(df) > 0 else 0.0
        }

        # Infer semantic type
        if pd.api.types.is_numeric_dtype(df[col]):
            col_info['semantic_type'] = 'numeric'
        elif pd.api.types.is_datetime64_dtype(df[col]):
            col_info['semantic_type'] = 'datetime'
        elif pd.api.types.is_bool_dtype(df[col]):
            col_info['semantic_type'] = 'boolean'
        else:
            col_info['semantic_type'] = 'string'

        schema['columns'].append(col_info)

    return schema
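
A quick usage sketch (the toy DataFrame below is made up purely for illustration):
python
# Illustrative only: exercise detect_schema on a tiny DataFrame
df = pd.DataFrame({
    'customer_id': ['a1', 'a2', 'a3'],
    'revenue': [10.0, None, 7.5]
})

schema = detect_schema(df)
for col in schema['columns']:
    print(col['name'], col['semantic_type'], f"{col['null_percentage']:.1f}% null")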

2. Basic Statistics

python
def get_basic_statistics(df: pd.DataFrame) -> dict:
    """
    Get basic statistical summary.

    Returns:
        Dictionary with row count, column count, and column stats
    """
    stats = {
        'row_count': len(df),
        'column_count': len(df.columns),
        'columns': {}
    }

    for col in df.columns:
        col_stats = {}

        if pd.api.types.is_numeric_dtype(df[col]):
            col_stats = {
                'min': float(df[col].min()) if not df[col].isna().all() else None,
                'max': float(df[col].max()) if not df[col].isna().all() else None,
                'mean': float(df[col].mean()) if not df[col].isna().all() else None,
                'median': float(df[col].median()) if not df[col].isna().all() else None,
                'std': float(df[col].std()) if not df[col].isna().all() else None
            }
        elif pd.api.types.is_string_dtype(df[col]) or df[col].dtype == object:
            non_null = df[col].dropna()
            if len(non_null) > 0:
                col_stats = {
                    'min_length': int(non_null.astype(str).str.len().min()),
                    'max_length': int(non_null.astype(str).str.len().max()),
                    'avg_length': float(non_null.astype(str).str.len().mean())
                }

        stats['columns'][col] = col_stats

    return stats
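
A brief illustrative call (toy data only), showing that text columns get length statistics while numeric columns get min/max/mean/median/std:
python
# Illustrative only: mixed string and numeric columns
df = pd.DataFrame({
    'city': ['Oslo', 'Lisbon', 'Buenos Aires'],
    'population': [709000, 545000, 3120000]
})

stats = get_basic_statistics(df)
print(stats['columns']['city'])        # min_length / max_length / avg_length
print(stats['columns']['population'])  # min / max / mean / median / std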

3. Distribution Analysis

python
def analyze_distributions(df: pd.DataFrame, max_unique: int = 50) -> dict:
    """
    Analyze value distributions for each column.

    Args:
        df: DataFrame to analyze
        max_unique: Max unique values to show frequencies for

    Returns:
        Dictionary with distribution info per column
    """
    distributions = {}

    for col in df.columns:
        dist_info = {}

        # Value counts
        value_counts = df[col].value_counts()
        unique_count = len(value_counts)

        dist_info['unique_count'] = unique_count
        dist_info['unique_percentage'] = (unique_count / len(df)) * 100

        # Show top values if not too many unique
        if unique_count <= max_unique:
            dist_info['value_frequencies'] = {
                str(k): int(v) for k, v in value_counts.head(20).items()
            }

        # Percentiles for numeric columns
        if pd.api.types.is_numeric_dtype(df[col]):
            percentiles = df[col].quantile([0.25, 0.50, 0.75, 0.90, 0.95, 0.99])
            dist_info['percentiles'] = {
                f'p{int(p*100)}': float(v) for p, v in percentiles.items()
            }

        distributions[col] = dist_info

    return distributions
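
As a small illustration of the output shape, percentiles come back keyed p25, p50, and so on for numeric columns (the sample values are made up):
python
# Illustrative only: distribution info for a small numeric column
df = pd.DataFrame({'score': [1, 2, 2, 3, 5, 8, 13, 21]})

dist = analyze_distributions(df)
print(dist['score']['unique_count'])        # 7 distinct values
print(dist['score']['percentiles']['p50'])  # the column's median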

4. Uniqueness Analysis

python
def analyze_uniqueness(df: pd.DataFrame) -> dict:
    """
    Analyze uniqueness characteristics of columns.

    Returns:
        Dictionary with uniqueness metrics per column
    """
    uniqueness = {}

    for col in df.columns:
        unique_info = {}

        total_count = len(df[col])
        non_null_count = df[col].notna().sum()
        unique_count = df[col].nunique()
        duplicate_count = total_count - unique_count

        unique_info['distinct_count'] = unique_count
        unique_info['duplicate_count'] = duplicate_count
        unique_info['uniqueness_ratio'] = unique_count / total_count if total_count > 0 else 0
        unique_info['is_unique_key'] = (unique_count == non_null_count)
        unique_info['has_duplicates'] = (duplicate_count > 0)

        # Find duplicated values (top 10)
        if duplicate_count > 0:
            duplicates = df[col].value_counts()
            duplicates = duplicates[duplicates > 1].head(10)
            unique_info['top_duplicates'] = {
                str(k): int(v) for k, v in duplicates.items()
            }

        uniqueness[col] = unique_info

    return uniqueness
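
One way these metrics can be used, sketched below under the assumption that df is an already-loaded DataFrame, is to flag columns that look like candidate primary keys (the 0.99 threshold is an illustrative choice):
python
# Illustrative sketch: list columns whose non-null values are all distinct
uniqueness = analyze_uniqueness(df)
candidate_keys = [
    col for col, info in uniqueness.items()
    if info['is_unique_key'] and info['uniqueness_ratio'] > 0.99
]
print(f"Candidate key columns: {candidate_keys}")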

5. Pattern Detection

python
import re

def detect_patterns(df: pd.DataFrame) -> dict:
    """
    Detect common patterns in string columns.

    Returns:
        Dictionary with detected patterns per column
    """
    patterns = {}

    # Common regex patterns
    pattern_regexes = {
        'email': r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$',
        'phone_us': r'^\(?[0-9]{3}\)?[-.\s]?[0-9]{3}[-.\s]?[0-9]{4}$',
        'url': r'^https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b',
        'ipv4': r'^(?:[0-9]{1,3}\.){3}[0-9]{1,3}$',
        'date_iso': r'^\d{4}-\d{2}-\d{2}',
        'date_us': r'^\d{1,2}\/\d{1,2}\/\d{2,4}$',
        'uuid': r'^[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12}$',
        'numeric': r'^\d+(\.\d+)?$',
        'alphanumeric': r'^[a-zA-Z0-9]+$'
    }

    for col in df.columns:
        if pd.api.types.is_string_dtype(df[col]) or df[col].dtype == object:
            col_patterns = {}
            non_null = df[col].dropna().astype(str)

            if len(non_null) == 0:
                continue

            # Test each pattern
            for pattern_name, pattern_regex in pattern_regexes.items():
                matches = non_null.str.match(pattern_regex, flags=re.IGNORECASE)
                match_count = matches.sum()
                match_percentage = (match_count / len(non_null)) * 100

                if match_percentage > 50:  # If >50% match, consider it a pattern
                    col_patterns[pattern_name] = {
                        'match_count': int(match_count),
                        'match_percentage': float(match_percentage)
                    }

            if col_patterns:
                patterns[col] = col_patterns

    return patterns
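
For instance, a column where most values look like email addresses is reported under the email pattern once it clears the 50% threshold (sample values are fabricated):
python
# Illustrative only: two of three values match the email pattern (~66.7%)
df = pd.DataFrame({'contact': ['a@example.com', 'b@example.org', 'not-an-email']})

patterns = detect_patterns(df)
print(patterns.get('contact', {}))
# Expected: the 'email' pattern with match_count 2 and match_percentage ~66.7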

6. Data Quality Metrics

python
def calculate_quality_metrics(df: pd.DataFrame) -> dict:
    """
    Calculate overall data quality metrics.

    Returns:
        Dictionary with quality scores and metrics
    """
    quality = {
        'overall_completeness': 0.0,
        'columns': {}
    }

    total_cells = len(df) * len(df.columns)
    non_null_cells = df.notna().sum().sum()
    quality['overall_completeness'] = (non_null_cells / total_cells) * 100 if total_cells > 0 else 0

    for col in df.columns:
        col_quality = {}

        # Completeness
        total_count = len(df[col])
        non_null_count = df[col].notna().sum()
        col_quality['completeness'] = (non_null_count / total_count) * 100 if total_count > 0 else 0

        # Consistency (e.g., no mixed types)
        if pd.api.types.is_numeric_dtype(df[col]):
            col_quality['type_consistency'] = 100.0  # All numeric
        elif pd.api.types.is_string_dtype(df[col]) or df[col].dtype == object:
            # Check if all non-null values are same type
            non_null = df[col].dropna()
            if len(non_null) > 0:
                types = non_null.apply(type)
                col_quality['type_consistency'] = (types.value_counts().max() / len(non_null)) * 100

        quality['columns'][col] = col_quality

    return quality
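
A short illustrative check of the scores (data is made up): one missing value out of six cells yields roughly 83.3% overall completeness.
python
# Illustrative only: completeness at the dataset and column level
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', None, 'z']})

quality = calculate_quality_metrics(df)
print(f"{quality['overall_completeness']:.1f}%")          # 83.3%
print(f"{quality['columns']['b']['completeness']:.1f}%")  # 66.7%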

Complete Profiling Example

python
import pandas as pd
from typing import Dict, Any

def profile_dataset(df: pd.DataFrame) -> Dict[str, Any]:
    """
    Complete dataset profiling with all metrics.

    Args:
        df: pandas DataFrame to profile

    Returns:
        Comprehensive profiling results dictionary
    """
    profile = {
        'metadata': {
            'row_count': len(df),
            'column_count': len(df.columns),
            'memory_usage_mb': df.memory_usage(deep=True).sum() / (1024 * 1024)
        },
        'schema': detect_schema(df),
        'statistics': get_basic_statistics(df),
        'distributions': analyze_distributions(df),
        'uniqueness': analyze_uniqueness(df),
        'patterns': detect_patterns(df),
        'quality': calculate_quality_metrics(df)
    }

    return profile

Usage Example

if name == "main": # Load dataset df = pd.read_csv("data.csv")
# Profile
profile_results = profile_dataset(df)

# Print summary
print(f"Dataset Profile:")
print(f"  Rows: {profile_results['metadata']['row_count']}")
print(f"  Columns: {profile_results['metadata']['column_count']}")
print(f"  Completeness: {profile_results['quality']['overall_completeness']:.2f}%")
undefined
if name == "main": # Load dataset df = pd.read_csv("data.csv")
# Profile
profile_results = profile_dataset(df)

# Print summary
print(f"Dataset Profile:")
print(f"  Rows: {profile_results['metadata']['row_count']}")
print(f"  Columns: {profile_results['metadata']['column_count']}")
print(f"  Completeness: {profile_results['quality']['overall_completeness']:.2f}%")
undefined

Best Practices

  1. Sample Large Datasets: For datasets >1M rows, profile a representative sample (see the sampling sketch after this list)
  2. Handle Missing Data: Check for nulls before calculating statistics
  3. Type Inference: Use semantic types, not just Python/pandas types
  4. Memory Management: Be cautious with very wide datasets (many columns)
  5. Percentiles: Use percentiles to understand distribution better than mean
  6. Pattern Detection: Test patterns on non-null values only
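
A minimal sampling sketch for the first practice, assuming df is the full DataFrame and a reproducible random sample is acceptable (the 100,000-row cap is an arbitrary illustrative choice):
python
# Illustrative sketch: profile a sample when the dataset is large
SAMPLE_ROWS = 100_000  # arbitrary cap for illustration

sample = df.sample(n=min(SAMPLE_ROWS, len(df)), random_state=42)
profile_results = profile_dataset(sample)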

Output Format for Personas

When using this skill, output results in a structured format:
yaml
dataset_profile:
  metadata:
    row_count: 10000
    column_count: 15
    memory_usage_mb: 2.5

  schema:
    columns:
      - name: customer_id
        type: string
        nullable: false
        null_count: 0
        semantic_type: string

      - name: revenue
        type: float64
        nullable: true
        null_count: 250
        null_percentage: 2.5
        semantic_type: numeric

  quality:
    overall_completeness: 95.5
    columns:
      customer_id:
        completeness: 100.0
        type_consistency: 100.0
      revenue:
        completeness: 97.5
        type_consistency: 100.0

  recommendations:
    - "customer_id has 98% uniqueness - suitable as primary key"
    - "revenue has 2.5% nulls - consider default value or filtering"
    - "date column matches ISO 8601 format - ready for datetime conversion"

Integration with Other Skills

  • azure-blob-storage: Load data from Azure before profiling (a loading sketch follows below)
  • databricks-query: Profile query results from Unity Catalog
  • Use profiling results to inform Data Ingestion Agent about transformations needed
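
For the azure-blob-storage integration, a minimal sketch might look like the following, assuming the azure-storage-blob package is installed and a connection string is available (container and blob names are placeholders):
python
import io

import pandas as pd
from azure.storage.blob import BlobServiceClient

# Placeholder connection details for illustration only
service = BlobServiceClient.from_connection_string("<connection-string>")
blob = service.get_blob_client(container="raw-data", blob="data.csv")

# Download the blob into memory, load it, and profile it
df = pd.read_csv(io.BytesIO(blob.download_blob().readall()))
profile_results = profile_dataset(df)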