data-quality-checker

Data Quality Checker

Implement comprehensive data quality checks and validation.

Quick Start

Use Great Expectations for validation, implement schema checks, monitor data quality metrics, and set up alerts.

Instructions

Great Expectations Setup

python
# These calls follow the pre-1.0 Great Expectations API; newer releases may differ.
import great_expectations as gx

context = gx.get_context()

# Create expectation suite
suite = context.add_expectation_suite("data_quality_suite")

# Add expectations
# NOTE: batch_request is not defined in this snippet; build it from your
# datasource before calling get_validator.
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="data_quality_suite"
)

# Schema validation
validator.expect_table_columns_to_match_ordered_list(
    column_list=["id", "name", "email", "created_at"]
)

# Null checks
validator.expect_column_values_to_not_be_null("email")

# Value ranges (assumes the data has an "age" column, which the schema list above omits)
validator.expect_column_values_to_be_between("age", min_value=0, max_value=120)

# Uniqueness
validator.expect_column_values_to_be_unique("email")

# Run validation
results = validator.validate()
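
To make the meaning of the schema, null, and uniqueness expectations concrete, the sketch below applies the same three checks to a plain list of dictionaries. It does not use Great Expectations; the helper `check_rows` and the sample rows are hypothetical and exist only for illustration.

```python
from typing import Any

EXPECTED_COLUMNS = ["id", "name", "email", "created_at"]


def check_rows(rows: list[dict[str, Any]]) -> dict[str, bool]:
    """Apply schema, null, and uniqueness checks to rows (hypothetical helper)."""
    emails = [row.get("email") for row in rows]
    non_null_emails = [e for e in emails if e is not None]
    return {
        # Schema: every row has exactly the expected columns, in order.
        "columns_match": all(list(row.keys()) == EXPECTED_COLUMNS for row in rows),
        # Null check: no row is missing an email.
        "email_not_null": len(non_null_emails) == len(emails),
        # Uniqueness: no non-null email appears more than once.
        "email_unique": len(set(non_null_emails)) == len(non_null_emails),
    }


# Made-up sample data with one null email and one repeated email.
sample_rows = [
    {"id": 1, "name": "Ada", "email": "ada@example.com", "created_at": "2024-01-01"},
    {"id": 2, "name": "Bob", "email": None, "created_at": "2024-01-02"},
    {"id": 3, "name": "Cy", "email": "ada@example.com", "created_at": "2024-01-03"},
]
report = check_rows(sample_rows)
```

With this sample, the schema check passes while the null and uniqueness checks fail, which is the kind of per-expectation outcome a validation run reports.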

Custom Validation Rules

python
from datetime import datetime

def validate_data_quality(df):
    """Return a list of human-readable data quality issues found in df."""
    issues = []
    
    # Check for nulls (report only the columns that contain any)
    null_counts = df.isnull().sum()
    if null_counts.any():
        issues.append(f"Null values found: {null_counts[null_counts > 0].to_dict()}")
    
    # Check for duplicates
    duplicates = df.duplicated().sum()
    if duplicates > 0:
        issues.append(f"Found {duplicates} duplicate rows")
    
    # Check data freshness (assumes 'created_at' holds timezone-naive datetimes)
    max_date = df['created_at'].max()
    if (datetime.now() - max_date).days > 1:
        issues.append("Data is stale")
    
    return issues
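
One way to act on the list returned by `validate_data_quality` is to fail the pipeline step whenever it is non-empty. The following is a minimal sketch, not part of the original: the `DataQualityError` exception and the `enforce_quality` helper are hypothetical names.

```python
import logging

logger = logging.getLogger("data_quality")


class DataQualityError(Exception):
    """Raised when one or more data quality issues are found (hypothetical)."""


def enforce_quality(issues: list[str]) -> None:
    """Log every issue, then raise if any were reported."""
    for issue in issues:
        logger.warning("Data quality issue: %s", issue)
    if issues:
        raise DataQualityError(
            f"{len(issues)} data quality issue(s): " + "; ".join(issues)
        )


# Usage with a hand-written list; in practice this would be the list
# returned by validate_data_quality(df).
try:
    enforce_quality(["Found 2 duplicate rows", "Data is stale"])
except DataQualityError as exc:
    message = str(exc)
```

Raising an exception keeps the failure visible to whatever scheduler runs the job, while the log lines preserve each individual issue.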

Data Quality Metrics

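python
from datetime import datetime

def calculate_quality_metrics(df):
    return {
        # Share of non-null cells across the whole frame
        'completeness': 1 - (df.isnull().sum().sum() / df.size),
        # Share of rows that remain after dropping exact duplicates
        'uniqueness': df.drop_duplicates().shape[0] / df.shape[0],
        # Share of rows whose email contains '@' (nulls count as invalid)
        'validity': (df['email'].str.contains('@', na=False).sum() / len(df)),
        # Age of the newest record, in whole days
        'timeliness': (datetime.now() - df['created_at'].max()).days
    }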
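
The metrics dictionary becomes actionable once each value is compared against a threshold. The sketch below shows one possible way to do that; the threshold values, `MAX_AGE_DAYS`, and the helper `find_threshold_breaches` are assumptions for illustration, not values from the original.

```python
# Hypothetical thresholds; tune them for your own data.
THRESHOLDS = {
    "completeness": 0.95,  # minimum share of non-null cells
    "uniqueness": 0.99,    # minimum share of distinct rows
    "validity": 0.98,      # minimum share of emails containing "@"
}
MAX_AGE_DAYS = 1  # maximum allowed value of the "timeliness" metric


def find_threshold_breaches(metrics: dict[str, float]) -> list[str]:
    """Return one message per metric that violates its threshold."""
    breaches = []
    for name, minimum in THRESHOLDS.items():
        if metrics[name] < minimum:
            breaches.append(f"{name}={metrics[name]:.2f} is below {minimum:.2f}")
    if metrics["timeliness"] > MAX_AGE_DAYS:
        breaches.append(
            f"timeliness={metrics['timeliness']} days exceeds {MAX_AGE_DAYS}"
        )
    return breaches


# Made-up metric values in the shape returned by calculate_quality_metrics.
example_metrics = {
    "completeness": 0.97,
    "uniqueness": 0.90,
    "validity": 1.0,
    "timeliness": 3,
}
breaches = find_threshold_breaches(example_metrics)
```

Here the uniqueness and timeliness metrics breach their limits, so `breaches` holds two messages that could be forwarded to an alerting channel.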

Best Practices

  • Validate at ingestion
  • Monitor quality metrics
  • Set up alerts for failures
  • Document quality rules
  • Run regular quality audits
  • Track quality trends
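
Tracking quality trends can start as simply as keeping the metrics from each run and comparing the latest run with the previous one. This is a minimal sketch under that assumption; the helper `metric_deltas` and the sample history are hypothetical.

```python
def metric_deltas(history: list[dict[str, float]]) -> dict[str, float]:
    """Return the change in each metric between the last two runs."""
    if len(history) < 2:
        return {}  # not enough runs to compare
    previous, latest = history[-2], history[-1]
    return {
        name: round(latest[name] - previous[name], 4)
        for name in latest
        if name in previous
    }


# Made-up metrics from two consecutive runs, oldest first.
history = [
    {"completeness": 0.99, "uniqueness": 1.0},
    {"completeness": 0.95, "uniqueness": 1.0},
]
deltas = metric_deltas(history)
```

A negative delta, such as the drop in completeness here, flags a metric that is degrading even if it has not yet crossed an alert threshold.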