data-quality-checker
Data Quality Checker
Implement comprehensive data quality checks and validation.
Quick Start
Use Great Expectations for validation, implement schema checks, monitor data quality metrics, and set up alerts for failures.
Instructions
Great Expectations Setup
```python
import great_expectations as gx

# These calls are written against the pre-1.0 Great Expectations API;
# check them against your installed version.
context = gx.get_context()

# Create expectation suite
suite = context.add_expectation_suite("data_quality_suite")

# Add expectations
# batch_request is assumed to be defined already for your data source
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="data_quality_suite"
)

# Schema validation ("age" is listed because the range check below needs it)
validator.expect_table_columns_to_match_ordered_list(
    column_list=["id", "name", "email", "age", "created_at"]
)

# Null checks
validator.expect_column_values_to_not_be_null("email")

# Value ranges
validator.expect_column_values_to_be_between("age", min_value=0, max_value=120)

# Uniqueness
validator.expect_column_values_to_be_unique("email")

# Run validation
results = validator.validate()
```
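Where Great Expectations is not installed, the same four checks (schema, nulls, value range, uniqueness) can be approximated in plain pandas. The following is only a minimal sketch under stated assumptions: the function name `check_basic_expectations` and the sample frame are hypothetical, the frame is assumed to carry an `age` column alongside the four listed above, and the function assumes the `email` and `age` columns exist.

```python
import pandas as pd


def check_basic_expectations(df, expected_columns):
    """Return a dict mapping each check name to True (passed) or False (failed)."""
    return {
        # Schema: column names must match the expected list, in order
        "schema": list(df.columns) == list(expected_columns),
        # Nulls: every row must have an email
        "email_not_null": bool(df["email"].notna().all()),
        # Range: non-null ages must fall within [0, 120], bounds included
        "age_in_range": bool(df["age"].dropna().between(0, 120).all()),
        # Uniqueness: no non-null email may appear more than once
        "email_unique": bool(df["email"].dropna().is_unique),
    }


# Hypothetical sample data with one duplicate email and one out-of-range age
sample = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Ann", "Bob", "Cy"],
    "email": ["a@example.com", "b@example.com", "a@example.com"],
    "age": [34, 150, 28],
    "created_at": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
})

report = check_basic_expectations(
    sample, ["id", "name", "email", "age", "created_at"]
)
print(report)
```

On this sample the schema and null checks pass, while the range and uniqueness checks fail. A sketch like this returns only pass/fail flags; Great Expectations additionally records which values violated each expectation.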
Custom Validation Rules
```python
from datetime import datetime


def validate_data_quality(df):
    """Return a list of human-readable data quality issues found in df."""
    issues = []

    # Check for nulls
    null_counts = df.isnull().sum()
    if null_counts.any():
        issues.append(f"Null values found: {null_counts[null_counts > 0].to_dict()}")

    # Check for duplicates
    duplicates = df.duplicated().sum()
    if duplicates > 0:
        issues.append(f"Found {duplicates} duplicate rows")

    # Check data freshness (assumes created_at holds timezone-naive datetimes)
    max_date = df['created_at'].max()
    if (datetime.now() - max_date).days > 1:
        issues.append("Data is stale")

    return issues
```
Data Quality Metrics
```python
from datetime import datetime

def calculate_quality_metrics(df):
    return {
        # Share of cells that are not null
        'completeness': 1 - (df.isnull().sum().sum() / df.size),
        # Share of rows that remain after dropping exact duplicates
        'uniqueness': df.drop_duplicates().shape[0] / df.shape[0],
        # Share of rows whose email contains '@' (null emails count as invalid)
        'validity': df['email'].str.contains('@', na=False).sum() / len(df),
        # Whole days since the most recent created_at value
        'timeliness': (datetime.now() - df['created_at'].max()).days
    }
```
Best Practices
- Validate at ingestion
- Monitor quality metrics
- Set up alerts for failures
- Document quality rules
- Run regular quality audits
- Track quality trends
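The "set up alerts for failures" practice could be sketched as a threshold check over a metrics dictionary shaped like the one `calculate_quality_metrics` returns. This is an illustration under stated assumptions, not a prescribed implementation: the threshold values, the names `THRESHOLDS`, `MAX_STALENESS_DAYS`, `find_breaches`, and `alert_on_breaches`, and the example metrics are all hypothetical.

```python
import logging

logger = logging.getLogger("data_quality")

# Hypothetical thresholds; tune these to your own data contracts
THRESHOLDS = {
    "completeness": 0.95,  # minimum acceptable share of non-null cells
    "uniqueness": 0.99,    # minimum acceptable share of distinct rows
    "validity": 0.98,      # minimum acceptable share of valid emails
}
MAX_STALENESS_DAYS = 1     # maximum acceptable age of the newest record


def find_breaches(metrics):
    """Return a human-readable message for each metric outside its threshold."""
    breaches = []
    for name, minimum in THRESHOLDS.items():
        value = metrics[name]
        if value < minimum:
            breaches.append(f"{name} is {value:.2%}, below the {minimum:.0%} minimum")
    if metrics["timeliness"] > MAX_STALENESS_DAYS:
        breaches.append(
            f"newest record is {metrics['timeliness']} days old "
            f"(limit: {MAX_STALENESS_DAYS})"
        )
    return breaches


def alert_on_breaches(metrics, notify=logger.warning):
    """Send each breach to `notify` and report whether the data passed."""
    breaches = find_breaches(metrics)
    for message in breaches:
        notify(message)
    return not breaches


# Made-up example metrics with low uniqueness and stale data
example_metrics = {
    "completeness": 0.97,
    "uniqueness": 0.90,
    "validity": 0.99,
    "timeliness": 3,
}
passed = alert_on_breaches(example_metrics, notify=print)
```

Passing `notify` as a parameter keeps the check independent of the delivery channel: the same function can log locally during development and call an email or chat hook in production. Storing each run's metrics alongside a timestamp would also cover the "track quality trends" practice.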