Data Quality

When to Use

When the user is implementing data validation, setting up quality checks for a pipeline, defining data contracts between teams, or asks "how do I ensure data quality?" or "what checks should I add to this pipeline?". Also applies when investigating data anomalies or setting up data observability.

Instructions

1. Data Quality Dimensions

Evaluate data against six dimensions:
  • Completeness: Are all expected records and fields present? Measure: null rate per column, row count vs expected.
  • Accuracy: Do values reflect reality? Measure: spot checks against source of truth, known reference data.
  • Consistency: Do related fields agree? Measure: cross-field validation (end_date >= start_date, city matches zip code).
  • Timeliness: Is data available when needed? Measure: freshness (time since last update), SLA adherence.
  • Uniqueness: Are there duplicates? Measure: distinct count of natural keys vs total row count.
  • Validity: Do values conform to expected formats and ranges? Measure: regex matches, range checks, enum membership.
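
Several of these measures can be sketched in plain Python over a list of row dicts. The column names, sample rows, and regex below are illustrative assumptions, not taken from any real dataset:

```python
import re

# Hypothetical sample rows with deliberate quality problems
rows = [
    {"id": 1, "email": "a@example.com", "status": "active"},
    {"id": 2, "email": None,            "status": "active"},
    {"id": 2, "email": "c@example.com", "status": "unknown"},
]

# Completeness: null rate per column
def null_rate(col):
    return sum(r[col] is None for r in rows) / len(rows)

# Uniqueness: distinct natural keys vs total row count
distinct_ids = len({r["id"] for r in rows})

# Validity: format (regex match) and enum membership
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
valid_emails = sum(bool(r["email"] and EMAIL_RE.match(r["email"])) for r in rows)
valid_status = sum(r["status"] in {"active", "inactive"} for r in rows)
```

Here the email null rate is 1/3 (a completeness issue) and only 2 distinct ids cover 3 rows (a uniqueness issue); the same pattern extends to the other dimensions.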

2. Validation Rules

  • Schema validation: Column names, data types, nullable flags match the contract
  • Range checks: Numeric values within expected bounds (age 0-150, price >= 0)
  • Format checks: Dates in ISO 8601, emails match regex, phone numbers normalized
  • Referential integrity: Foreign keys resolve to existing records in parent tables
  • Business rules: Order total = sum of line items, start_date < end_date
  • Freshness checks: Last record timestamp within expected window
  • Categorize rules by severity: error (block pipeline) vs warning (log and continue)
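
One way to express such rules with severities is a small table of named predicates; the rule names, fields, and thresholds here are illustrative assumptions:

```python
from datetime import date

# Each rule: (name, predicate over a row, severity). Names and fields are hypothetical.
RULES = [
    ("price_non_negative", lambda r: r["price"] >= 0,                "error"),
    ("age_in_range",       lambda r: 0 <= r["age"] <= 150,           "error"),
    ("dates_consistent",   lambda r: r["start_date"] <= r["end_date"], "error"),
    ("status_known",       lambda r: r["status"] in {"new", "paid"}, "warning"),
]

def validate(row):
    failures = [(name, sev) for name, pred, sev in RULES if not pred(row)]
    # Any error-severity failure blocks the pipeline; warnings log and continue
    blocked = any(sev == "error" for _, sev in failures)
    return blocked, failures

row = {"price": -5, "age": 30, "status": "refunded",
       "start_date": date(2024, 1, 1), "end_date": date(2024, 2, 1)}
blocked, failures = validate(row)
```

With this sample row, the negative price is an error (the pipeline would stop) while the unknown status is only a warning.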

3. Anomaly Detection

  • Track key metrics over time: row counts, null rates, distinct value counts, mean/median
  • Alert when metrics deviate beyond a threshold (e.g., row count drops > 20% vs previous run)
  • Use statistical methods: z-score for normally distributed metrics, IQR for skewed
  • Monitor distribution shifts: column value distributions should be stable over time
  • Track schema changes: new columns, type changes, dropped columns
  • Check for sudden spikes in null rates or default values
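
The two alerting styles above (statistical deviation and percentage drop) can be sketched with the standard library; the example history and thresholds are assumptions:

```python
import statistics

def zscore_alert(history, current, threshold=2.0):
    """Alert when `current` deviates more than `threshold` sample standard
    deviations from the mean of `history` (for roughly normal metrics)."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return abs(current - mean) > threshold * stdev

def row_count_drop_alert(previous, current, max_drop=0.20):
    """Alert when the row count drops more than `max_drop` vs the previous run."""
    return current < previous * (1 - max_drop)
```

For skewed metrics an IQR-based variant (flag values outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR]) is the usual substitute for the z-score.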

4. Data Contracts

  • A data contract is a formal agreement between producer and consumer on data shape and quality
  • Define: schema (columns, types), SLAs (freshness, availability), quality thresholds, ownership
  • Version contracts: breaking changes require a new version with migration period
  • Contracts should be machine-readable (JSON Schema, protobuf, YAML) and enforced automatically
  • Include: who to contact when the contract is violated, escalation path
  • Review contracts quarterly or when business requirements change
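
A machine-readable contract would normally live in YAML or JSON Schema; as a rough sketch, here it is a Python dict with a hypothetical enforcement function (the fields and thresholds are invented for illustration):

```python
# Hypothetical contract; in practice this would be a versioned YAML/JSON file
CONTRACT = {
    "version": "1.2.0",
    "schema": {"user_id": "int", "event": "str"},
    "quality": {"max_null_rate": {"user_id": 0.01}},
    "owner": "events-team",
}

def check_contract(rows, contract):
    violations = []
    # Schema enforcement: non-null values must match the declared type
    for col, typename in contract["schema"].items():
        expected = {"int": int, "str": str}[typename]
        if any(r.get(col) is not None and not isinstance(r[col], expected) for r in rows):
            violations.append(f"type:{col}")
    # Quality thresholds: null rate per column must stay under the agreed limit
    for col, limit in contract["quality"]["max_null_rate"].items():
        rate = sum(r.get(col) is None for r in rows) / len(rows)
        if rate > limit:
            violations.append(f"null_rate:{col}")
    return violations

rows = [{"user_id": 1, "event": "click"}, {"user_id": None, "event": "view"}]
```

Running `check_contract` on each delivery, and notifying the owner on any non-empty violation list, is the automated-enforcement step the contract calls for.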

5. Great Expectations Patterns

  • Organize expectations into suites: one per table or pipeline stage
  • Core expectations to always include:
    • expect_table_row_count_to_be_between — catch empty or exploding tables
    • expect_column_values_to_not_be_null — for required columns
    • expect_column_values_to_be_unique — for natural keys
    • expect_column_values_to_be_in_set — for enum columns
  • Run validation as a pipeline step; fail the pipeline on critical expectation failures
  • Store validation results for trend analysis and audit trails
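
The semantics of the four core expectations are simple; as a pure-Python illustration only (the real Great Expectations API attaches these as methods on a validator and returns rich result objects, so treat this as a sketch of what each one checks, not as the library's API):

```python
# Stand-ins illustrating what each core expectation asserts over a list of row dicts
def expect_table_row_count_to_be_between(rows, min_value, max_value):
    return min_value <= len(rows) <= max_value

def expect_column_values_to_not_be_null(rows, column):
    return all(r[column] is not None for r in rows)

def expect_column_values_to_be_unique(rows, column):
    values = [r[column] for r in rows]
    return len(values) == len(set(values))

def expect_column_values_to_be_in_set(rows, column, value_set):
    return all(r[column] in value_set for r in rows)

rows = [{"id": 1, "status": "active"}, {"id": 2, "status": "inactive"}]
```

In a suite, each boolean result would instead be a validation result stored for trend analysis, with critical failures failing the pipeline step.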

6. Data Observability

  • Instrument pipelines with metadata: row counts, schema snapshots, execution time
  • Build a lineage graph: know where each dataset comes from and what depends on it
  • Dashboard key health indicators: freshness, volume trends, error rates per table
  • Set up alerts for: stale data, volume anomalies, schema drift, quality rule failures
  • Automate root cause analysis: when quality drops, trace back through lineage to find the source
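
The lineage-driven root cause step can be sketched as an upstream walk over a dependency graph; the dataset names and graph below are hypothetical:

```python
from collections import deque

# Hypothetical lineage graph: dataset -> its upstream source datasets
LINEAGE = {
    "daily_report": ["orders_clean"],
    "orders_clean": ["orders_raw", "customers"],
    "orders_raw": [],
    "customers": [],
}

def upstream_suspects(dataset, failing_checks):
    """Walk the lineage upstream from `dataset` and return ancestors whose
    own quality checks are failing (candidate root causes)."""
    suspects, seen = [], {dataset}
    queue = deque(LINEAGE.get(dataset, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        if node in failing_checks:
            suspects.append(node)
        queue.extend(LINEAGE.get(node, []))
    return suspects
```

When `daily_report` looks wrong, tracing upstream and intersecting with the set of datasets whose checks are failing points directly at the likely source.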

7. Implementing Quality Checks

  • Place checks at pipeline boundaries: after extraction, after transformation, after loading
  • Run checks before and after: compare source row count to destination row count
  • Store check results in a quality metadata table: timestamp, check name, result, details
  • Start with the five critical checks: row count, null rate on key columns, duplicate rate, freshness, schema match. Add more as issues arise.
  • Make checks configurable: thresholds in config files, not hardcoded
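
A minimal check runner tying these points together might look like this; the config keys, check names, and the in-memory list standing in for a quality metadata table are all illustrative:

```python
from datetime import datetime, timezone

# Thresholds come from config, not hardcoded (values here are invented)
CONFIG = {"min_row_count": 100, "max_null_rate_user_id": 0.01}

quality_checks = []  # stand-in for a persistent quality metadata table

def record(check_name, passed, details):
    quality_checks.append({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "check": check_name,
        "result": "pass" if passed else "fail",
        "details": details,
    })

def run_checks(source_count, dest_count, null_rate):
    record("row_count", dest_count >= CONFIG["min_row_count"],
           f"rows={dest_count}")
    record("source_dest_match", source_count == dest_count,
           f"source={source_count} dest={dest_count}")
    record("null_rate_user_id", null_rate <= CONFIG["max_null_rate_user_id"],
           f"null_rate={null_rate}")

# Before/after comparison at a pipeline boundary: 2 rows were lost in transit
run_checks(source_count=120, dest_count=118, null_rate=0.0)
```

Each run appends timestamped rows, so the same table doubles as an audit trail and a time series for the anomaly detection described earlier.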

Examples

User: "Set up data quality checks for our customer table"
Agent: Implements checks across all six dimensions: completeness (null rate for email, name, created_at), uniqueness (customer_id has no duplicates), validity (email matches regex, status in allowed enum), consistency (country matches postal code format), timeliness (most recent created_at within 24 hours), and accuracy (spot check sample against CRM export). Categorizes each as error or warning severity and stores results in a quality_checks table.

User: "Our daily report numbers look wrong but the pipeline succeeded"
Agent: Adds anomaly detection to the pipeline: tracks row counts and key aggregates over time, alerts when values deviate more than 2 standard deviations from the 30-day rolling average. Adds cross-field consistency checks (order totals match line item sums) and a freshness check to ensure all source tables updated before the report runs.

User: "Define a data contract between the events team and analytics"
Agent: Creates a versioned contract in YAML specifying: schema (column names, types, nullable), quality thresholds (null rate < 1% for user_id, row count > 1000/day), freshness SLA (data available by 06:00 UTC), ownership (events team as producer, analytics as consumer), and contact/escalation information. Adds automated validation that runs on each delivery and notifies both teams on violations.