data-pipeline

Data Pipeline Expert

A data engineering specialist with extensive experience designing and operating production ETL/ELT pipelines, orchestration frameworks, and data quality systems. This skill provides guidance for building reliable, observable, and scalable data pipelines using industry-standard tools like Apache Airflow, Spark, and dbt across batch and streaming architectures.

Key Principles

  • Prefer ELT over ETL when your target warehouse can handle transformations; load raw data first, then transform in place for reproducibility and auditability
  • Design every pipeline step to be idempotent; re-running a task with the same inputs must produce the same outputs without side effects or duplicates
  • Partition data by time or logical keys at every stage; partitioning enables incremental processing, efficient pruning, and manageable backfill operations
  • Instrument pipelines with data quality checks between stages; catching bad data early prevents cascading corruption through downstream tables
  • Separate orchestration (when and what order) from computation (how); the scheduler should not perform heavy data processing itself
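
The idempotency principle above can be sketched as a delete-then-insert partition load. This is a minimal illustration using SQLite and a hypothetical `events` table (table and column names are invented for the example); re-running the same load leaves the table in the same state:

```python
import sqlite3

def load_partition(conn, partition_date, rows):
    """Idempotently load one date partition: deleting the partition before
    inserting means a re-run with the same inputs produces no duplicates."""
    with conn:  # one transaction: the delete and insert commit together
        conn.execute("DELETE FROM events WHERE event_date = ?", (partition_date,))
        conn.executemany(
            "INSERT INTO events (event_date, user_id, amount) VALUES (?, ?, ?)",
            [(partition_date, r["user_id"], r["amount"]) for r in rows],
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_date TEXT, user_id TEXT, amount REAL)")
rows = [{"user_id": "u1", "amount": 9.99}, {"user_id": "u2", "amount": 4.50}]
load_partition(conn, "2024-01-01", rows)
load_partition(conn, "2024-01-01", rows)  # re-run: same final state
count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(count)  # 2, not 4
```

Wrapping the delete and insert in a single transaction also matters: a failure mid-load rolls back cleanly instead of leaving a half-empty partition.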

Techniques

  • Build Airflow DAGs with task-level retries, timeouts, and SLAs; use sensors for external dependencies and XCom for lightweight inter-task communication
  • Design Spark jobs with proper partitioning (repartition/coalesce), broadcast joins for small dimension tables, and caching for reused DataFrames
  • Structure dbt projects with staging models (source cleaning), intermediate models (business logic), and mart models (final consumption tables)
  • Write dbt tests at multiple levels: schema tests (not_null, unique, accepted_values), relationship tests, and custom data tests for business rules
  • Implement data quality gates using frameworks like Great Expectations: define expectations on row counts, column distributions, and referential integrity
  • Use Change Data Capture (CDC) patterns with tools like Debezium to stream database changes into event pipelines without polling
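
The quality-gate technique above can be sketched in plain Python. This is not the Great Expectations API, just a minimal stand-in showing the gate concept (the check names and row shape are illustrative): validate between stages and fail loudly before downstream tables see bad data.

```python
def run_quality_gate(rows, min_rows=1):
    """Minimal data-quality gate: collect all failures, then raise so the
    pipeline halts before downstream stages consume bad data."""
    failures = []
    if len(rows) < min_rows:
        failures.append(f"row count {len(rows)} < required {min_rows}")
    for i, row in enumerate(rows):
        if row.get("user_id") is None:          # not_null-style check
            failures.append(f"row {i}: user_id is null")
        if row.get("amount", 0) < 0:            # range/distribution-style check
            failures.append(f"row {i}: negative amount {row['amount']}")
    if failures:
        raise ValueError("quality gate failed: " + "; ".join(failures))
    return True

print(run_quality_gate([{"user_id": "u1", "amount": 5.0}]))  # passes
try:
    run_quality_gate([{"user_id": None, "amount": -2.0}])
except ValueError as err:
    print("blocked:", err)
```

A real framework adds declarative expectation suites, profiling, and result stores, but the control flow is the same: checks run between stages, and a failure stops the pipeline rather than propagating corruption.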

Common Patterns

  • Incremental Load: Process only new or changed records using high-watermark columns (updated_at) or CDC events, falling back to full reload on schema changes
  • Backfill Strategy: Design DAGs with date-parameterized runs so historical reprocessing uses the same code path as daily runs, just with different date ranges
  • Dead Letter Queue: Route failed records to a separate table or topic for investigation and reprocessing instead of halting the entire pipeline
  • Schema Evolution: Use schema registries (Avro, Protobuf) or column-add-only policies to evolve data contracts without breaking downstream consumers
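
The incremental-load pattern can be sketched as a high-watermark filter. This is a simplified in-memory version (the `updated_at` field matches the pattern above; the row shape is illustrative): each run extracts only rows newer than the stored watermark, then advances it.

```python
def incremental_extract(source_rows, high_watermark):
    """Return rows changed since the last run plus the new watermark.
    A watermark of None means an initial or full reload."""
    if high_watermark is None:
        new_rows = list(source_rows)
    else:
        new_rows = [r for r in source_rows if r["updated_at"] > high_watermark]
    # Advance the watermark only if we actually saw newer rows.
    new_wm = max((r["updated_at"] for r in new_rows), default=high_watermark)
    return new_rows, new_wm

source = [
    {"id": 1, "updated_at": "2024-01-01T00:00:00"},
    {"id": 2, "updated_at": "2024-01-02T00:00:00"},
    {"id": 3, "updated_at": "2024-01-03T00:00:00"},
]
rows, wm = incremental_extract(source, "2024-01-01T00:00:00")
print(len(rows), wm)  # ids 2 and 3 only; watermark advances to the newest row
```

In production the watermark lives in durable state (a metadata table or the orchestrator's variable store), and string timestamps are replaced by proper timezone-aware datetimes, but the read-filter-advance loop is the same.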

Pitfalls to Avoid

  • Do not perform heavy computation inside Airflow operators; delegate to Spark, dbt, or external compute and use Airflow only for orchestration
  • Do not skip data validation after ingestion; silent schema changes from upstream sources are the most common cause of pipeline failures
  • Do not hardcode connection strings or credentials in pipeline code; use secrets managers and environment-based configuration
  • Do not run full table scans on every pipeline execution when incremental processing is feasible; it wastes compute and increases latency
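
The no-hardcoded-credentials rule above can be sketched with environment-based configuration. The variable names (`PIPELINE_DB_*`) and the `warehouse` database are invented for the example; in production the values would come from a secrets manager injected into the environment at deploy time.

```python
import os

def get_db_url():
    """Build a connection URL from the environment instead of hardcoding it.
    Required settings fail loudly when missing so misconfiguration is caught
    at startup, not mid-pipeline."""
    try:
        host = os.environ["PIPELINE_DB_HOST"]
        password = os.environ["PIPELINE_DB_PASSWORD"]
    except KeyError as missing:
        raise RuntimeError(f"missing required config: {missing}") from None
    user = os.environ.get("PIPELINE_DB_USER", "etl")  # sensible default
    return f"postgresql://{user}:{password}@{host}/warehouse"

# Simulate what the deploy environment / secrets manager would provide:
os.environ["PIPELINE_DB_HOST"] = "db.internal"
os.environ["PIPELINE_DB_PASSWORD"] = "s3cret"
print(get_db_url())
```

Keeping credentials out of the codebase also keeps them out of version control and logs, and lets the same pipeline code run unchanged across dev, staging, and production.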