data-pipeline

Data Pipeline Expert

A data engineering specialist with extensive experience designing and operating production ETL/ELT pipelines, orchestration frameworks, and data quality systems. This skill provides guidance for building reliable, observable, and scalable data pipelines using industry-standard tools like Apache Airflow, Spark, and dbt across batch and streaming architectures.

Key Principles

  • Prefer ELT over ETL when your target warehouse can handle transformations; load raw data first, then transform in place for reproducibility and auditability
  • Design every pipeline step to be idempotent; re-running a task with the same inputs must produce the same outputs without side effects or duplicates
  • Partition data by time or logical keys at every stage; partitioning enables incremental processing, efficient pruning, and manageable backfill operations
  • Instrument pipelines with data quality checks between stages; catching bad data early prevents cascading corruption through downstream tables
  • Separate orchestration (when and what order) from computation (how); the scheduler should not perform heavy data processing itself
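
The idempotency principle above can be sketched as a delete-then-insert partition load. This is a minimal illustration using SQLite and a hypothetical `events` table (table and column names are invented for the example); re-running the same load leaves the table in the same state:

```python
import sqlite3

def load_partition(conn, partition_date, rows):
    """Idempotently load one date partition: deleting the partition before
    inserting means a re-run with the same inputs produces no duplicates."""
    with conn:  # one transaction: the delete and insert commit together
        conn.execute("DELETE FROM events WHERE event_date = ?", (partition_date,))
        conn.executemany(
            "INSERT INTO events (event_date, user_id, amount) VALUES (?, ?, ?)",
            [(partition_date, r["user_id"], r["amount"]) for r in rows],
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_date TEXT, user_id TEXT, amount REAL)")
rows = [{"user_id": "u1", "amount": 9.99}, {"user_id": "u2", "amount": 4.50}]
load_partition(conn, "2024-01-01", rows)
load_partition(conn, "2024-01-01", rows)  # re-run: same final state
count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(count)  # 2, not 4
```

Wrapping the delete and insert in a single transaction also matters: a failure mid-load rolls back cleanly instead of leaving a half-empty partition.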

Techniques

  • Build Airflow DAGs with task-level retries, timeouts, and SLAs; use sensors for external dependencies and XCom for lightweight inter-task communication
  • Design Spark jobs with proper partitioning (repartition/coalesce), broadcast joins for small dimension tables, and caching for reused DataFrames
  • Structure dbt projects with staging models (source cleaning), intermediate models (business logic), and mart models (final consumption tables)
  • Write dbt tests at multiple levels: schema tests (not_null, unique, accepted_values), relationship tests, and custom data tests for business rules
  • Implement data quality gates using frameworks like Great Expectations: define expectations on row counts, column distributions, and referential integrity
  • Use Change Data Capture (CDC) patterns with tools like Debezium to stream database changes into event pipelines without polling
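
The quality-gate technique above can be sketched in plain Python. This is not the Great Expectations API, just a minimal stand-in showing the gate concept (the check names and row shape are illustrative): validate between stages and fail loudly before downstream tables see bad data.

```python
def run_quality_gate(rows, min_rows=1):
    """Minimal data-quality gate: collect all failures, then raise so the
    pipeline halts before downstream stages consume bad data."""
    failures = []
    if len(rows) < min_rows:
        failures.append(f"row count {len(rows)} < required {min_rows}")
    for i, row in enumerate(rows):
        if row.get("user_id") is None:          # not_null-style check
            failures.append(f"row {i}: user_id is null")
        if row.get("amount", 0) < 0:            # range/distribution-style check
            failures.append(f"row {i}: negative amount {row['amount']}")
    if failures:
        raise ValueError("quality gate failed: " + "; ".join(failures))
    return True

print(run_quality_gate([{"user_id": "u1", "amount": 5.0}]))  # passes
try:
    run_quality_gate([{"user_id": None, "amount": -2.0}])
except ValueError as err:
    print("blocked:", err)
```

A real framework adds declarative expectation suites, profiling, and result stores, but the control flow is the same: checks run between stages, and a failure stops the pipeline rather than propagating corruption.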

Common Patterns

  • Incremental Load: Process only new or changed records using high-watermark columns (updated_at) or CDC events, falling back to full reload on schema changes
  • Backfill Strategy: Design DAGs with date-parameterized runs so historical reprocessing uses the same code path as daily runs, just with different date ranges
  • Dead Letter Queue: Route failed records to a separate table or topic for investigation and reprocessing instead of halting the entire pipeline
  • Schema Evolution: Use schema registries (Avro, Protobuf) or column-add-only policies to evolve data contracts without breaking downstream consumers
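
The incremental-load pattern can be sketched as a high-watermark filter. This is a simplified in-memory version (the `updated_at` field matches the pattern above; the row shape is illustrative): each run extracts only rows newer than the stored watermark, then advances it.

```python
def incremental_extract(source_rows, high_watermark):
    """Return rows changed since the last run plus the new watermark.
    A watermark of None means an initial or full reload."""
    if high_watermark is None:
        new_rows = list(source_rows)
    else:
        new_rows = [r for r in source_rows if r["updated_at"] > high_watermark]
    # Advance the watermark only if we actually saw newer rows.
    new_wm = max((r["updated_at"] for r in new_rows), default=high_watermark)
    return new_rows, new_wm

source = [
    {"id": 1, "updated_at": "2024-01-01T00:00:00"},
    {"id": 2, "updated_at": "2024-01-02T00:00:00"},
    {"id": 3, "updated_at": "2024-01-03T00:00:00"},
]
rows, wm = incremental_extract(source, "2024-01-01T00:00:00")
print(len(rows), wm)  # ids 2 and 3 only; watermark advances to the newest row
```

In production the watermark lives in durable state (a metadata table or the orchestrator's variable store), and string timestamps are replaced by proper timezone-aware datetimes, but the read-filter-advance loop is the same.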

Pitfalls to Avoid

  • Do not perform heavy computation inside Airflow operators; delegate to Spark, dbt, or external compute and use Airflow only for orchestration
  • Do not skip data validation after ingestion; silent schema changes from upstream sources are the most common cause of pipeline failures
  • Do not hardcode connection strings or credentials in pipeline code; use secrets managers and environment-based configuration
  • Do not run full table scans on every pipeline execution when incremental processing is feasible; it wastes compute and increases latency
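
The no-hardcoded-credentials rule above can be sketched with environment-based configuration. The variable names (`PIPELINE_DB_*`) and the `warehouse` database are invented for the example; in production the values would come from a secrets manager injected into the environment at deploy time.

```python
import os

def get_db_url():
    """Build a connection URL from the environment instead of hardcoding it.
    Required settings fail loudly when missing so misconfiguration is caught
    at startup, not mid-pipeline."""
    try:
        host = os.environ["PIPELINE_DB_HOST"]
        password = os.environ["PIPELINE_DB_PASSWORD"]
    except KeyError as missing:
        raise RuntimeError(f"missing required config: {missing}") from None
    user = os.environ.get("PIPELINE_DB_USER", "etl")  # sensible default
    return f"postgresql://{user}:{password}@{host}/warehouse"

# Simulate what the deploy environment / secrets manager would provide:
os.environ["PIPELINE_DB_HOST"] = "db.internal"
os.environ["PIPELINE_DB_PASSWORD"] = "s3cret"
print(get_db_url())
```

Keeping credentials out of the codebase also keeps them out of version control and logs, and lets the same pipeline code run unchanged across dev, staging, and production.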