tech-data-pipeline


# Data Pipeline Design

## Framework

### IRON LAW: Data Quality Checks at Every Stage

A pipeline that moves bad data fast is worse than no pipeline — it
corrupts downstream analytics and decisions. Every pipeline stage
(extract, transform, load) must have data quality checks:
row counts, null checks, schema validation, freshness checks.

"Garbage in, garbage out" is not a warning — it's a guarantee.
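
As a minimal sketch of what those checks can look like in code, assuming each batch arrives as a pandas DataFrame with a tz-aware `updated_at` column (the function, column, and threshold choices here are hypothetical):

```python
import pandas as pd

def check_batch(df: pd.DataFrame, expected_schema: dict[str, str],
                prior_row_count: int, critical_cols: list[str]) -> list[str]:
    """Run the baseline checks on one batch; return a list of failure messages."""
    failures = []

    # Row count: flag if we deviate more than ±10% from the prior run.
    if prior_row_count and abs(len(df) - prior_row_count) / prior_row_count > 0.10:
        failures.append(f"row count {len(df)} outside ±10% of prior {prior_row_count}")

    # Null check: critical columns must have no unexpected nulls.
    for col in critical_cols:
        if df[col].isna().any():
            failures.append(f"nulls in critical column {col!r}")

    # Schema validation: column names and dtypes must match expectations.
    actual = {col: str(dtype) for col, dtype in df.dtypes.items()}
    if actual != expected_schema:
        failures.append(f"schema drift: expected {expected_schema}, got {actual}")

    # Freshness: the newest record should be recent (6h threshold is illustrative).
    age = pd.Timestamp.now(tz="UTC") - df["updated_at"].max()
    if age > pd.Timedelta(hours=6):
        failures.append(f"stale data: newest record is {age} old")

    return failures
```

A stage passes only when the returned list is empty; anything else should halt the run rather than push suspect data downstream.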

### ETL vs ELT

| Aspect | ETL (Extract, Transform, Load) | ELT (Extract, Load, Transform) |
|--------|--------------------------------|--------------------------------|
| Transform where? | Before loading (in pipeline) | After loading (in warehouse) |
| Best for | Structured data, compliance-heavy | Cloud warehouses (BigQuery, Snowflake) |
| Flexibility | Less (transform logic is fixed) | More (transform in SQL after loading) |
| Cost | Compute in pipeline | Compute in warehouse |
| Trend | Legacy/on-prem | Modern/cloud-native |
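
To make the ELT column concrete, here is a minimal sketch using sqlite3 as a stand-in for a cloud warehouse (table names are made up): raw rows are landed untouched, and the transform happens afterwards in SQL.

```python
import sqlite3

conn = sqlite3.connect("warehouse.db")  # stand-in for BigQuery/Snowflake

# E + L: land the raw data exactly as extracted, no transform in the pipeline.
conn.execute("CREATE TABLE IF NOT EXISTS raw_orders (id, amount, created_at)")
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)",
                 [(1, "19.99", "2024-01-01"), (2, "5.00", "2024-01-02")])

# T: transform after loading, in SQL, where the logic is cheap to change.
conn.execute("""
    CREATE TABLE IF NOT EXISTS orders AS
    SELECT id,
           CAST(amount AS REAL) AS amount,
           DATE(created_at)     AS order_date
    FROM raw_orders
""")
conn.commit()
```

This load-then-transform split is what dbt formalizes: extract/load tools land raw tables, and dbt owns the SQL transform layer.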

### Pipeline Architecture

```
[Sources] → [Extract] → [Stage] → [Transform] → [Load] → [Serve]
   ↑                                                          ↓
   |              [Quality Checks at every stage]        [Dashboard]
   |              [Monitoring & Alerting]                [API]
   └──────────────── [Orchestrator (Airflow/Prefect)] ──────┘
```
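
A minimal Airflow sketch of this shape, assuming the stage callables (`extract`, `check_quality`, `transform`, `load`) live in a hypothetical `my_pipeline` module (`schedule` is the Airflow 2.4+ spelling; older versions use `schedule_interval`):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

from my_pipeline import extract, check_quality, transform, load  # hypothetical module

with DAG(
    dag_id="orders_pipeline",        # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_check = PythonOperator(task_id="quality_check_raw", python_callable=check_quality)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)
    t_check_load = PythonOperator(task_id="quality_check_loaded", python_callable=check_quality)

    # Quality checks gate each hand-off, mirroring the diagram above;
    # a failing check fails its task and stops everything downstream.
    t_extract >> t_check >> t_transform >> t_load >> t_check_load
```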

### Data Source Types

| Source | Extraction Method | Challenges |
|--------|-------------------|------------|
| Database | CDC (Change Data Capture), bulk query, replication | Schema changes, performance impact on source |
| API | REST/GraphQL polling, webhooks | Rate limits, pagination, auth token refresh |
| Files | S3/GCS pickup, SFTP, email attachment | Format inconsistency, encoding issues |
| Streaming | Kafka, Kinesis, Pub/Sub | Ordering, exactly-once processing |
| SaaS tools | Pre-built connectors (Fivetran, Airbyte) | API changes, data model complexity |
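
For the API row, a sketch of polling a paginated REST endpoint while respecting rate limits (the `/records` endpoint and page-based pagination are assumptions; real APIs vary):

```python
import time

import requests

def extract_api(base_url: str, token: str) -> list[dict]:
    """Pull every page from a hypothetical /records endpoint."""
    rows, page = [], 1
    while True:
        resp = requests.get(
            f"{base_url}/records",
            headers={"Authorization": f"Bearer {token}"},
            params={"page": page, "per_page": 100},
            timeout=30,
        )
        if resp.status_code == 429:  # rate limited: honor Retry-After, then retry
            time.sleep(int(resp.headers.get("Retry-After", "5")))
            continue
        resp.raise_for_status()
        batch = resp.json()
        if not batch:                # an empty page signals the last page
            return rows
        rows.extend(batch)
        page += 1
```

Token refresh is the other recurring headache: on a 401, re-authenticate and retry rather than failing the whole run.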

### Orchestration Tools

| Tool | Type | Best For | Complexity |
|------|------|----------|------------|
| Airflow | Python DAGs | Complex pipelines, team of engineers | High |
| Prefect | Python, modern API | Simpler than Airflow, good DX | Medium |
| dbt | SQL transforms only | Transform layer in ELT | Low-Medium |
| Cron | Simple scheduling | Single script, low complexity | Low |
| Fivetran/Airbyte | Managed connectors | Extract + Load (no transform) | Low |
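
For contrast with Airflow's operator wiring, the same pipeline shape in Prefect 2.x is plain decorated Python (function bodies here are placeholders):

```python
from prefect import flow, task

@task(retries=2)
def extract() -> list[dict]:
    ...  # pull rows from the source

@task
def transform(rows: list[dict]) -> list[dict]:
    ...  # clean and reshape

@task
def load(rows: list[dict]) -> None:
    ...  # write to the warehouse

@flow(name="orders_pipeline")
def orders_pipeline() -> None:
    load(transform(extract()))

if __name__ == "__main__":
    orders_pipeline()
```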

### Data Quality Framework

| Check | What It Validates | When |
|-------|-------------------|------|
| Row count | Expected number of rows (within ±10% of prior run) | After extract, after load |
| Null check | Critical columns have no unexpected nulls | After extract |
| Schema validation | Column names, types match expected | After extract |
| Freshness | Data is recent (not stale) | After load |
| Uniqueness | No duplicate primary keys | After load |
| Range check | Values within expected bounds | After transform |
| Referential integrity | Foreign keys match parent tables | After load |
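
The post-load checks (uniqueness, referential integrity) are usually a handful of SQL assertions run inside the warehouse. A sketch with hypothetical `orders`/`customers` tables, where any non-zero count fails the run:

```python
# Each query must return 0; any other result fails the pipeline run.
POST_LOAD_CHECKS = {
    "uniqueness": """
        SELECT COUNT(*) FROM (
            SELECT order_id FROM orders
            GROUP BY order_id HAVING COUNT(*) > 1
        ) AS dupes
    """,
    "referential_integrity": """
        SELECT COUNT(*) FROM orders o
        LEFT JOIN customers c ON o.customer_id = c.customer_id
        WHERE c.customer_id IS NULL
    """,
}

def run_post_load_checks(cursor) -> None:
    """Execute each check with a DB-API cursor; raise on the first violation."""
    for name, sql in POST_LOAD_CHECKS.items():
        cursor.execute(sql)
        violations = cursor.fetchone()[0]
        if violations:
            raise ValueError(f"{name} check failed: {violations} violating rows")
```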

### Pipeline Design Steps

1. Map sources and destinations: What data, from where, to where?
2. Define freshness requirements: Real-time? Hourly? Daily?
3. Choose architecture: ETL or ELT based on tools and team
4. Build incrementally: Start with one source, one destination, one schedule
5. Add quality checks: At minimum: row count + null check + freshness
6. Set up monitoring: Alert on failure, quality check violations, latency (see the sketch after this list)
7. Document: Data dictionary, pipeline diagram, SLAs
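
For step 6, a minimal alerting sketch: post failures and quality violations to a Slack incoming webhook (the URL is a placeholder; production setups usually wire this into the orchestrator's on-failure callback):

```python
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def alert(pipeline: str, message: str) -> None:
    """Send a pipeline alert to a Slack channel via incoming webhook."""
    requests.post(
        SLACK_WEBHOOK_URL,
        json={"text": f":rotating_light: {pipeline}: {message}"},
        timeout=10,
    )

# Example: call from a failed quality check.
# alert("orders_pipeline", "row count outside ±10% of prior run")
```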

## Output Format

```markdown
# Data Pipeline Design: {Project}

## Sources & Destinations

| Source | Type | Destination | Freshness | Volume |
|--------|------|-------------|-----------|--------|
| {source} | DB/API/File | {dest} | {daily/hourly} | {rows/day} |

## Architecture

- Pattern: ETL / ELT
- Orchestrator: {tool}
- Transform: {tool/SQL}
- Quality: {tool/custom checks}

## Pipeline Diagram

{Source} → {Extract} → {Stage} → {Transform} → {Load} → {Serve}

## Quality Checks

| Stage | Check | Threshold | Alert |
|-------|-------|-----------|-------|
| Extract | Row count | ±10% of prior | Slack alert |
| Load | Freshness | < 6 hours old | PagerDuty |

## Schedule

| Pipeline | Frequency | Start Time | SLA |
|----------|-----------|------------|-----|
| {name} | {daily/hourly} | {time} | Data ready by {time} |
```

## Gotchas

- Idempotency is essential: A pipeline that runs twice should produce the same result as running once. Use upsert (not insert) and date-partitioned loads (see the sketch after this list).
- Schema drift: Source systems change schemas without warning. Build schema detection and alerting.
- Backfill capability: When a pipeline fails for 3 days, can you rerun for those days without duplicating data? Design for this from day 1.
- Don't build what you can buy: Fivetran/Airbyte handle 200+ source connectors. Writing a custom Salesforce extractor is rarely worth the engineering time.
- Data warehouse vs data lake: Warehouse (BigQuery, Snowflake) = structured, SQL-queryable. Lake (S3, GCS) = raw, any format. Most modern stacks use both (lakehouse pattern).
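
A minimal sketch of the idempotency bullet, using SQLite's `INSERT ... ON CONFLICT` upsert as a stand-in for a warehouse `MERGE` (table and key names are made up): running the load twice leaves the table identical.

```python
import sqlite3

conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS orders (id INTEGER PRIMARY KEY, amount REAL)")

rows = [(1, 19.99), (2, 5.00)]

# Upsert keyed on the primary key: a rerun updates rows in place instead of
# duplicating them, which is what makes the load safe to retry or backfill.
conn.executemany(
    """
    INSERT INTO orders (id, amount) VALUES (?, ?)
    ON CONFLICT (id) DO UPDATE SET amount = excluded.amount
    """,
    rows,
)
conn.commit()
```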

## References

- For dbt project structure, see references/dbt-guide.md
- For data warehouse modeling (star schema), see references/dimensional-modeling.md