tech-data-pipeline


# Data Pipeline Design

## Framework

### IRON LAW: Data Quality Checks at Every Stage

A pipeline that moves bad data fast is worse than no pipeline — it
corrupts downstream analytics and decisions. Every pipeline stage
(extract, transform, load) must have data quality checks:
row counts, null checks, schema validation, freshness checks.

"Garbage in, garbage out" is not a warning — it's a guarantee.
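
As a minimal sketch of what those checks can look like in code, assuming each batch arrives as a pandas DataFrame with a tz-aware `updated_at` column (the function, column, and threshold choices here are hypothetical):

```python
import pandas as pd

def check_batch(df: pd.DataFrame, expected_schema: dict[str, str],
                prior_row_count: int, critical_cols: list[str]) -> list[str]:
    """Run the baseline checks on one batch; return a list of failure messages."""
    failures = []

    # Row count: flag if we deviate more than ±10% from the prior run.
    if prior_row_count and abs(len(df) - prior_row_count) / prior_row_count > 0.10:
        failures.append(f"row count {len(df)} outside ±10% of prior {prior_row_count}")

    # Null check: critical columns must have no unexpected nulls.
    for col in critical_cols:
        if df[col].isna().any():
            failures.append(f"nulls in critical column {col!r}")

    # Schema validation: column names and dtypes must match expectations.
    actual = {col: str(dtype) for col, dtype in df.dtypes.items()}
    if actual != expected_schema:
        failures.append(f"schema drift: expected {expected_schema}, got {actual}")

    # Freshness: the newest record should be recent (6h threshold is illustrative).
    age = pd.Timestamp.now(tz="UTC") - df["updated_at"].max()
    if age > pd.Timedelta(hours=6):
        failures.append(f"stale data: newest record is {age} old")

    return failures
```

A stage passes only when the returned list is empty; anything else should halt the run rather than push suspect data downstream.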

### ETL vs ELT

| Aspect | ETL (Extract, Transform, Load) | ELT (Extract, Load, Transform) |
|--------|--------------------------------|--------------------------------|
| Transform where? | Before loading (in pipeline) | After loading (in warehouse) |
| Best for | Structured data, compliance-heavy | Cloud warehouses (BigQuery, Snowflake) |
| Flexibility | Less (transform logic is fixed) | More (transform in SQL after loading) |
| Cost | Compute in pipeline | Compute in warehouse |
| Trend | Legacy/on-prem | Modern/cloud-native |
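
To make the ELT column concrete, here is a minimal sketch using sqlite3 as a stand-in for a cloud warehouse (table names are made up): raw rows are landed untouched, and the transform happens afterwards in SQL.

```python
import sqlite3

conn = sqlite3.connect("warehouse.db")  # stand-in for BigQuery/Snowflake

# E + L: land the raw data exactly as extracted, no transform in the pipeline.
conn.execute("CREATE TABLE IF NOT EXISTS raw_orders (id, amount, created_at)")
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)",
                 [(1, "19.99", "2024-01-01"), (2, "5.00", "2024-01-02")])

# T: transform after loading, in SQL, where the logic is cheap to change.
conn.execute("""
    CREATE TABLE IF NOT EXISTS orders AS
    SELECT id,
           CAST(amount AS REAL) AS amount,
           DATE(created_at)     AS order_date
    FROM raw_orders
""")
conn.commit()
```

This load-then-transform split is what dbt formalizes: extract/load tools land raw tables, and dbt owns the SQL transform layer.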

### Pipeline Architecture

```
[Sources] → [Extract] → [Stage] → [Transform] → [Load] → [Serve]
   ↑                                                          ↓
   |              [Quality Checks at every stage]        [Dashboard]
   |              [Monitoring & Alerting]                [API]
   └──────────────── [Orchestrator (Airflow/Prefect)] ──────┘
```
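
A minimal Airflow sketch of this shape, assuming the stage callables (`extract`, `check_quality`, `transform`, `load`) live in a hypothetical `my_pipeline` module (`schedule` is the Airflow 2.4+ spelling; older versions use `schedule_interval`):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

from my_pipeline import extract, check_quality, transform, load  # hypothetical module

with DAG(
    dag_id="orders_pipeline",        # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_check = PythonOperator(task_id="quality_check_raw", python_callable=check_quality)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)
    t_check_load = PythonOperator(task_id="quality_check_loaded", python_callable=check_quality)

    # Quality checks gate each hand-off, mirroring the diagram above;
    # a failing check fails its task and stops everything downstream.
    t_extract >> t_check >> t_transform >> t_load >> t_check_load
```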

### Data Source Types

| Source | Extraction Method | Challenges |
|--------|-------------------|------------|
| Database | CDC (Change Data Capture), bulk query, replication | Schema changes, performance impact on source |
| API | REST/GraphQL polling, webhooks | Rate limits, pagination, auth token refresh |
| Files | S3/GCS pickup, SFTP, email attachment | Format inconsistency, encoding issues |
| Streaming | Kafka, Kinesis, Pub/Sub | Ordering, exactly-once processing |
| SaaS tools | Pre-built connectors (Fivetran, Airbyte) | API changes, data model complexity |
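
For the API row, a sketch of polling a paginated REST endpoint while respecting rate limits (the `/records` endpoint and page-based pagination are assumptions; real APIs vary):

```python
import time

import requests

def extract_api(base_url: str, token: str) -> list[dict]:
    """Pull every page from a hypothetical /records endpoint."""
    rows, page = [], 1
    while True:
        resp = requests.get(
            f"{base_url}/records",
            headers={"Authorization": f"Bearer {token}"},
            params={"page": page, "per_page": 100},
            timeout=30,
        )
        if resp.status_code == 429:  # rate limited: honor Retry-After, then retry
            time.sleep(int(resp.headers.get("Retry-After", "5")))
            continue
        resp.raise_for_status()
        batch = resp.json()
        if not batch:                # an empty page signals the last page
            return rows
        rows.extend(batch)
        page += 1
```

Token refresh is the other recurring headache: on a 401, re-authenticate and retry rather than failing the whole run.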

### Orchestration Tools

| Tool | Type | Best For | Complexity |
|------|------|----------|------------|
| Airflow | Python DAGs | Complex pipelines, team of engineers | High |
| Prefect | Python, modern API | Simpler than Airflow, good DX | Medium |
| dbt | SQL transforms only | Transform layer in ELT | Low-Medium |
| Cron | Simple scheduling | Single script, low complexity | Low |
| Fivetran/Airbyte | Managed connectors | Extract + Load (no transform) | Low |
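
For contrast with Airflow's operator wiring, the same pipeline shape in Prefect 2.x is plain decorated Python (function bodies here are placeholders):

```python
from prefect import flow, task

@task(retries=2)
def extract() -> list[dict]:
    ...  # pull rows from the source

@task
def transform(rows: list[dict]) -> list[dict]:
    ...  # clean and reshape

@task
def load(rows: list[dict]) -> None:
    ...  # write to the warehouse

@flow(name="orders_pipeline")
def orders_pipeline() -> None:
    load(transform(extract()))

if __name__ == "__main__":
    orders_pipeline()
```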

### Data Quality Framework

| Check | What It Validates | When |
|-------|-------------------|------|
| Row count | Expected number of rows (within ±10% of prior run) | After extract, after load |
| Null check | Critical columns have no unexpected nulls | After extract |
| Schema validation | Column names, types match expected | After extract |
| Freshness | Data is recent (not stale) | After load |
| Uniqueness | No duplicate primary keys | After load |
| Range check | Values within expected bounds | After transform |
| Referential integrity | Foreign keys match parent tables | After load |
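
The post-load checks (uniqueness, referential integrity) are usually a handful of SQL assertions run inside the warehouse. A sketch with hypothetical `orders`/`customers` tables, where any non-zero count fails the run:

```python
# Each query must return 0; any other result fails the pipeline run.
POST_LOAD_CHECKS = {
    "uniqueness": """
        SELECT COUNT(*) FROM (
            SELECT order_id FROM orders
            GROUP BY order_id HAVING COUNT(*) > 1
        ) AS dupes
    """,
    "referential_integrity": """
        SELECT COUNT(*) FROM orders o
        LEFT JOIN customers c ON o.customer_id = c.customer_id
        WHERE c.customer_id IS NULL
    """,
}

def run_post_load_checks(cursor) -> None:
    """Execute each check with a DB-API cursor; raise on the first violation."""
    for name, sql in POST_LOAD_CHECKS.items():
        cursor.execute(sql)
        violations = cursor.fetchone()[0]
        if violations:
            raise ValueError(f"{name} check failed: {violations} violating rows")
```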

### Pipeline Design Steps

1. Map sources and destinations: What data, from where, to where?
2. Define freshness requirements: Real-time? Hourly? Daily?
3. Choose architecture: ETL or ELT based on tools and team
4. Build incrementally: Start with one source, one destination, one schedule
5. Add quality checks: At minimum: row count + null check + freshness
6. Set up monitoring: Alert on failure, quality check violations, latency (see the sketch after this list)
7. Document: Data dictionary, pipeline diagram, SLAs
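
For step 6, a minimal alerting sketch: post failures and quality violations to a Slack incoming webhook (the URL is a placeholder; production setups usually wire this into the orchestrator's on-failure callback):

```python
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def alert(pipeline: str, message: str) -> None:
    """Send a pipeline alert to a Slack channel via incoming webhook."""
    requests.post(
        SLACK_WEBHOOK_URL,
        json={"text": f":rotating_light: {pipeline}: {message}"},
        timeout=10,
    )

# Example: call from a failed quality check.
# alert("orders_pipeline", "row count outside ±10% of prior run")
```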

## Output Format

```markdown
# Data Pipeline Design: {Project}

## Sources & Destinations

| Source | Type | Destination | Freshness | Volume |
|--------|------|-------------|-----------|--------|
| {source} | DB/API/File | {dest} | {daily/hourly} | {rows/day} |

## Architecture

- Pattern: ETL / ELT
- Orchestrator: {tool}
- Transform: {tool/SQL}
- Quality: {tool/custom checks}

## Pipeline Diagram

{Source} → {Extract} → {Stage} → {Transform} → {Load} → {Serve}

## Quality Checks

| Stage | Check | Threshold | Alert |
|-------|-------|-----------|-------|
| Extract | Row count | ±10% of prior | Slack alert |
| Load | Freshness | < 6 hours old | PagerDuty |

## Schedule

| Pipeline | Frequency | Start Time | SLA |
|----------|-----------|------------|-----|
| {name} | {daily/hourly} | {time} | Data ready by {time} |
```

## Gotchas

- Idempotency is essential: A pipeline that runs twice should produce the same result as running once. Use upsert (not insert) and date-partitioned loads (see the sketch after this list).
- Schema drift: Source systems change schemas without warning. Build schema detection and alerting.
- Backfill capability: When a pipeline fails for 3 days, can you rerun for those days without duplicating data? Design for this from day 1.
- Don't build what you can buy: Fivetran/Airbyte handle 200+ source connectors. Writing a custom Salesforce extractor is rarely worth the engineering time.
- Data warehouse vs data lake: Warehouse (BigQuery, Snowflake) = structured, SQL-queryable. Lake (S3, GCS) = raw, any format. Most modern stacks use both (lakehouse pattern).
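
A minimal sketch of the idempotency bullet, using SQLite's `INSERT ... ON CONFLICT` upsert as a stand-in for a warehouse `MERGE` (table and key names are made up): running the load twice leaves the table identical.

```python
import sqlite3

conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS orders (id INTEGER PRIMARY KEY, amount REAL)")

rows = [(1, 19.99), (2, 5.00)]

# Upsert keyed on the primary key: a rerun updates rows in place instead of
# duplicating them, which is what makes the load safe to retry or backfill.
conn.executemany(
    """
    INSERT INTO orders (id, amount) VALUES (?, ?)
    ON CONFLICT (id) DO UPDATE SET amount = excluded.amount
    """,
    rows,
)
conn.commit()
```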

## References

- For dbt project structure, see references/dbt-guide.md
- For data warehouse modeling (star schema), see references/dimensional-modeling.md