Data Pipeline Design
Framework
IRON LAW: Data Quality Checks at Every Stage
A pipeline that moves bad data fast is worse than no pipeline — it
corrupts downstream analytics and decisions. Every pipeline stage
(extract, transform, load) must have data quality checks:
row counts, null checks, schema validation, freshness checks.
"Garbage in, garbage out" is not a warning — it's a guarantee.铁律:每个阶段都要进行数据质量检查
快速传输错误数据的管道比没有管道更糟糕——它会破坏下游的分析和决策。管道的每个阶段(提取、转换、加载)都必须进行数据质量检查:行数校验、空值检查、schema验证、新鲜度检查。
“垃圾进,垃圾出”不是警告——而是必然结果。ETL vs ELT
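As a concrete minimum, here is a sketch of post-extract checks in Python with pandas; the column names, the expected schema, and the ±10% and 6-hour thresholds are assumptions for illustration, not part of any particular tool.

```python
import pandas as pd

def check_extract(df: pd.DataFrame, prior_row_count: int) -> None:
    # Row count: flag runs that deviate more than ±10% from the prior run.
    if prior_row_count and abs(len(df) - prior_row_count) / prior_row_count > 0.10:
        raise ValueError(f"Row count {len(df)} outside ±10% of prior {prior_row_count}")

    # Null check: critical columns must have no unexpected nulls.
    for col in ("id", "updated_at"):
        if df[col].isna().any():
            raise ValueError(f"Unexpected nulls in critical column {col!r}")

    # Schema validation: column names must match what downstream expects.
    expected = {"id", "updated_at", "amount"}  # assumed schema for the example
    if set(df.columns) != expected:
        raise ValueError(f"Schema drift: {sorted(df.columns)} != {sorted(expected)}")

    # Freshness: newest record should be recent (assumes tz-aware UTC timestamps).
    age = pd.Timestamp.now(tz="UTC") - df["updated_at"].max()
    if age > pd.Timedelta(hours=6):
        raise ValueError(f"Stale data: newest record is {age} old")
```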
ETL vs ELT
| Aspect | ETL (Extract, Transform, Load) | ELT (Extract, Load, Transform) |
|---|---|---|
| Transform where? | Before loading (in pipeline) | After loading (in warehouse) |
| Best for | Structured data, compliance-heavy | Cloud warehouses (BigQuery, Snowflake) |
| Flexibility | Less (transform logic is fixed) | More (transform in SQL after loading) |
| Cost | Compute in pipeline | Compute in warehouse |
| Trend | Legacy/on-prem | Modern/cloud-native |
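To make the right-hand column concrete, a minimal ELT sketch against BigQuery: raw files are loaded untouched, then transformed with SQL inside the warehouse. The bucket path, project, and table names are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()

# 1. Extract + Load: ship raw files straight into a staging table, no transforms.
load_job = client.load_table_from_uri(
    "gs://my-bucket/orders/2024-01-01/*.csv",  # placeholder path
    "my_project.staging.orders_raw",           # placeholder table
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        autodetect=True,
    ),
)
load_job.result()  # block until the load finishes

# 2. Transform: business logic lives in SQL and can be changed after the fact.
client.query("""
    CREATE OR REPLACE TABLE my_project.analytics.orders AS
    SELECT order_id,
           CAST(amount AS NUMERIC) AS amount,
           DATE(created_at) AS order_date
    FROM my_project.staging.orders_raw
    WHERE order_id IS NOT NULL
""").result()
```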
Pipeline Architecture
```
[Sources] → [Extract] → [Stage] → [Transform] → [Load] → [Serve]
    ↑                                                       ↓
    |   [Quality Checks at every stage]                [Dashboard]
    |   [Monitoring & Alerting]                        [API]
    └──────────────── [Orchestrator (Airflow/Prefect)] ─────┘
```
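As one way to wire this up, a minimal Airflow DAG matching the diagram; the `my_pipeline` module and its four callables are hypothetical stand-ins for the real stage logic.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical module holding the actual stage implementations.
from my_pipeline import extract, quality_check, transform, load

with DAG(
    dag_id="orders_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # Airflow 2.4+; older versions use schedule_interval
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_check = PythonOperator(task_id="quality_check", python_callable=quality_check)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # The quality gate sits between extract and transform, mirroring the diagram.
    t_extract >> t_check >> t_transform >> t_load
```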
Data Source Types
| Source | Extraction Method | Challenges |
|---|---|---|
| Database | CDC (Change Data Capture), bulk query, replication | Schema changes, performance impact on source |
| API | REST/GraphQL polling, webhooks | Rate limits, pagination, auth token refresh |
| Files | S3/GCS pickup, SFTP, email attachment | Format inconsistency, encoding issues |
| Streaming | Kafka, Kinesis, Pub/Sub | Ordering, exactly-once processing |
| SaaS tools | Pre-built connectors (Fivetran, Airbyte) | API changes, data model complexity |
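To illustrate the API row's challenges (pagination, rate limits), a hedged extraction sketch; the endpoint, the `next_cursor` field, and the response shape are hypothetical.

```python
import time

import requests

def extract_api(base_url: str, token: str) -> list[dict]:
    rows, cursor = [], None
    while True:
        resp = requests.get(
            f"{base_url}/v1/records",  # hypothetical endpoint
            headers={"Authorization": f"Bearer {token}"},
            params={"cursor": cursor} if cursor else {},
            timeout=30,
        )
        if resp.status_code == 429:  # rate limited: back off, then retry
            time.sleep(int(resp.headers.get("Retry-After", "5")))
            continue
        resp.raise_for_status()
        payload = resp.json()
        rows.extend(payload["data"])
        cursor = payload.get("next_cursor")  # hypothetical pagination field
        if not cursor:  # last page reached
            return rows
```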
Orchestration Tools
| Tool | Type | Best For | Complexity |
|---|---|---|---|
| Airflow | Python DAGs | Complex pipelines, team of engineers | High |
| Prefect | Python, modern API | Simpler than Airflow, good DX | Medium |
| dbt | SQL transforms only | Transform layer in ELT | Low-Medium |
| Cron | Simple scheduling | Single script, low complexity | Low |
| Fivetran/Airbyte | Managed connectors | Extract + Load (no transform) | Low |
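For contrast with the Airflow sketch above, the same extract → transform → load shape as a Prefect flow, where plain decorated functions become tasks; the task bodies here are stubs for illustration.

```python
from prefect import flow, task

@task(retries=2)  # retry a flaky extraction twice before failing the run
def extract() -> list[dict]:
    return [{"order_id": 1, "amount": "42.0"}]  # stub: pull from the real source

@task
def transform(rows: list[dict]) -> list[dict]:
    return [{**r, "amount": float(r["amount"])} for r in rows]  # stub cleanup

@task
def load(rows: list[dict]) -> None:
    print(f"would load {len(rows)} rows")  # stub: write to the warehouse

@flow(log_prints=True)
def orders_pipeline():
    load(transform(extract()))

if __name__ == "__main__":
    orders_pipeline()
```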
Data Quality Framework
| Check | What It Validates | When |
|---|---|---|
| Row count | Expected number of rows (within ±10% of prior run) | After extract, after load |
| Null check | Critical columns have no unexpected nulls | After extract |
| Schema validation | Column names, types match expected | After extract |
| Freshness | Data is recent (not stale) | After load |
| Uniqueness | No duplicate primary keys | After load |
| Range check | Values within expected bounds | After transform |
| Referential integrity | Foreign keys match parent tables | After load |
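The post-load rows of this table translate naturally into SQL assertions. A sketch follows, with placeholder table names and roughly Postgres/Snowflake syntax; each query counts violations, so any non-zero result fails the run.

```python
# Each check counts violations; a non-zero result fails the pipeline.
POST_LOAD_CHECKS = {
    "uniqueness": """
        SELECT COUNT(*) FROM (
            SELECT order_id FROM analytics.orders
            GROUP BY order_id HAVING COUNT(*) > 1
        ) AS dupes
    """,
    "referential_integrity": """
        SELECT COUNT(*) FROM analytics.orders o
        LEFT JOIN analytics.customers c ON o.customer_id = c.customer_id
        WHERE c.customer_id IS NULL
    """,
    "freshness": """
        SELECT CASE WHEN MAX(loaded_at) < CURRENT_TIMESTAMP - INTERVAL '6 hours'
               THEN 1 ELSE 0 END
        FROM analytics.orders
    """,
}

def run_post_load_checks(execute) -> None:
    # `execute` is any callable that runs a query and returns a scalar.
    for name, sql in POST_LOAD_CHECKS.items():
        violations = execute(sql)
        if violations:
            raise RuntimeError(f"Check {name!r} failed: {violations} violation(s)")
```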
Pipeline Design Steps
- Map sources and destinations: What data, from where, to where?
- Define freshness requirements: Real-time? Hourly? Daily?
- Choose architecture: ETL or ELT based on tools and team
- Build incrementally: Start with one source, one destination, one schedule
- Add quality checks: At minimum: row count + null check + freshness
- Set up monitoring: Alert on failure, quality check violations, latency (a failure-callback sketch follows this list)
- Document: Data dictionary, pipeline diagram, SLAs
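For step 6, one common pattern is a failure callback that posts to a chat webhook. A sketch for Airflow; the webhook URL is a placeholder.

```python
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."  # placeholder

def alert_on_failure(context):
    # Airflow passes task-instance details into failure callbacks.
    ti = context["task_instance"]
    requests.post(
        SLACK_WEBHOOK_URL,
        json={"text": f":red_circle: {ti.dag_id}.{ti.task_id} failed "
                      f"(run {context['run_id']}): {context.get('exception')}"},
        timeout=10,
    )

# Wire it into a DAG so every task inherits it:
# DAG(..., default_args={"on_failure_callback": alert_on_failure})
```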
Output Format
```markdown
Data Pipeline Design: {Project}
Sources & Destinations
| Source | Type | Destination | Freshness | Volume |
|---|---|---|---|---|
| {source} | DB/API/File | {dest} | {daily/hourly} | {rows/day} |
Architecture
- Pattern: ETL / ELT
- Orchestrator: {tool}
- Transform: {tool/SQL}
- Quality: {tool/custom checks}
Pipeline Diagram
{Source} → {Extract} → {Stage} → {Transform} → {Load} → {Serve}
Quality Checks
| Stage | Check | Threshold | Alert |
|---|---|---|---|
| Extract | Row count | ±10% of prior | Slack alert |
| Load | Freshness | < 6 hours old | PagerDuty |
Schedule
| Pipeline | Frequency | Start Time | SLA |
|---|---|---|---|
| {name} | {daily/hourly} | {time} | Data ready by {time} |
```
Gotchas
- Idempotency is essential: A pipeline that runs twice should produce the same result as running once. Use upsert (not insert) and date-partitioned loads (see the MERGE sketch after this list).
- Schema drift: Source systems change schemas without warning. Build schema detection and alerting.
- Backfill capability: When a pipeline fails for 3 days, can you rerun for those days without duplicating data? Design for this from day 1.
- Don't build what you can buy: Fivetran/Airbyte handle 200+ source connectors. Writing a custom Salesforce extractor is rarely worth the engineering time.
- Data warehouse vs data lake: Warehouse (BigQuery, Snowflake) = structured, SQL-queryable. Lake (S3, GCS) = raw, any format. Most modern stacks use both (lakehouse pattern).
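To ground the idempotency and backfill bullets, a sketch of a partition-scoped MERGE (upsert): re-running any single day converges to the same state instead of duplicating rows. Table and column names are placeholders and the syntax is BigQuery/Snowflake-style.

```python
def upsert_partition(execute, run_date: str) -> None:
    # Re-running the same run_date (e.g. during a 3-day backfill) overwrites
    # that day's rows rather than inserting duplicates.
    execute(f"""
        MERGE analytics.orders AS t
        USING (SELECT * FROM staging.orders WHERE order_date = '{run_date}') AS s
        ON t.order_id = s.order_id
        WHEN MATCHED THEN UPDATE SET
            amount = s.amount, updated_at = s.updated_at
        WHEN NOT MATCHED THEN INSERT (order_id, order_date, amount, updated_at)
            VALUES (s.order_id, s.order_date, s.amount, s.updated_at)
    """)  # sketch only: in real code, bind run_date as a query parameter
```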
References
- For dbt project structure, see references/dbt-guide.md
- For data warehouse modeling (star schema), see references/dimensional-modeling.md