data-lake-platform

Data Lake Platform

Build and operate production data lakes and lakehouses: ingest, transform, store in open formats, and serve analytics reliably.

When to Use

  • Design data lake/lakehouse architecture
  • Set up ingestion pipelines (batch, incremental, CDC)
  • Build SQL transformation layers (SQLMesh, dbt)
  • Choose table formats and catalogs (Iceberg, Delta, Hudi)
  • Deploy query/serving engines (Trino, ClickHouse, DuckDB)
  • Implement streaming pipelines (Kafka, Flink)
  • Set up orchestration (Dagster, Airflow, Prefect)
  • Add governance, lineage, data quality, and cost controls

Triage Questions

  1. Batch, streaming, or hybrid? What is the freshness SLO?
  2. Append-only vs upserts/deletes (CDC)? Is time travel required?
  3. Primary query pattern: BI dashboards (high concurrency), ad-hoc joins, embedded analytics?
  4. PII/compliance: row/column-level access, retention, audit logging?
  5. Platform constraints: self-hosted vs cloud, preferred engines, team strengths?

Default Baseline (Good Starting Point)

  • Storage: object storage + open table format (usually Iceberg)
  • Catalog: REST/Hive/Glue/Nessie/Unity (match your platform)
  • Transforms: SQLMesh or dbt (pick one and standardize)
  • Lake query: Trino (or Spark for heavy compute/ML workloads)
  • Serving (optional): ClickHouse/StarRocks/Doris for low-latency BI
  • Governance: DataHub/OpenMetadata + OpenLineage
  • Orchestration: Dagster/Airflow/Prefect

Workflow

  1. Pick table format + catalog: references/storage-formats.md (use assets/cross-platform/template-schema-evolution.md and assets/cross-platform/template-partitioning-strategy.md)
  2. Design ingestion (batch/incremental/CDC): references/ingestion-patterns.md (use assets/cross-platform/template-ingestion-governance-checklist.md and assets/cross-platform/template-incremental-loading.md)
  3. Design transformations (bronze/silver/gold or data products): references/transformation-patterns.md (use assets/cross-platform/template-data-pipeline.md)
  4. Choose lake query vs serving engines: references/query-engine-patterns.md
  5. Add governance, lineage, and quality gates: references/governance-catalog.md (use assets/cross-platform/template-data-quality-governance.md and assets/cross-platform/template-data-quality.md)
  6. Plan operations + cost controls: references/operational-playbook.md and references/cost-optimization.md (use assets/cross-platform/template-data-quality-backfill-runbook.md and assets/cross-platform/template-cost-optimization.md)
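
The incremental part of step 2 is often built around a high-watermark cursor. The following is a minimal sketch of that idea, not the approach prescribed by references/ingestion-patterns.md. The `Row` type, the `updated_at` column, and the `extract_incremental` helper are hypothetical names chosen for illustration, and the in-memory list stands in for a real source.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Row:
    """Hypothetical source record with a change timestamp."""
    id: int
    updated_at: str  # ISO-8601 in one fixed format, so string order matches time order


def extract_incremental(
    source: Iterable[Row], watermark: Optional[str]
) -> Tuple[List[Row], Optional[str]]:
    """Return rows changed after the watermark, plus the advanced watermark."""
    new_rows = [r for r in source if watermark is None or r.updated_at > watermark]
    if not new_rows:
        # Nothing new: keep the old watermark so the next run starts from the same point.
        return [], watermark
    return new_rows, max(r.updated_at for r in new_rows)


source = [
    Row(1, "2024-01-01T00:00:00Z"),
    Row(2, "2024-01-02T00:00:00Z"),
    Row(3, "2024-01-03T00:00:00Z"),
]

first_batch, wm = extract_incremental(source, None)   # initial load: every row
second_batch, wm2 = extract_incremental(source, wm)   # no changes since: empty batch
```

A strict `>` comparison can miss rows that arrive late with a timestamp equal to the stored watermark. A common mitigation is to re-read a small overlap window and rely on an idempotent merge downstream.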

Architecture Patterns

  • Medallion (bronze/silver/gold): references/architecture-patterns.md
  • Data mesh (domain-owned data products): references/architecture-patterns.md
  • Streaming-first (Kappa): references/streaming-patterns.md
  • Diagrams/Mermaid snippets: references/overview.md
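
To make the medallion pattern concrete, here is a toy sketch of the three tiers using plain Python data. It assumes a hypothetical orders feed; the field names and the `to_silver` and `to_gold` helpers are illustrative only, and a real lakehouse would run the same steps as SQL models over tables.

```python
from collections import defaultdict

# Bronze: raw events stored exactly as received (hypothetical sample data).
bronze = [
    {"order_id": "1", "customer": "a", "amount": "10.50"},
    {"order_id": "1", "customer": "a", "amount": "10.50"},         # duplicate delivery
    {"order_id": "2", "customer": "b", "amount": "not-a-number"},  # malformed value
    {"order_id": "3", "customer": "a", "amount": "4.50"},
]


def to_silver(rows):
    """Cast types, skip malformed rows, and de-duplicate on order_id."""
    seen, silver = set(), []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skipped here; see the note below about quarantining
        if row["order_id"] in seen:
            continue
        seen.add(row["order_id"])
        silver.append(
            {"order_id": int(row["order_id"]), "customer": row["customer"], "amount": amount}
        )
    return silver


def to_gold(rows):
    """Aggregate cleaned rows into a per-customer revenue total."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["customer"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
```

The sketch silently skips the malformed row to stay short. In practice such rows are usually routed to a quarantine table so that the silver tier's quality gate can report them.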

Quick Start

dlt + ClickHouse

```bash
pip install "dlt[clickhouse]"
dlt init rest_api clickhouse
# Add ClickHouse credentials to .dlt/secrets.toml before running.
# dlt init names the generated script after the source; adjust if you renamed it.
python rest_api_pipeline.py
```

SQLMesh + DuckDB

```bash
pip install sqlmesh
sqlmesh init duckdb
sqlmesh plan && sqlmesh run
```

Reliability and Safety

Do

  • Define data contracts and owners up front
  • Add quality gates (freshness, volume, schema, distribution) per tier
  • Make every pipeline idempotent and re-runnable (backfills are normal)
  • Treat access control and audit logging as first-class requirements
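
A quality gate from the list above can be sketched as a function that returns a list of failures. This is a minimal illustration, assuming rows are dictionaries with a hypothetical `loaded_at` timestamp column. The `check_quality` name and its thresholds are invented for the example, and the distribution check is left out for brevity.

```python
from datetime import datetime, timedelta, timezone


def check_quality(rows, expected_columns, min_rows, max_age, now, ts_column="loaded_at"):
    """Return human-readable failures; an empty list means the gate passes."""
    failures = []

    # Volume: the batch must contain at least min_rows rows.
    if len(rows) < min_rows:
        failures.append(f"volume: {len(rows)} rows < minimum {min_rows}")

    # Schema: every row must have exactly the expected columns (first mismatch is reported).
    for i, row in enumerate(rows):
        if set(row) != set(expected_columns):
            failures.append(f"schema: row {i} has columns {sorted(row)}")
            break

    # Freshness: the newest row must be no older than max_age.
    timestamps = [row[ts_column] for row in rows if ts_column in row]
    if timestamps and now - max(timestamps) > max_age:
        failures.append("freshness: newest row is older than the allowed age")

    return failures


now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
fresh = [
    {"id": 1, "loaded_at": now - timedelta(minutes=5)},
    {"id": 2, "loaded_at": now - timedelta(minutes=10)},
]
stale = [{"id": 1, "loaded_at": now - timedelta(hours=3)}]

ok = check_quality(fresh, {"id", "loaded_at"}, 2, timedelta(hours=1), now)
bad = check_quality(stale, {"id", "loaded_at"}, 2, timedelta(hours=1), now)
```

Returning failures instead of raising lets the orchestrator decide per tier whether a failed check blocks promotion or only raises an alert.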

Avoid

  • Skipping validation to "move fast"
  • Storing PII without access controls
  • Pipelines that can't be re-run safely
  • Manual schema changes without version control
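
One way to make a load safe to re-run is to key writes on a primary key and upsert, so replaying a batch leaves the table unchanged. The sketch below uses the standard-library `sqlite3` module as a stand-in for a lake table. The `orders` table and `load_batch` helper are hypothetical, and the `ON CONFLICT` clause assumes SQLite 3.24 or newer.

```python
import sqlite3


def load_batch(conn, rows):
    """Upsert rows keyed on id so that replaying the same batch changes nothing."""
    with conn:  # commits on success, rolls back if any statement fails
        conn.executemany(
            """
            INSERT INTO orders (id, amount) VALUES (?, ?)
            ON CONFLICT(id) DO UPDATE SET amount = excluded.amount
            """,
            rows,
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL NOT NULL)")

batch = [(1, 10.0), (2, 20.0)]
load_batch(conn, batch)
load_batch(conn, batch)        # re-run or backfill of the same batch
load_batch(conn, [(2, 25.0)])  # late correction for an existing key

result = conn.execute("SELECT id, amount FROM orders ORDER BY id").fetchall()
```

On open table formats the same effect is usually achieved with a `MERGE` statement or a partition overwrite rather than a row-level upsert.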

Resources

| Resource | Purpose |
| --- | --- |
| references/overview.md | Diagrams and decision flows |
| references/architecture-patterns.md | Medallion, data mesh |
| references/ingestion-patterns.md | dlt vs Airbyte, CDC |
| references/transformation-patterns.md | SQLMesh vs dbt |
| references/storage-formats.md | Iceberg vs Delta |
| references/query-engine-patterns.md | ClickHouse, DuckDB |
| references/streaming-patterns.md | Kafka, Flink |
| references/orchestration-patterns.md | Dagster, Airflow |
| references/bi-visualization-patterns.md | Metabase, Superset |
| references/cost-optimization.md | Cost levers and maintenance |
| references/operational-playbook.md | Monitoring and incident response |
| references/governance-catalog.md | Catalog, lineage, access control |

Templates

| Template | Purpose |
| --- | --- |
| assets/cross-platform/template-medallion-architecture.md | Baseline bronze/silver/gold plan |
| assets/cross-platform/template-data-pipeline.md | End-to-end pipeline skeleton |
| assets/cross-platform/template-ingestion-governance-checklist.md | Source onboarding checklist |
| assets/cross-platform/template-incremental-loading.md | Incremental + backfill plan |
| assets/cross-platform/template-schema-evolution.md | Schema change rules |
| assets/cross-platform/template-cost-optimization.md | Cost control checklist |
| assets/cross-platform/template-data-quality-governance.md | Quality contracts + SLOs |
| assets/cross-platform/template-data-quality-backfill-runbook.md | Backfill incident/runbook |

Related Skills

| Skill | Purpose |
| --- | --- |
| ai-mlops | ML deployment |
| ai-ml-data-science | Feature engineering |
| data-sql-optimization | OLTP optimization |