data-lake-platform

Data Lake Platform

Build and operate production data lakes and lakehouses: ingest, transform, store in open formats, and serve analytics reliably.

When to Use

Design data lake/lakehouse architecture
Set up ingestion pipelines (batch, incremental, CDC)
Build SQL transformation layers (SQLMesh, dbt)
Choose table formats and catalogs (Iceberg, Delta, Hudi)
Deploy query/serving engines (Trino, ClickHouse, DuckDB)
Implement streaming pipelines (Kafka, Flink)
Set up orchestration (Dagster, Airflow, Prefect)
Add governance, lineage, data quality, and cost controls

Triage Questions

Batch, streaming, or hybrid? What is the freshness SLO?
Append-only vs upserts/deletes (CDC)? Is time travel required?
Primary query pattern: BI dashboards (high concurrency), ad-hoc joins, embedded analytics?
PII/compliance: row/column-level access, retention, audit logging?
Platform constraints: self-hosted vs cloud, preferred engines, team strengths?
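
The freshness question can be checked with simple arithmetic before committing to streaming. The sketch below assumes a scheduled batch pipeline whose worst-case data age is roughly one schedule interval plus one run duration; the function names and the example numbers are hypothetical, not taken from the referenced material.

```python
from datetime import timedelta


def worst_case_staleness(schedule_interval: timedelta, run_duration: timedelta) -> timedelta:
    """Approximate upper bound on data age for a scheduled batch pipeline.

    A record that arrives just after a run starts extracting waits one full
    interval for the next run, plus that run's duration, before it is queryable.
    """
    return schedule_interval + run_duration


def batch_meets_slo(schedule_interval: timedelta, run_duration: timedelta,
                    freshness_slo: timedelta) -> bool:
    """True if plain batch scheduling can satisfy the freshness SLO."""
    return worst_case_staleness(schedule_interval, run_duration) <= freshness_slo


# Hourly batch with a 10-minute run, checked against two candidate SLOs.
relaxed = batch_meets_slo(timedelta(hours=1), timedelta(minutes=10), timedelta(hours=2))
tight = batch_meets_slo(timedelta(hours=1), timedelta(minutes=10), timedelta(minutes=15))
print(relaxed, tight)  # True False
```

When the check fails even at the shortest practical schedule, that is a signal to consider micro-batch, CDC, or streaming ingestion for that dataset.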

Default Baseline (Good Starting Point)

Storage: object storage + open table format (usually Iceberg)
Catalog: REST/Hive/Glue/Nessie/Unity (match your platform)
Transforms: SQLMesh or dbt (pick one and standardize)
Lake query: Trino (or Spark for heavy compute/ML workloads)
Serving (optional): ClickHouse/StarRocks/Doris for low-latency BI
Governance: DataHub/OpenMetadata + OpenLineage
Orchestration: Dagster/Airflow/Prefect

Workflow

Pick table format + catalog: references/storage-formats.md (use assets/cross-platform/template-schema-evolution.md and assets/cross-platform/template-partitioning-strategy.md)

Design ingestion (batch/incremental/CDC): references/ingestion-patterns.md (use assets/cross-platform/template-ingestion-governance-checklist.md and assets/cross-platform/template-incremental-loading.md)

Design transformations (bronze/silver/gold or data products): references/transformation-patterns.md (use assets/cross-platform/template-data-pipeline.md)

Choose lake query vs serving engines: references/query-engine-patterns.md

Add governance, lineage, and quality gates: references/governance-catalog.md (use assets/cross-platform/template-data-quality-governance.md and assets/cross-platform/template-data-quality.md)

Plan operations + cost controls: references/operational-playbook.md and references/cost-optimization.md (use assets/cross-platform/template-data-quality-backfill-runbook.md and assets/cross-platform/template-cost-optimization.md)
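
As an illustration of the incremental option in the ingestion step above, here is a minimal high-watermark sketch in plain Python. It is one common approach, not necessarily what the referenced templates prescribe; the field names `updated_at` and `id`, the in-memory `target` dict, and the `state` dict are all assumptions made for the example.

```python
def incremental_load(source_rows, target, state, cursor_field="updated_at", key_field="id"):
    """Load only rows newer than the stored watermark, upserting by key.

    `target` stands in for a table keyed by primary key; `state` stands in
    for whatever durable store holds the last successful watermark.
    """
    watermark = state.get("watermark")
    new_rows = [r for r in source_rows if watermark is None or r[cursor_field] > watermark]
    for row in new_rows:
        target[row[key_field]] = row  # upsert: a changed row replaces the old version
    if new_rows:
        state["watermark"] = max(r[cursor_field] for r in new_rows)
    return len(new_rows)


source = [
    {"id": 1, "updated_at": "2024-01-01T00:00:00", "status": "new"},
    {"id": 2, "updated_at": "2024-01-02T00:00:00", "status": "new"},
]
target, state = {}, {}

first = incremental_load(source, target, state)   # initial load picks up both rows
source.append({"id": 1, "updated_at": "2024-01-03T00:00:00", "status": "paid"})
second = incremental_load(source, target, state)  # only the changed row is loaded
third = incremental_load(source, target, state)   # nothing new, so nothing is loaded
print(first, second, third)  # 2 1 0
```

The strict `>` comparison can miss rows that share the watermark timestamp and arrive late; real pipelines often re-read a small overlap window and rely on the upsert to absorb the repeats.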

Architecture Patterns

Medallion (bronze/silver/gold): references/architecture-patterns.md
Data mesh (domain-owned data products): references/architecture-patterns.md
Streaming-first (Kappa): references/streaming-patterns.md
Diagrams/mermaid snippets: references/overview.md
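
To make the medallion tiers concrete, the sketch below walks a few records through bronze, silver, and gold in plain Python. It uses the common reading of the pattern (bronze is raw, silver is cleaned and deduplicated, gold is aggregated for consumption); the record shape and function names are hypothetical, and a real lake would express these steps as SQLMesh or dbt models over table-format storage.

```python
from collections import defaultdict

# Bronze: raw events as ingested, duplicates and bad rows included.
bronze = [
    {"order_id": "1", "country": "us", "amount": "10.50"},
    {"order_id": "1", "country": "us", "amount": "10.50"},         # duplicate delivery
    {"order_id": "2", "country": "DE", "amount": "7.00"},
    {"order_id": "3", "country": "de", "amount": "not-a-number"},  # invalid amount
]


def to_silver(rows):
    """Clean, type, and deduplicate bronze rows by order_id."""
    seen, silver = set(), []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # a real pipeline would quarantine rejects rather than drop them
        if row["order_id"] in seen:
            continue
        seen.add(row["order_id"])
        silver.append({
            "order_id": int(row["order_id"]),
            "country": row["country"].upper(),
            "amount": amount,
        })
    return silver


def to_gold(rows):
    """Aggregate silver rows into revenue per country."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["country"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)  # {'US': 10.5, 'DE': 7.0}
```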

Quick Start

dlt + ClickHouse

bash

pip install "dlt[clickhouse]"
dlt init rest_api clickhouse
python pipeline.py

SQLMesh + DuckDB

bash

pip install sqlmesh
sqlmesh init duckdb
sqlmesh plan && sqlmesh run

Reliability and Safety

Do

Define data contracts and owners up front
Add quality gates (freshness, volume, schema, distribution) per tier
Make every pipeline idempotent and re-runnable (backfills are normal)
Treat access control and audit logging as first-class requirements
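
The four quality gates named above (freshness, volume, schema, distribution) can be sketched as one function that returns the list of failed gates. This is an illustrative shape only: the thresholds, the `loaded_at` and `amount` column names, and the use of a mean-range check for distribution are assumptions, and production setups would usually lean on the audit or test features of the chosen transformation tool.

```python
from datetime import datetime, timedelta, timezone


def run_quality_gates(rows, now, expected_columns, max_age, min_rows, amount_mean_range):
    """Return the names of failed gates; an empty list means the batch passes."""
    failures = []
    # Schema: every row carries exactly the expected columns.
    if any(set(r) != expected_columns for r in rows):
        failures.append("schema")
    # Volume: the batch is not suspiciously small.
    if len(rows) < min_rows:
        failures.append("volume")
    # Freshness: the newest record is recent enough.
    timestamps = [r["loaded_at"] for r in rows if "loaded_at" in r]
    if not timestamps or now - max(timestamps) > max_age:
        failures.append("freshness")
    # Distribution: the mean amount stays inside an expected band.
    amounts = [r["amount"] for r in rows if "amount" in r]
    low, high = amount_mean_range
    if not amounts or not (low <= sum(amounts) / len(amounts) <= high):
        failures.append("distribution")
    return failures


now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
gates = dict(
    expected_columns={"id", "amount", "loaded_at"},
    max_age=timedelta(hours=6),
    min_rows=2,
    amount_mean_range=(5.0, 50.0),
)
good = [
    {"id": 1, "amount": 10.0, "loaded_at": now - timedelta(hours=1)},
    {"id": 2, "amount": 12.0, "loaded_at": now - timedelta(minutes=30)},
]
# Same rows, but two days old and with amounts inflated 100x.
bad = [{**r, "loaded_at": now - timedelta(days=2), "amount": r["amount"] * 100} for r in good]

passed = run_quality_gates(good, now, **gates)
failed = run_quality_gates(bad, now, **gates)
print(passed, failed)  # [] ['freshness', 'distribution']
```

Running the same function with tier-specific thresholds is one way to apply gates per tier, as the list recommends.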

Avoid

Skipping validation to "move fast"
Storing PII without access controls
Pipelines that can't be re-run safely
Manual schema changes without version control
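
For contrast with a pipeline that cannot be re-run safely, here is a minimal sketch of an idempotent load that replaces one partition inside a single transaction. It uses the standard-library `sqlite3` module purely as a stand-in for the lake's table format; the `events` table, its columns, and the `load_partition` helper are hypothetical.

```python
import sqlite3


def load_partition(conn, day, rows):
    """Replace one day's partition so re-runs and backfills never duplicate data."""
    with conn:  # single transaction: the delete and insert commit together or not at all
        conn.execute("DELETE FROM events WHERE day = ?", (day,))
        conn.executemany(
            "INSERT INTO events (day, user_id, amount) VALUES (?, ?, ?)",
            [(day, user_id, amount) for user_id, amount in rows],
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (day TEXT, user_id INTEGER, amount REAL)")

batch = [(1, 10.0), (2, 5.0)]
load_partition(conn, "2024-01-01", batch)
load_partition(conn, "2024-01-01", batch)  # re-run of the same day: state is unchanged

count, total = conn.execute("SELECT COUNT(*), SUM(amount) FROM events").fetchone()
print(count, total)  # 2 15.0
```

An append-only `INSERT` without the delete would have produced four rows after the second run, which is the failure mode the list warns against.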

Resources

Resource	Purpose
references/overview.md	Diagrams and decision flows
references/architecture-patterns.md	Medallion, data mesh
references/ingestion-patterns.md	dlt vs Airbyte, CDC
references/transformation-patterns.md	SQLMesh vs dbt
references/storage-formats.md	Iceberg vs Delta
references/query-engine-patterns.md	ClickHouse, DuckDB
references/streaming-patterns.md	Kafka, Flink
references/orchestration-patterns.md	Dagster, Airflow
references/bi-visualization-patterns.md	Metabase, Superset
references/cost-optimization.md	Cost levers and maintenance
references/operational-playbook.md	Monitoring and incident response
references/governance-catalog.md	Catalog, lineage, access control

Templates

Template	Purpose
assets/cross-platform/template-medallion-architecture.md	Baseline bronze/silver/gold plan
assets/cross-platform/template-data-pipeline.md	End-to-end pipeline skeleton
assets/cross-platform/template-ingestion-governance-checklist.md	Source onboarding checklist
assets/cross-platform/template-incremental-loading.md	Incremental + backfill plan
assets/cross-platform/template-schema-evolution.md	Schema change rules
assets/cross-platform/template-cost-optimization.md	Cost control checklist
assets/cross-platform/template-data-quality-governance.md	Quality contracts + SLOs
assets/cross-platform/template-data-quality-backfill-runbook.md	Backfill incident/runbook

Related Skills

Skill	Purpose
ai-mlops	ML deployment
ai-ml-data-science	Feature engineering
data-sql-optimization	OLTP optimization
