data-lake-platform
Data Lake Platform
Build and operate production data lakes and lakehouses: ingest, transform, store in open formats, and serve analytics reliably.
When to Use
- Design data lake/lakehouse architecture
- Set up ingestion pipelines (batch, incremental, CDC)
- Build SQL transformation layers (SQLMesh, dbt)
- Choose table formats and catalogs (Iceberg, Delta, Hudi)
- Deploy query/serving engines (Trino, ClickHouse, DuckDB)
- Implement streaming pipelines (Kafka, Flink)
- Set up orchestration (Dagster, Airflow, Prefect)
- Add governance, lineage, data quality, and cost controls
Triage Questions
- Batch, streaming, or hybrid? What is the freshness SLO?
- Append-only vs upserts/deletes (CDC)? Is time travel required?
- Primary query pattern: BI dashboards (high concurrency), ad-hoc joins, embedded analytics?
- PII/compliance: row/column-level access, retention, audit logging?
- Platform constraints: self-hosted vs cloud, preferred engines, team strengths?
Default Baseline (Good Starting Point)
- Storage: object storage + open table format (usually Iceberg)
- Catalog: REST/Hive/Glue/Nessie/Unity (match your platform)
- Transforms: SQLMesh or dbt (pick one and standardize)
- Lake query: Trino (or Spark for heavy compute/ML workloads)
- Serving (optional): ClickHouse/StarRocks/Doris for low-latency BI
- Governance: DataHub/OpenMetadata + OpenLineage
- Orchestration: Dagster/Airflow/Prefect
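As a rough illustration of the storage layer in the baseline above, raw/bronze landing zones commonly use hive-style `dt=` partition paths on object storage. This is a minimal sketch; the bucket and table names are hypothetical, and note that Iceberg itself manages file layout through table metadata, so explicit paths like this matter mostly for pre-table landing data:

```python
from datetime import date

def partition_path(table: str, dt: date, *, bucket: str = "s3://lake") -> str:
    """Build a hive-style daily partition path (dt=YYYY-MM-DD) under object storage.

    Bucket and layout are illustrative. Iceberg/Delta tables track files in
    metadata, so paths like this are mainly for raw landing zones.
    """
    return f"{bucket}/{table}/dt={dt.isoformat()}/"

print(partition_path("events", date(2024, 1, 15)))
# s3://lake/events/dt=2024-01-15/
```

Date-based paths like this keep lifecycle rules (retention, tiering) and partition pruning simple even before data reaches a governed table format.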
Workflow
- Pick table format + catalog: references/storage-formats.md (use assets/cross-platform/template-schema-evolution.md and assets/cross-platform/template-partitioning-strategy.md)
- Design ingestion (batch/incremental/CDC): references/ingestion-patterns.md (use assets/cross-platform/template-ingestion-governance-checklist.md and assets/cross-platform/template-incremental-loading.md)
- Design transformations (bronze/silver/gold or data products): references/transformation-patterns.md (use assets/cross-platform/template-data-pipeline.md)
- Choose lake query vs serving engines: references/query-engine-patterns.md
- Add governance, lineage, and quality gates: references/governance-catalog.md (use assets/cross-platform/template-data-quality-governance.md and assets/cross-platform/template-data-quality.md)
- Plan operations + cost controls: references/operational-playbook.md and references/cost-optimization.md (use assets/cross-platform/template-cost-optimization.md and assets/cross-platform/template-data-quality-backfill-runbook.md)
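To make the ingestion-design step concrete, here is a minimal sketch of watermark-based incremental loading, the core idea behind the incremental-loading template. The in-memory `SOURCE` rows and `state` dict are stand-ins for a real database/API extract and persisted pipeline state:

```python
# Hypothetical source rows; in practice these come from a DB or API extract.
SOURCE = [
    {"id": 1, "updated_at": "2024-01-01T00:00:00"},
    {"id": 2, "updated_at": "2024-01-02T00:00:00"},
    {"id": 3, "updated_at": "2024-01-03T00:00:00"},
]

def incremental_load(state: dict) -> list[dict]:
    """Load only rows newer than the stored high watermark, then advance it.

    Re-running against unchanged source data loads nothing, which is the
    property that makes retries and scheduled re-runs safe.
    """
    watermark = state.get("watermark", "")
    new_rows = [r for r in SOURCE if r["updated_at"] > watermark]
    if new_rows:
        state["watermark"] = max(r["updated_at"] for r in new_rows)
    return new_rows

state: dict = {}
first = incremental_load(state)   # initial run: all rows
second = incremental_load(state)  # re-run: nothing new
print(len(first), len(second), state["watermark"])
# 3 0 2024-01-03T00:00:00
```

Tools like dlt implement this pattern for you (cursor columns plus managed state); the sketch only shows why a persisted watermark is what separates incremental loads from repeated full reloads.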
Architecture Patterns
- Medallion (bronze/silver/gold): references/architecture-patterns.md
- Data mesh (domain-owned data products): references/architecture-patterns.md
- Streaming-first (Kappa): references/streaming-patterns.md
- Diagrams/mermaid snippets: references/overview.md
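A minimal sketch of the medallion flow above, with plain Python standing in for real bronze/silver/gold tables and SQL transforms (the records and rules are illustrative only):

```python
# Bronze: raw events exactly as ingested (duplicates and bad rows included).
bronze = [
    {"user": "a", "amount": "10"},
    {"user": "a", "amount": "10"},    # duplicate
    {"user": "b", "amount": "oops"},  # malformed
    {"user": "b", "amount": "5"},
]

def to_silver(rows: list[dict]) -> list[dict]:
    """Silver: deduplicate and enforce types; drop rows that fail parsing."""
    seen, out = set(), []
    for r in rows:
        key = (r["user"], r["amount"])
        if key in seen:
            continue
        seen.add(key)
        try:
            out.append({"user": r["user"], "amount": float(r["amount"])})
        except ValueError:
            pass  # a real pipeline would route these to a quarantine table
    return out

def to_gold(rows: list[dict]) -> dict:
    """Gold: business-level aggregate (total spend per user)."""
    totals: dict = {}
    for r in rows:
        totals[r["user"]] = totals.get(r["user"], 0.0) + r["amount"]
    return totals

silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)  # {'a': 10.0, 'b': 5.0}
```

The point of the tiering is that each layer has a contract: bronze preserves raw history, silver is clean and typed, gold is shaped for consumers, so quality gates and ownership can be attached per tier.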
Quick Start
dlt + ClickHouse

```bash
pip install "dlt[clickhouse]"
dlt init rest_api clickhouse
python pipeline.py
```

SQLMesh + DuckDB

```bash
pip install sqlmesh
sqlmesh init duckdb
sqlmesh plan && sqlmesh run
```

Reliability and Safety
Do
- Define data contracts and owners up front
- Add quality gates (freshness, volume, schema, distribution) per tier
- Make every pipeline idempotent and re-runnable (backfills are normal)
- Treat access control and audit logging as first-class requirements
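The per-tier quality gates recommended above can be sketched as a simple per-batch check. This is a toy illustration, not any specific library's API; thresholds, column names, and the failure format are all assumptions to tune per data contract:

```python
from datetime import datetime, timedelta, timezone

def quality_gate(batch: list[dict], *, last_loaded: datetime,
                 freshness_slo: timedelta, min_rows: int,
                 required_cols: set[str]) -> list[str]:
    """Return a list of failed checks; an empty list means the gate passes."""
    failures = []
    # Freshness: has the table been loaded within its SLO window?
    if datetime.now(timezone.utc) - last_loaded > freshness_slo:
        failures.append("freshness: data older than SLO")
    # Volume: anomalously small batches often signal upstream breakage.
    if len(batch) < min_rows:
        failures.append(f"volume: {len(batch)} rows < {min_rows}")
    # Schema: every row must carry the contracted columns.
    for row in batch:
        missing = required_cols - row.keys()
        if missing:
            failures.append(f"schema: missing {sorted(missing)}")
            break
    return failures

batch = [{"id": 1, "ts": "2024-01-01"}]
print(quality_gate(batch,
                   last_loaded=datetime.now(timezone.utc),
                   freshness_slo=timedelta(hours=1),
                   min_rows=1,
                   required_cols={"id", "ts"}))
# []
```

In practice a gate like this runs between tiers (e.g. before promoting silver to gold) and a non-empty failure list blocks promotion and pages the owner.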
Avoid
- Skipping validation to "move fast"
- Storing PII without access controls
- Pipelines that can't be re-run safely
- Manual schema changes without version control
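One common way to avoid pipelines that can't be re-run safely is partition overwrite rather than append. A toy in-memory sketch of the idea; real lakes get the same effect from the table format's overwrite/replace-partition operation:

```python
# A tiny in-memory "table" keyed by partition. Iceberg/Delta provide
# transactional partition overwrite for the same effect at scale.
table: dict[str, list[dict]] = {}

def overwrite_partition(partition: str, rows: list[dict]) -> None:
    """Replace the partition's contents wholesale instead of appending.

    Running the same load twice yields the same table state, so retries
    and backfills are safe by construction (idempotent writes).
    """
    table[partition] = list(rows)

rows = [{"id": 1}, {"id": 2}]
overwrite_partition("dt=2024-01-01", rows)
overwrite_partition("dt=2024-01-01", rows)  # retry: no duplicates
print(len(table["dt=2024-01-01"]))
# 2
```

An append-based load would leave four rows after the retry; the overwrite version makes "re-run the failed day" a safe default operation instead of an incident.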
Resources
| Resource | Purpose |
|---|---|
| references/overview.md | Diagrams and decision flows |
| references/architecture-patterns.md | Medallion, data mesh |
| references/ingestion-patterns.md | dlt vs Airbyte, CDC |
| references/transformation-patterns.md | SQLMesh vs dbt |
| references/storage-formats.md | Iceberg vs Delta |
| references/query-engine-patterns.md | ClickHouse, DuckDB |
| references/streaming-patterns.md | Kafka, Flink |
| references/orchestration-patterns.md | Dagster, Airflow |
| references/bi-visualization-patterns.md | Metabase, Superset |
| references/cost-optimization.md | Cost levers and maintenance |
| references/operational-playbook.md | Monitoring and incident response |
| references/governance-catalog.md | Catalog, lineage, access control |
Templates
| Template | Purpose |
|---|---|
| assets/cross-platform/template-medallion-architecture.md | Baseline bronze/silver/gold plan |
| assets/cross-platform/template-data-pipeline.md | End-to-end pipeline skeleton |
| assets/cross-platform/template-ingestion-governance-checklist.md | Source onboarding checklist |
| assets/cross-platform/template-incremental-loading.md | Incremental + backfill plan |
| assets/cross-platform/template-schema-evolution.md | Schema change rules |
| assets/cross-platform/template-cost-optimization.md | Cost control checklist |
| assets/cross-platform/template-data-quality-governance.md | Quality contracts + SLOs |
| assets/cross-platform/template-data-quality-backfill-runbook.md | Backfill incident/runbook |
Related Skills
| Skill | Purpose |
|---|---|
| ai-mlops | ML deployment |
| ai-ml-data-science | Feature engineering |
| data-sql-optimization | OLTP optimization |