architecting-data

Data Architecture

Purpose

Guide architects and platform engineers through strategic data architecture decisions for modern cloud-native data platforms.

When to Use This Skill

Invoke this skill when:
  • Designing a new data platform or modernizing legacy systems
  • Choosing between data lake, data warehouse, or data lakehouse
  • Deciding on data modeling approaches (dimensional, normalized, data vault, wide tables)
  • Evaluating centralized vs data mesh architecture
  • Selecting open table formats (Apache Iceberg, Delta Lake, Apache Hudi)
  • Designing medallion architecture (bronze, silver, gold layers)
  • Implementing data governance and cataloging

Core Concepts

1. Storage Paradigms

Three primary patterns for analytical data storage:
Data Lake: Centralized repository for raw data at scale
  • Schema-on-read, cost-optimized ($0.02-0.03/GB/month)
  • Use when: Diverse data sources, exploratory analytics, ML/AI training data
Data Warehouse: Structured repository optimized for BI
  • Schema-on-write, ACID transactions, fast queries
  • Use when: Known BI requirements, strong governance needed
Data Lakehouse: Hybrid combining lake flexibility with warehouse reliability
  • Open table formats (Iceberg, Delta Lake), ACID on object storage
  • Use when: Mixed BI + ML workloads, cost optimization (60-80% cheaper than warehouse)
Decision Framework:
  • BI/Reporting only + Known queries → Data Warehouse
  • ML/AI primary + Raw data needed → Data Lake or Lakehouse
  • Mixed BI + ML + Cost optimization → Data Lakehouse (recommended)
  • Exploratory/Unknown use cases → Data Lake
For detailed comparison, see references/storage-paradigms.md.
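The decision framework above can be sketched as a small selector. This is an illustrative sketch only: the use-case labels, the `cost_sensitive` flag, and the function name are invented here, not part of any real library.

```python
# Hypothetical sketch of the storage-paradigm decision framework.
# Use-case labels are invented; adapt them to your own intake questionnaire.
def choose_storage_paradigm(use_case: str, cost_sensitive: bool = False) -> str:
    """Map a primary use case to a storage paradigm per the decision framework."""
    if use_case == "bi_only":
        # Known queries + high budget -> warehouse; cost-sensitive -> lakehouse
        return "data_lakehouse" if cost_sensitive else "data_warehouse"
    if use_case == "ml_primary":
        # Raw data needed -> lake, or lakehouse when cost optimization matters
        return "data_lakehouse" if cost_sensitive else "data_lake"
    if use_case == "mixed_bi_ml":
        return "data_lakehouse"  # recommended default for mixed workloads
    if use_case == "exploratory":
        return "data_lake"
    raise ValueError(f"unknown use case: {use_case}")

print(choose_storage_paradigm("mixed_bi_ml"))  # data_lakehouse
```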

2. Data Modeling Approaches

Four primary modeling patterns:
Dimensional (Kimball): Star/snowflake schemas for BI
  • Use when: Known query patterns, BI dashboards, trend analysis
Normalized (3NF): Eliminate redundancy for transactional systems
  • Use when: OLTP systems, frequent updates, strong consistency
Data Vault 2.0: Flexible model with complete audit trail
  • Use when: Compliance requirements, multiple sources, agile warehousing
Wide Tables: Denormalized, optimized for columnar storage
  • Use when: ML feature stores, data science notebooks, high-performance dashboards
Decision Framework:
  • Analytical (BI) + Known queries → Dimensional (Star Schema)
  • Transactional (OLTP) → Normalized (3NF)
  • Compliance/Audit → Data Vault 2.0
  • Data Science/ML → Wide Tables
For detailed patterns, see references/modeling-approaches.md.
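The modeling decision framework reduces to a lookup; the workload keys and approach labels below are invented for illustration, not canonical identifiers.

```python
# Illustrative mapping for the modeling decision framework; key and value
# strings are invented labels, not from any standard or library.
MODELING_BY_WORKLOAD = {
    "analytical_bi": "dimensional_star_schema",   # known queries, dashboards
    "transactional_oltp": "normalized_3nf",       # frequent updates, consistency
    "compliance_audit": "data_vault_2.0",         # full audit trail
    "data_science_ml": "wide_tables",             # denormalized feature-style tables
}

def choose_model(workload: str) -> str:
    return MODELING_BY_WORKLOAD[workload]
```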

3. Data Mesh Principles

Decentralized architecture for large organizations (>500 people).
Four Core Principles:
  1. Domain-oriented decentralization
  2. Data as a product (SLAs, quality, documentation)
  3. Self-serve data infrastructure
  4. Federated computational governance
Readiness Assessment (Score 1-5 each):
  1. Domain clarity
  2. Team maturity
  3. Platform capability
  4. Governance maturity
  5. Scale need
  6. Organizational buy-in
Scoring: 24-30: Strong candidate | 18-23: Hybrid | 12-17: Build foundation first | 6-11: Centralized
Red Flags: Small org (<100 people), unclear domains, no platform team, weak governance
For full guide, see references/data-mesh-guide.md.
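The six-factor assessment and its scoring bands can be sketched as a small function. Factor names follow the list above; the return strings are paraphrases of the stated score interpretations.

```python
# Sketch of the 6-factor data mesh readiness assessment; factor names follow
# the list above, thresholds follow the stated 24-30 / 18-23 / 12-17 / 6-11 bands.
def assess_mesh_readiness(scores: dict) -> str:
    factors = {"domain_clarity", "team_maturity", "platform_capability",
               "governance_maturity", "scale_need", "org_buy_in"}
    if set(scores) != factors or not all(1 <= v <= 5 for v in scores.values()):
        raise ValueError("need all six factors, each scored 1-5")
    total = sum(scores.values())
    if total >= 24:
        return "strong candidate for data mesh"
    if total >= 18:
        return "hybrid approach"
    if total >= 12:
        return "build foundation first"
    return "stay centralized"
```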

4. Medallion Architecture

Standard lakehouse pattern: Bronze (raw) → Silver (cleaned) → Gold (business-level)
Bronze Layer: Exact copy of source data, immutable, append-only
Silver Layer: Validated, deduplicated, typed data
Gold Layer: Business logic, aggregates, dimensional models, ML features
Data Quality by Layer:
  • Bronze → Silver: Schema validation, type checks, deduplication
  • Silver → Gold: Business rule validation, referential integrity
  • Gold: Anomaly detection, statistical checks
For patterns, see references/medallion-pattern.md.
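The Bronze → Silver quality gate (type checks plus deduplication on the latest ingested copy) can be sketched in plain Python. The record shape and column names here are invented stand-ins, not a real pipeline API.

```python
# Minimal sketch of a Bronze -> Silver promotion check: basic type validation
# plus deduplication, keeping the most recently ingested row per key.
from datetime import datetime

def promote_to_silver(bronze_rows: list) -> list:
    latest = {}
    for row in bronze_rows:
        # Type checks: drop rows that fail basic schema validation.
        if not isinstance(row.get("customer_id"), str) or not row["customer_id"]:
            continue
        ts = row.get("_ingested_at")
        if not isinstance(ts, datetime):
            continue
        # Deduplicate: keep only the latest ingested copy per customer_id.
        key = row["customer_id"]
        if key not in latest or ts > latest[key]["_ingested_at"]:
            latest[key] = row
    return list(latest.values())
```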

5. Open Table Formats

Enable ACID transactions on data lakes:
Apache Iceberg: Multi-engine, vendor-neutral (Context7: 79.7 score)
  • Use when: Avoid vendor lock-in, multi-engine flexibility
Delta Lake: Databricks ecosystem, Spark-optimized
  • Use when: Committed to Databricks
Apache Hudi: Optimized for CDC and frequent upserts
  • Use when: CDC-heavy workloads
Recommendation: Apache Iceberg for new projects (vendor-neutral, broadest support)
For comparison, see references/table-formats.md.

6. Modern Data Stack

Standard Layers:
  • Ingestion: Fivetran, Airbyte, Kafka
  • Storage: Snowflake, Databricks, BigQuery
  • Transformation: dbt (Context7: 87.0 score), Spark
  • Orchestration: Airflow, Dagster, Prefect
  • Visualization: Tableau, Looker, Power BI
  • Governance: DataHub, Alation, Great Expectations
Tool Selection:
  • Fivetran vs Airbyte: Pre-built connectors vs cost-sensitive
  • Snowflake vs Databricks: BI-focused vs ML-focused
  • dbt vs Spark: SQL-based vs large-scale processing
For detailed recommendations, see references/tool-recommendations.md and references/modern-data-stack.md.

7. Data Governance

Data Catalog: Searchable inventory (DataHub, Alation, Collibra)
Data Lineage: Track data flow (OpenLineage, Marquez)
Data Quality: Validation and testing (Great Expectations, Soda, dbt tests)
Access Control:
  • RBAC: Role-based (sales_analyst role)
  • ABAC: Attribute-based (row-level security)
  • Column-level: Dynamic data masking for PII
For governance patterns, see references/governance-patterns.md.
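Column-level masking as described above can be sketched as a row filter applied at read time. The role name, PII column list, and mask token are all invented for illustration; real systems (e.g. warehouse masking policies) apply this declaratively.

```python
# Hypothetical sketch of column-level dynamic masking for PII.
# Role names and the PII column set are invented for illustration.
PII_COLUMNS = {"email", "phone"}

def mask_row(row: dict, role: str) -> dict:
    if role == "pii_reader":  # privileged role sees raw values
        return dict(row)
    # All other roles get PII columns replaced with a mask token.
    return {col: ("***MASKED***" if col in PII_COLUMNS else val)
            for col, val in row.items()}
```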

Decision Frameworks

Framework 1: Storage Paradigm Selection

Step 1: Identify Primary Use Case
  • BI/Reporting only → Data Warehouse
  • ML/AI primary → Data Lake or Lakehouse
  • Mixed BI + ML → Data Lakehouse
  • Exploratory → Data Lake
Step 2: Evaluate Budget
  • High budget, known queries → Data Warehouse
  • Cost-sensitive, flexible → Data Lakehouse
Recommendation by Org Size:
  • Startup (<50): Data Warehouse (simplicity)
  • Growth (50-500): Data Lakehouse (balance)
  • Enterprise (>500): Hybrid or unified Lakehouse
See references/decision-frameworks.md.

Framework 2: Data Modeling Approach

Decision Tree:
  • Analytical (BI) workload → Dimensional or Wide Tables
  • Transactional (OLTP) → Normalized (3NF)
  • Compliance/Audit → Data Vault 2.0
  • Data Science/ML → Wide Tables
See references/decision-frameworks.md.

Framework 3: Data Mesh Readiness

Use 6-factor assessment. Score interpretation:
  • 24-30: Proceed with data mesh
  • 18-23: Hybrid approach
  • 12-17: Build foundation first
  • 6-11: Centralized
See references/decision-frameworks.md.

Framework 4: Open Table Format Selection

Decision Tree:
  • Multi-engine flexibility → Apache Iceberg
  • Databricks ecosystem → Delta Lake
  • Frequent upserts/CDC → Apache Hudi
Recommendation: Apache Iceberg for new projects
See references/decision-frameworks.md.

Common Scenarios

Startup Data Platform

Context: 50-person startup, PostgreSQL + MongoDB + Stripe
Recommendation:
  • Storage: BigQuery or Snowflake
  • Ingestion: Airbyte or Fivetran
  • Transformation: dbt
  • Orchestration: dbt Cloud
  • Architecture: Simple data warehouse
See references/scenarios.md.

Enterprise Modernization

Context: Legacy Oracle warehouse, need cloud migration
Recommendation:
  • Storage: Data Lakehouse (Databricks or Snowflake with Iceberg)
  • Strategy: Incremental migration with CDC
  • Architecture: Medallion (bronze, silver, gold)
  • Cost Savings: 60-80%
See references/scenarios.md.

Data Mesh Assessment

Context: 200-person company, 5-person central data team
Recommendation: NOT YET. Build foundation first.
  • Organization too small (<500 recommended)
  • Central team not yet bottleneck
  • Invest in self-serve platform and governance
See references/scenarios.md.

Tool Recommendations

Research-Validated (Context7, December 2025)

dbt: Score 87.0, 3,532+ code snippets
  • SQL-based transformations, version control, testing
  • Industry standard for data transformation
Apache Iceberg: Score 79.7, 832+ code snippets
  • Open table format, multi-engine, vendor-neutral
  • Production-ready (Netflix, Apple, Adobe)
Tool Stack by Use Case:
Startup: BigQuery + Airbyte + dbt + Metabase (<$1K/month)
Growth: Snowflake + Fivetran + dbt + Airflow + Tableau ($10K-50K/month)
Enterprise: Snowflake + Databricks + Fivetran + Kafka + dbt + Airflow + Alation ($50K-500K/month)
See references/tool-recommendations.md.

Implementation Patterns

Pattern 1: Medallion Architecture

```sql
-- Bronze: Raw ingestion
CREATE TABLE bronze.raw_customers (_ingested_at TIMESTAMP, _raw_data STRING);

-- Silver: Cleaned
CREATE TABLE silver.customers AS
SELECT json_extract(_raw_data, '$.id') AS customer_id, ...
FROM bronze.raw_customers
QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY _ingested_at DESC) = 1;

-- Gold: Business-level
CREATE TABLE gold.fact_sales AS
SELECT s.order_id, d.date_key, c.customer_key, ...
FROM silver.sales s
JOIN gold.dim_date d ON s.order_date = d.date
JOIN gold.dim_customer c ON s.customer_id = c.customer_id;  -- assumed dimension; resolves the c alias
```

Pattern 2: Apache Iceberg Table

```sql
CREATE TABLE catalog.db.sales (order_id BIGINT, order_date TIMESTAMP, amount DECIMAL(10,2))
USING iceberg
PARTITIONED BY (days(order_date));

-- Time travel
SELECT * FROM catalog.db.sales TIMESTAMP AS OF '2025-01-01';
```

Pattern 3: dbt Transformation

```sql
-- models/staging/stg_customers.sql
WITH source AS (SELECT * FROM {{ source('raw', 'customers') }}),
cleaned AS (
  SELECT customer_id, UPPER(customer_name) AS customer_name
  FROM source WHERE customer_id IS NOT NULL
)
SELECT * FROM cleaned
```
For complete examples, see examples/.

Best Practices

  1. Start simple: Avoid over-engineering; begin with warehouse or basic lakehouse
  2. Invest in governance early: Catalog, lineage, quality from day one
  3. Medallion architecture: Use bronze-silver-gold for clear quality layers
  4. Open table formats: Prefer Iceberg or Delta Lake to avoid vendor lock-in
  5. Assess mesh readiness: Don't decentralize prematurely (<500 people)
  6. Automate quality: Integrate tests (Great Expectations, dbt) into CI/CD
  7. Monitor pipelines: Observability is critical (freshness, quality, health)
  8. Document as code: Use dbt docs, DataHub, YAML for self-service
  9. Incremental loading: Only load new/changed data (watermark columns)
  10. Business alignment: Align architecture to outcomes, not just technologies
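Practice 9, incremental loading with a watermark column, can be sketched as: record the highest timestamp seen so far, and on each run pull only rows newer than it. Column and function names here are invented for illustration.

```python
# Sketch of watermark-based incremental loading: only rows whose updated_at
# exceeds the last recorded watermark are loaded, and the watermark advances.
def incremental_load(source_rows, watermark):
    """Return (new_or_changed_rows, advanced_watermark)."""
    new_rows = [r for r in source_rows if r["updated_at"] > watermark]
    # If nothing new arrived, the watermark stays where it was.
    new_watermark = max((r["updated_at"] for r in new_rows), default=watermark)
    return new_rows, new_watermark
```

In practice the watermark is persisted between runs (e.g. in a state table) so each execution resumes from the previous high-water mark.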

Anti-Patterns

  • ❌ Data swamp: Lake without governance or cataloging
  • ❌ Premature mesh: Mesh before organizational readiness
  • ❌ Tool sprawl: Too many tools without integration
  • ❌ No quality checks: "Garbage in, garbage out"
  • ❌ Centralized bottleneck: Single team in large org (>500 people)
  • ❌ Vendor lock-in: Proprietary formats without migration path
  • ❌ No lineage: Can't answer "where did this come from?"
  • ❌ Over-engineering: Complex architecture for simple use cases

Integration with Other Skills

Direct Dependencies:
  • ingesting-data: ETL/ELT mechanics, Fivetran, Airbyte implementation
  • data-transformation: dbt and Dataform detailed implementation
  • streaming-data: Kafka, Flink for real-time pipelines
Complementary:
  • databases-relational: PostgreSQL, MySQL as source systems
  • databases-document: MongoDB, DynamoDB as sources
  • ai-data-engineering: Feature stores, ML training pipelines
  • designing-distributed-systems: CAP theorem, consistency models
  • observability: Monitoring pipeline health, data quality metrics
Downstream:
  • visualizing-data: BI and dashboard patterns
  • sql-optimization: Query performance tuning
Common Workflows:
End-to-End Analytics:
data-architecture (warehouse) → ingesting-data (Fivetran) →
data-transformation (dbt) → visualizing-data (Tableau)
Data Platform for AI/ML:
data-architecture (lakehouse) → ingesting-data (Kafka) →
data-transformation (dbt features) → ai-data-engineering (feature store)

Further Reading

Reference Files:
  • decision-frameworks.md - All 4 decision frameworks in detail
  • storage-paradigms.md - Lake vs warehouse vs lakehouse
  • modeling-approaches.md - Dimensional, normalized, data vault, wide
  • data-mesh-guide.md - Data mesh principles and implementation
  • medallion-pattern.md - Bronze, silver, gold layers
  • table-formats.md - Iceberg, Delta Lake, Hudi comparison
  • tool-recommendations.md - Tool analysis and recommendations
  • modern-data-stack.md - Tool categories and selection
  • governance-patterns.md - Catalog, lineage, quality, access control
  • scenarios.md - Startup, enterprise, data mesh scenarios
Examples:
  • examples/dbt-project/ - dbt project with medallion architecture
External Resources: