data-architect

Data Architect

Overview

Design data architecture at enterprise and solution levels. This skill covers data mesh, lakehouse, governance, domain-driven design, conceptual/logical/physical data modeling, platform selection, and compliance frameworks. It produces architecture decision records (ADRs), data model diagrams, platform comparison matrices, and governance policy templates.

When to Use

Choosing among warehouse, lake, lakehouse, mesh, or streaming-first patterns
Creating conceptual, logical, or physical data models and ADRs
Defining data governance, catalog, quality, and compliance frameworks
Evaluating data platforms and long-term TCO or vendor trade-offs

When NOT to Use

Day-to-day pipeline on-call, SLA breaches, or shift handoffs → use `data-system-ops-lead`
Single-platform SQL tuning or star-schema implementation detail → use `data-warehouse-engineer`
dbt project implementation, mart tests, and analytics CI → use `analytics-data-engineer`
Team roadmaps, sprint cadence, or governance operations execution → use `data-manager`
OWL/RDF ontologies or knowledge-graph construction → use `ontology-engineer`
Application integration patterns and non-data system ADRs → use `senior-system-architecture`
LLM/RAG/copilot solution architecture and AI ADRs → use `applied-ai-architect-commercial-enterprise`

Features

Architecture decision framework with weighted criteria evaluation
Progressive data modeling workflow (conceptual → logical → physical)
Platform selection decision tree for warehouse/lake/lakehouse/mesh/streaming
Governance pillar planning with tool recommendations
ADR template generation and stakeholder review processes

Usage

Identify the user's data architecture need (platform choice, modeling, governance, or decision framework)
Follow the corresponding workflow below
Produce structured outputs: ADRs, data model diagrams, platform comparison matrices, or governance policies

Examples

User: "Should we use a data lake or data warehouse for our analytics?" Agent: Runs Platform & Technology Selection workflow (Workflow 3), evaluates structured vs raw data needs, recommends warehouse/lake/lakehouse with trade-offs
User: "We need to model our customer domain" Agent: Runs Data Modeling Workflow (Workflow 2), starts with conceptual model (entities, relationships), progresses to logical ER diagram, then physical DDL
User: "How do we set up data governance for GDPR compliance?" Agent: Runs Governance & Compliance Planning (Workflow 4), maps GDPR requirements to governance pillars, recommends tools and controls

Core Workflows

1. Architecture Decision Framework

Use this 5-step process for any major data architecture decision:

1. Define the decision context
- Business drivers (scale, latency, cost, compliance)
- Constraints (budget, timeline, existing tech, team skills)
- Stakeholders (data engineers, analysts, product, legal)
2. Identify alternatives
- At least 3 options (do nothing, minimal change, transformative)
- Include cloud-native, hybrid, and open-source alternatives
3. Evaluate against criteria

Criterion	Weight	Score (1-5) per option
Scalability	High
Cost (TCO 3yr)	High
Time to value	Medium
Operational complexity	Medium
Team fit	Medium
Vendor lock-in risk	Low

4. Assess risks & mitigation
- Migration risk, talent risk, operational risk
- POC plan for the top 2 options
5. Document the decision
- ADR (Architecture Decision Record) with context, decision, consequences
- Share with stakeholders; revisit quarterly
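
The weighted evaluation step can be made concrete with a small scoring helper. The sketch below is illustrative only: the numeric weights (High = 3, Medium = 2, Low = 1) and the function names `weighted_score` and `rank_options` are assumptions, since the criteria table gives only qualitative weights. It also assumes that a higher score is always better, so for a criterion such as vendor lock-in risk a 5 means the lowest risk.

```python
# Assumed numeric mapping; the criteria table only says High/Medium/Low.
WEIGHTS = {"High": 3, "Medium": 2, "Low": 1}

# Criteria and qualitative weights as listed in the evaluation table.
CRITERIA = {
    "Scalability": "High",
    "Cost (TCO 3yr)": "High",
    "Time to value": "Medium",
    "Operational complexity": "Medium",
    "Team fit": "Medium",
    "Vendor lock-in risk": "Low",
}


def weighted_score(scores: dict[str, int]) -> float:
    """Return the weighted average (on the 1-5 scale) for one option."""
    missing = set(CRITERIA) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    for name in CRITERIA:
        if not 1 <= scores[name] <= 5:
            raise ValueError(f"score for {name!r} must be between 1 and 5")
    total_weight = sum(WEIGHTS[level] for level in CRITERIA.values())
    weighted_sum = sum(WEIGHTS[level] * scores[name] for name, level in CRITERIA.items())
    return weighted_sum / total_weight


def rank_options(options: dict[str, dict[str, int]]) -> list[tuple[str, float]]:
    """Rank options from highest to lowest weighted score."""
    scored = [(name, weighted_score(scores)) for name, scores in options.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Passing a dictionary of option names to per-criterion scores into `rank_options` yields an ordered shortlist; the top two entries are natural candidates for the POC plan in the risk step.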

2. Data Modeling Workflow

Progressive refinement from business to implementation:

Stage	Output	Audience	Key Activities
Conceptual	Entity list, relationships, business glossary	Business stakeholders	Workshops, domain events
Logical	Normalized ER diagram, attributes, keys	Data analysts, architects	Identify entities, resolve many-to-many relationships
Physical	DB-specific DDL, partitions, indexes	Engineers	Platform optimization, denormalization

Key principles:

Start with the business question, not the technology
Use surrogate keys in the physical model and natural keys in the logical model
Denormalize only when you have a performance requirement
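
To illustrate the surrogate-key principle, here is a minimal physical-model sketch using Python's standard `sqlite3` module. The `customer` and `customer_order` tables, their columns, and the helper names are hypothetical examples, not part of the skill. The natural key identified in the logical model (here, an email address) stays as a `UNIQUE` constraint, while relationships use the surrogate key.

```python
import sqlite3

# Hypothetical physical DDL: surrogate keys are primary keys, natural keys stay unique.
DDL = """
CREATE TABLE customer (
    customer_sk    INTEGER PRIMARY KEY,      -- surrogate key (physical model)
    customer_email TEXT NOT NULL UNIQUE,     -- natural key (logical model)
    full_name      TEXT NOT NULL
);
CREATE TABLE customer_order (
    order_sk     INTEGER PRIMARY KEY,        -- surrogate key
    order_number TEXT NOT NULL UNIQUE,       -- natural key
    customer_sk  INTEGER NOT NULL REFERENCES customer(customer_sk),
    ordered_at   TEXT NOT NULL
);
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with foreign-key enforcement enabled."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(DDL)
    return conn


def add_customer(conn: sqlite3.Connection, email: str, full_name: str) -> int:
    """Insert a customer and return the generated surrogate key."""
    cur = conn.execute(
        "INSERT INTO customer (customer_email, full_name) VALUES (?, ?)",
        (email, full_name),
    )
    return cur.lastrowid
```

Because orders reference `customer_sk` rather than the email address, a customer can change their email without touching any order rows.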

3. Platform & Technology Selection

Decision tree:

Need structured analytics + BI at scale? → Data Warehouse (Snowflake, BigQuery, Redshift)
Need raw data + ML + flexible schemas? → Data Lake (S3 + Athena/Spark)
Need both with ACID guarantees? → Lakehouse (Databricks, or open table formats such as Iceberg and Hudi)
Need domain ownership + federated governance? → Data Mesh (each domain owns its own warehouses/lakes)
Need real-time + low latency? → Streaming-first (Kafka + Flink + materialized views)
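
The decision tree above could be encoded roughly as follows. This is a sketch under stated assumptions: the `Needs` fields and the `recommend_patterns` function are hypothetical names, and treating data mesh and streaming-first as additions on top of the storage choice is an interpretation, not something the tree itself states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Needs:
    """Hypothetical flags, one per question in the decision tree."""
    structured_bi_at_scale: bool = False
    raw_data_and_ml: bool = False
    acid_guarantees: bool = False
    domain_ownership: bool = False
    real_time_low_latency: bool = False


def recommend_patterns(needs: Needs) -> list[str]:
    """Map stated needs to the architecture patterns named in the tree."""
    patterns: list[str] = []
    wants_both = needs.structured_bi_at_scale and needs.raw_data_and_ml
    if wants_both and needs.acid_guarantees:
        # Structured analytics plus raw/ML data with ACID: lakehouse.
        patterns.append("lakehouse")
    else:
        if needs.structured_bi_at_scale:
            patterns.append("data warehouse")
        if needs.raw_data_and_ml:
            patterns.append("data lake")
    # Assumption: these two patterns compose with the storage choice above.
    if needs.domain_ownership:
        patterns.append("data mesh")
    if needs.real_time_low_latency:
        patterns.append("streaming-first")
    return patterns
```

For example, `recommend_patterns(Needs(structured_bi_at_scale=True, real_time_low_latency=True))` returns `["data warehouse", "streaming-first"]`, which an architect would then refine using the weighted criteria from Workflow 1.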

4. Governance & Compliance Planning

Governance pillars:

Pillar	Activities	Tools
Data Quality	Profiling, validation rules, anomaly detection	dbt tests, Great Expectations, Monte Carlo
Data Catalog	Metadata, lineage, discovery	DataHub, Collibra, Alation
Access Control	RBAC, ABAC, masking, encryption	Platform-native + Immuta/Okera
Master Data Management	Golden records, deduplication	Informatica, Reltio, custom MDM
Compliance	GDPR, CCPA, HIPAA, SOC 2	Legal review + technical controls
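
As a rough illustration of the "validation rules" activity in the Data Quality pillar, the sketch below expresses rules as named predicates in plain Python. It deliberately avoids the APIs of the tools listed in the table; the rule names, the column names (`customer_id`, `email`, `age`), and the `validate` function are all hypothetical.

```python
from typing import Callable

Row = dict[str, object]
Rule = tuple[str, Callable[[Row], bool]]

# Hypothetical rules for an assumed customer dataset.
RULES: list[Rule] = [
    ("customer_id is present", lambda r: r.get("customer_id") is not None),
    ("email contains @", lambda r: isinstance(r.get("email"), str) and "@" in r["email"]),
    ("age within 0-130", lambda r: isinstance(r.get("age"), int) and 0 <= r["age"] <= 130),
]


def validate(rows: list[Row], rules: list[Rule] = RULES) -> dict[str, list[int]]:
    """Return, for each rule name, the indexes of the rows that fail it."""
    failures: dict[str, list[int]] = {name: [] for name, _ in rules}
    for index, row in enumerate(rows):
        for name, check in rules:
            if not check(row):
                failures[name].append(index)
    return failures
```

In practice the same rules would usually be expressed in whichever quality tool the platform adopts; the point here is only that each rule is a named, testable predicate whose failures can be reported per row.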
