data-architect
Data Architect
Overview
Design data architecture at enterprise and solution levels. This skill covers data mesh, lakehouse,
governance, domain-driven design, conceptual/logical/physical data modeling, platform selection,
and compliance frameworks. Produce ADRs, data model diagrams, platform comparison matrices,
and governance policy templates.
When to Use
- Choosing among warehouse, lake, lakehouse, mesh, or streaming-first patterns
- Creating conceptual, logical, or physical data models and ADRs
- Defining data governance, catalog, quality, and compliance frameworks
- Evaluating data platforms and long-term TCO or vendor trade-offs
When NOT to Use
- Day-to-day pipeline on-call, SLA breaches, or shift handoffs → use data-system-ops-lead
- Single-platform SQL tuning or star-schema implementation detail → use data-warehouse-engineer
- dbt project implementation, mart tests, and analytics CI → use analytics-data-engineer
- Team roadmaps, sprint cadence, or governance operations execution → use data-manager
- OWL/RDF ontologies or knowledge-graph construction → use ontology-engineer
- Application integration patterns and non-data system ADRs → use senior-system-architecture
- LLM/RAG/copilot solution architecture and AI ADRs → use applied-ai-architect-commercial-enterprise
Features
- Architecture decision framework with weighted criteria evaluation
- Progressive data modeling workflow (conceptual → logical → physical)
- Platform selection decision tree for warehouse/lake/lakehouse/mesh/streaming
- Governance pillar planning with tool recommendations
- ADR template generation and stakeholder review processes
Usage
- Identify the user's data architecture need (platform choice, modeling, governance, or decision framework)
- Follow the corresponding workflow below
- Produce structured outputs: ADRs, data model diagrams, platform comparison matrices, or governance policies
Examples
- User: "Should we use a data lake or data warehouse for our analytics?"
  Agent: Runs Platform & Technology Selection workflow (Workflow 3), evaluates structured vs raw data needs, recommends warehouse/lake/lakehouse with trade-offs
- User: "We need to model our customer domain"
  Agent: Runs Data Modeling Workflow (Workflow 2), starts with conceptual model (entities, relationships), progresses to logical ER diagram, then physical DDL
- User: "How do we set up data governance for GDPR compliance?"
  Agent: Runs Governance & Compliance Planning (Workflow 4), maps GDPR requirements to governance pillars, recommends tools and controls
Core Workflows
1. Architecture Decision Framework
Use this 5-step process for any major data architecture decision:
1. Define the decision context
   - Business drivers (scale, latency, cost, compliance)
   - Constraints (budget, timeline, existing tech, team skills)
   - Stakeholders (data engineers, analysts, product, legal)
2. Identify alternatives
   - At least 3 options (do nothing, minimal change, transformative)
   - Include cloud-native, hybrid, and open-source alternatives
3. Evaluate against criteria

   | Criterion | Weight | Score (1-5) for each option |
   |---|---|---|
   | Scalability | High | |
   | Cost (TCO 3yr) | High | |
   | Time to value | Medium | |
   | Operational complexity | Medium | |
   | Team fit | Medium | |
   | Vendor lock-in risk | Low | |

4. Assess risks & mitigation
   - Migration risk, talent risk, operational risk
   - POC plan for the top 2 options
5. Document the decision
   - ADR (Architecture Decision Record) with context, decision, consequences
   - Share with stakeholders; revisit quarterly
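The weighted evaluation in step 3 can be sketched as a small scoring helper. This is a minimal illustration, not a prescribed method: the numeric mapping of High/Medium/Low to 3/2/1 is an assumption, as is the convention that a higher score is always better (so a 5 for "Vendor lock-in risk" means low risk). The option names and scores in the usage section are hypothetical.

```python
# Assumption: High/Medium/Low map to 3/2/1. The source only names the levels.
WEIGHTS = {"High": 3, "Medium": 2, "Low": 1}

# Criteria and weight levels taken from the evaluation table.
CRITERIA = {
    "Scalability": "High",
    "Cost (TCO 3yr)": "High",
    "Time to value": "Medium",
    "Operational complexity": "Medium",
    "Team fit": "Medium",
    "Vendor lock-in risk": "Low",
}


def weighted_score(scores: dict[str, int]) -> float:
    """Return the weight-normalized average of 1-5 scores for one option."""
    missing = set(CRITERIA) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    for name in CRITERIA:
        if not 1 <= scores[name] <= 5:
            raise ValueError(f"score for {name!r} must be between 1 and 5")
    total_weight = sum(WEIGHTS[level] for level in CRITERIA.values())
    weighted = sum(WEIGHTS[CRITERIA[name]] * scores[name] for name in CRITERIA)
    return weighted / total_weight


def rank_options(options: dict[str, dict[str, int]]) -> list[tuple[str, float]]:
    """Sort options from highest to lowest weighted score."""
    scored = [(name, weighted_score(scores)) for name, scores in options.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical options and scores, for illustration only.
    options = {
        "do nothing": {name: 3 for name in CRITERIA},
        "lakehouse": {
            "Scalability": 5,
            "Cost (TCO 3yr)": 3,
            "Time to value": 3,
            "Operational complexity": 2,
            "Team fit": 3,
            "Vendor lock-in risk": 4,
        },
    }
    for name, score in rank_options(options):
        print(f"{name}: {score:.2f}")
```

The ranking is only an input to step 4: the top two options would still go through a POC before the decision is recorded in an ADR.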
2. Data Modeling Workflow
Progressive refinement from business to implementation:
| Stage | Output | Audience | Key Activities |
|---|---|---|---|
| Conceptual | Entity list, relationships, business glossary | Business stakeholders | Workshops, domain events |
| Logical | Normalized ER diagram, attributes, keys | Data analysts, architects | Identify entities, resolve many-to-many |
| Physical | DB-specific DDL, partitions, indexes | Engineers | Platform optimization, denormalization |
Key principles:
- Start with the business question, not the technology
- Use surrogate keys in the physical model; identify entities by natural keys in the logical model
- Denormalize only when you have a performance requirement
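As an illustration of the step from logical to physical model, the sketch below builds a small customer-domain schema in SQLite through Python's standard `sqlite3` module. The tables, columns, and helper functions are hypothetical. They show a surrogate primary key alongside a natural key kept as a `UNIQUE` constraint, and an associative table that resolves a many-to-many relationship.

```python
import sqlite3
from typing import Optional

# Hypothetical physical DDL for a customer domain (SQLite dialect).
DDL = """
CREATE TABLE customer (
    customer_sk     INTEGER PRIMARY KEY,   -- surrogate key (physical model)
    customer_number TEXT NOT NULL UNIQUE,  -- natural key from the logical model
    full_name       TEXT NOT NULL
);
CREATE TABLE product (
    product_sk   INTEGER PRIMARY KEY,
    sku          TEXT NOT NULL UNIQUE,
    product_name TEXT NOT NULL
);
-- Associative table resolving the many-to-many customer/product relationship.
CREATE TABLE purchase (
    customer_sk  INTEGER NOT NULL REFERENCES customer(customer_sk),
    product_sk   INTEGER NOT NULL REFERENCES product(product_sk),
    purchased_on TEXT NOT NULL,
    PRIMARY KEY (customer_sk, product_sk, purchased_on)
);
"""


def build_schema() -> sqlite3.Connection:
    """Create the schema in an in-memory database and return the connection."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
    conn.executescript(DDL)
    return conn


def add_customer(conn: sqlite3.Connection, number: str, name: str) -> int:
    """Insert a customer and return the generated surrogate key."""
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute(
            "INSERT INTO customer (customer_number, full_name) VALUES (?, ?)",
            (number, name),
        )
    return cursor.lastrowid


def lookup_customer_sk(conn: sqlite3.Connection, number: str) -> Optional[int]:
    """Resolve a natural key to its surrogate key, or None if it is unknown."""
    row = conn.execute(
        "SELECT customer_sk FROM customer WHERE customer_number = ?", (number,)
    ).fetchone()
    return row[0] if row else None


if __name__ == "__main__":
    conn = build_schema()
    sk = add_customer(conn, "C-001", "Ada Example")
    print("surrogate key:", sk, "resolved:", lookup_customer_sk(conn, "C-001"))
```

Keeping the natural key as a constraint means the business identifier stays enforced even though joins use the surrogate key.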
3. Platform & Technology Selection
Decision tree:
- Need structured analytics + BI at scale? → Data Warehouse (Snowflake, BigQuery, Redshift)
- Need raw data + ML + flexible schemas? → Data Lake (S3 + Athena/Spark)
- Need both with ACID guarantees? → Lakehouse (Databricks, Iceberg, Hudi)
- Need domain ownership + federated governance? → Data Mesh (multiple warehouses/lakes per domain)
- Need real-time + low latency? → Streaming-first (Kafka + Flink + materialized views)
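The decision tree can be expressed as a small function, sketched below. The `Needs` fields and the function name are hypothetical. The tree does not say what to do when several branches match, so returning every matching pattern is an assumption, as is collapsing warehouse plus lake into lakehouse only when ACID guarantees are also required.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Needs:
    """Hypothetical flags, one per question in the decision tree."""
    structured_bi_at_scale: bool = False
    raw_data_ml_flexible_schema: bool = False
    acid_guarantees: bool = False
    domain_ownership_federated_governance: bool = False
    real_time_low_latency: bool = False


def recommend_patterns(needs: Needs) -> list[str]:
    """Return every architecture pattern whose branch matches the needs."""
    patterns: list[str] = []
    wants_both = needs.structured_bi_at_scale and needs.raw_data_ml_flexible_schema
    if wants_both and needs.acid_guarantees:
        # "Need both with ACID guarantees?" branch
        patterns.append("Lakehouse")
    else:
        if needs.structured_bi_at_scale:
            patterns.append("Data Warehouse")
        if needs.raw_data_ml_flexible_schema:
            patterns.append("Data Lake")
    if needs.domain_ownership_federated_governance:
        patterns.append("Data Mesh")
    if needs.real_time_low_latency:
        patterns.append("Streaming-first")
    return patterns


if __name__ == "__main__":
    needs = Needs(structured_bi_at_scale=True, real_time_low_latency=True)
    print(recommend_patterns(needs))
```

An empty result means none of the branches applied, which is a cue to revisit the decision context rather than a recommendation in itself.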
4. Governance & Compliance Planning
Governance pillars:
| Pillar | Activities | Tools |
|---|---|---|
| Data Quality | Profiling, validation rules, anomaly detection | dbt tests, Great Expectations, Monte Carlo |
| Data Catalog | Metadata, lineage, discovery | DataHub, Collibra, Alation |
| Access Control | RBAC, ABAC, masking, encryption | Platform-native + Immuta/Okera |
| Master Data Management | Golden records, deduplication | Informatica, Reltio, custom MDM |
| Compliance | GDPR, CCPA, HIPAA, SOC 2 | Legal review + technical controls |
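To make the Data Quality pillar concrete, the sketch below shows the kind of validation rules (not-null, uniqueness, value range) that the listed tools let you declare. It is plain Python and does not use any of those tools' APIs. The column names, the 0 to 120 age range, and the sample rows are hypothetical.

```python
from collections import Counter
from typing import Any

Row = dict[str, Any]


def check_not_null(rows: list[Row], column: str) -> list[int]:
    """Return indexes of rows where the column is missing or None."""
    return [i for i, row in enumerate(rows) if row.get(column) is None]


def check_unique(rows: list[Row], column: str) -> list[Any]:
    """Return non-null values that appear more than once in the column."""
    counts = Counter(row[column] for row in rows if row.get(column) is not None)
    return [value for value, n in counts.items() if n > 1]


def check_range(rows: list[Row], column: str, low: float, high: float) -> list[int]:
    """Return indexes of rows whose non-null value falls outside [low, high]."""
    return [
        i
        for i, row in enumerate(rows)
        if row.get(column) is not None and not low <= row[column] <= high
    ]


def run_checks(rows: list[Row]) -> dict[str, list]:
    """Run a hypothetical rule set and collect the failures per rule."""
    return {
        "email_not_null": check_not_null(rows, "email"),
        "customer_id_unique": check_unique(rows, "customer_id"),
        "age_in_range": check_range(rows, "age", 0, 120),
    }


if __name__ == "__main__":
    # Hypothetical sample data with one violation per rule.
    sample = [
        {"customer_id": 1, "email": "a@example.com", "age": 34},
        {"customer_id": 2, "email": None, "age": 29},
        {"customer_id": 2, "email": "c@example.com", "age": 212},
    ]
    for rule, failures in run_checks(sample).items():
        print(rule, "PASS" if not failures else f"FAIL {failures}")
```

In practice these rules would more likely be declared in one of the tools from the table; the sketch only shows what each rule checks.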