data-architect


Data Architect


Overview


Design data architecture at enterprise and solution levels. This skill covers data mesh, lakehouse, governance, domain-driven design, conceptual/logical/physical data modeling, platform selection, and compliance frameworks. Produce ADRs, data model diagrams, platform comparison matrices, and governance policy templates.

When to Use


  • Choosing among warehouse, lake, lakehouse, mesh, or streaming-first patterns
  • Creating conceptual, logical, or physical data models and ADRs
  • Defining data governance, catalog, quality, and compliance frameworks
  • Evaluating data platforms and long-term TCO or vendor trade-offs

When NOT to Use


  • Day-to-day pipeline on-call, SLA breaches, or shift handoffs → use data-system-ops-lead
  • Single-platform SQL tuning or star-schema implementation detail → use data-warehouse-engineer
  • dbt project implementation, mart tests, and analytics CI → use analytics-data-engineer
  • Team roadmaps, sprint cadence, or governance operations execution → use data-manager
  • OWL/RDF ontologies or knowledge-graph construction → use ontology-engineer
  • Application integration patterns and non-data system ADRs → use senior-system-architecture
  • LLM/RAG/copilot solution architecture and AI ADRs → use applied-ai-architect-commercial-enterprise

Features


  • Architecture decision framework with weighted criteria evaluation
  • Progressive data modeling workflow (conceptual → logical → physical)
  • Platform selection decision tree for warehouse/lake/lakehouse/mesh/streaming
  • Governance pillar planning with tool recommendations
  • ADR template generation and stakeholder review processes

Usage


  1. Identify the user's data architecture need (platform choice, modeling, governance, or decision framework)
  2. Follow the corresponding workflow below
  3. Produce structured outputs: ADRs, data model diagrams, platform comparison matrices, or governance policies

Examples


  • User: "Should we use a data lake or data warehouse for our analytics?" Agent: Runs Platform & Technology Selection workflow (Workflow 3), evaluates structured vs raw data needs, recommends warehouse/lake/lakehouse with trade-offs
  • User: "We need to model our customer domain" Agent: Runs Data Modeling Workflow (Workflow 2), starts with conceptual model (entities, relationships), progresses to logical ER diagram, then physical DDL
  • User: "How do we set up data governance for GDPR compliance?" Agent: Runs Governance & Compliance Planning (Workflow 4), maps GDPR requirements to governance pillars, recommends tools and controls

Core Workflows


1. Architecture Decision Framework


Use this 5-step process for any major data architecture decision:
  1. Define the decision context
    • Business drivers (scale, latency, cost, compliance)
    • Constraints (budget, timeline, existing tech, team skills)
    • Stakeholders (data engineers, analysts, product, legal)
  2. Identify alternatives
    • At least 3 options (do nothing, minimal change, transformative)
    • Include cloud-native, hybrid, and open-source alternatives
  3. Evaluate against criteria
    Criterion | Weight | Score per option (1-5)
    Scalability | High |
    Cost (TCO 3yr) | High |
    Time to value | Medium |
    Operational complexity | Medium |
    Team fit | Medium |
    Vendor lock-in risk | Low |
  4. Assess risks & mitigation
    • Migration risk, talent risk, operational risk
    • POC plan for the top 2 options
  5. Document the decision
    • ADR (Architecture Decision Record) with context, decision, consequences
    • Share with stakeholders; revisit quarterly
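Step 3 can be sketched as a weighted average over the criteria table. This is a minimal illustration, not a prescribed method: the numeric values assigned to High/Medium/Low, the function names, and the sample scores are all assumptions made for the example.

```python
# Numeric values for the qualitative weights are an assumption (High=3, Medium=2, Low=1).
WEIGHT_VALUES = {"High": 3, "Medium": 2, "Low": 1}

# Criteria and weights as listed in the evaluation table.
CRITERIA = {
    "Scalability": "High",
    "Cost (TCO 3yr)": "High",
    "Time to value": "Medium",
    "Operational complexity": "Medium",
    "Team fit": "Medium",
    "Vendor lock-in risk": "Low",
}


def weighted_score(scores: dict[str, int]) -> float:
    """Return the weighted average (on the 1-5 scale) of one option's scores."""
    missing = set(CRITERIA) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    for criterion in CRITERIA:
        if not 1 <= scores[criterion] <= 5:
            raise ValueError(f"score for {criterion!r} must be between 1 and 5")
    total_weight = sum(WEIGHT_VALUES[w] for w in CRITERIA.values())
    weighted_sum = sum(WEIGHT_VALUES[CRITERIA[c]] * scores[c] for c in CRITERIA)
    return weighted_sum / total_weight


def rank_options(options: dict[str, dict[str, int]]) -> list[tuple[str, float]]:
    """Rank options from highest to lowest weighted score."""
    ranked = [(name, weighted_score(scores)) for name, scores in options.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Hypothetical scores for two of the alternatives, for illustration only.
options = {
    "minimal change": {
        "Scalability": 2, "Cost (TCO 3yr)": 4, "Time to value": 5,
        "Operational complexity": 4, "Team fit": 5, "Vendor lock-in risk": 3,
    },
    "transformative": {
        "Scalability": 5, "Cost (TCO 3yr)": 3, "Time to value": 2,
        "Operational complexity": 2, "Team fit": 3, "Vendor lock-in risk": 2,
    },
}

for name, score in rank_options(options):
    print(f"{name}: {score:.2f}")
```

The ranking is only an input to step 4: the top two options would still go through a POC before the ADR is written.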

2. Data Modeling Workflow


Progressive refinement from business to implementation:
Stage | Output | Audience | Key Activities
Conceptual | Entity list, relationships, business glossary | Business stakeholders | Workshops, domain events
Logical | Normalized ER diagram, attributes, keys | Data analysts, architects | Identify entities, resolve many-to-many
Physical | DB-specific DDL, partitions, indexes | Engineers | Platform optimization, denormalization
Key principles:
  • Start with the business question, not the technology
  • Use surrogate keys in the physical model and natural keys in the logical model
  • Denormalize only when you have a performance requirement
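The surrogate-versus-natural-key principle can be illustrated with a small Python sketch. The customer entity, its attribute names, and the `SurrogateKeyAssigner` helper are hypothetical; a real platform would typically generate surrogate keys with a sequence or identity column instead.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class LogicalCustomer:
    """Logical model: identified by a natural (business) key."""
    customer_number: str  # natural key, hypothetical attribute
    name: str


@dataclass(frozen=True)
class PhysicalCustomerRow:
    """Physical model: identified by a surrogate key; the natural key is kept as an attribute."""
    customer_sk: int  # surrogate key, meaningless outside the database
    customer_number: str
    name: str


class SurrogateKeyAssigner:
    """Hands out one stable integer key per natural key (in-memory stand-in for a sequence)."""

    def __init__(self) -> None:
        self._next_key = count(1)
        self._by_natural_key: dict[str, int] = {}

    def key_for(self, natural_key: str) -> int:
        if natural_key not in self._by_natural_key:
            self._by_natural_key[natural_key] = next(self._next_key)
        return self._by_natural_key[natural_key]


def to_physical(customers: list[LogicalCustomer],
                assigner: SurrogateKeyAssigner) -> list[PhysicalCustomerRow]:
    """Map logical entities to physical rows, assigning surrogate keys."""
    return [
        PhysicalCustomerRow(assigner.key_for(c.customer_number), c.customer_number, c.name)
        for c in customers
    ]
```

Keeping the natural key as a plain attribute means the physical table can still be joined back to source systems even though its primary key carries no business meaning.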

3. Platform & Technology Selection


Decision tree:
  • Need structured analytics + BI at scale? → Data Warehouse (Snowflake, BigQuery, Redshift)
  • Need raw data + ML + flexible schemas? → Data Lake (S3 + Athena/Spark)
  • Need both with ACID guarantees? → Lakehouse (Databricks, Iceberg, Hudi)
  • Need domain ownership + federated governance? → Data Mesh (multiple warehouses/lakes per domain)
  • Need real-time + low latency? → Streaming-first (Kafka + Flink + materialized views)
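The decision tree above can be sketched as a small Python function. The `Needs` fields, the function name, and the choice to return several patterns when more than one branch applies are assumptions made for this illustration; the source lists the branches without fixing an evaluation order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Needs:
    """One boolean per question in the decision tree (field names are hypothetical)."""
    structured_bi_at_scale: bool = False
    raw_ml_flexible_schema: bool = False
    acid_guarantees: bool = False
    domain_ownership: bool = False
    real_time_low_latency: bool = False


def recommend_patterns(needs: Needs) -> list[str]:
    """Return every architecture pattern whose branch matches the stated needs."""
    patterns: list[str] = []
    wants_both = needs.structured_bi_at_scale and needs.raw_ml_flexible_schema
    if wants_both and needs.acid_guarantees:
        # "Need both with ACID guarantees?" replaces the separate warehouse and lake answers.
        patterns.append("Lakehouse")
    else:
        if needs.structured_bi_at_scale:
            patterns.append("Data Warehouse")
        if needs.raw_ml_flexible_schema:
            patterns.append("Data Lake")
    if needs.domain_ownership:
        patterns.append("Data Mesh")
    if needs.real_time_low_latency:
        patterns.append("Streaming-first")
    return patterns


print(recommend_patterns(Needs(structured_bi_at_scale=True)))
```

An empty result signals that none of the listed questions applies, in which case the Architecture Decision Framework is the better starting point.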

4. Governance & Compliance Planning


Governance pillars:
Pillar | Activities | Tools
Data Quality | Profiling, validation rules, anomaly detection | dbt tests, Great Expectations, Monte Carlo
Data Catalog | Metadata, lineage, discovery | DataHub, Collibra, Alation
Access Control | RBAC, ABAC, masking, encryption | Platform-native + Immuta/Okera
Master Data Management | Golden records, deduplication | Informatica, Reltio, custom MDM
Compliance | GDPR, CCPA, HIPAA, SOC 2 | Legal review + technical controls
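To make "validation rules" in the Data Quality pillar concrete, here is a minimal plain-Python sketch. It deliberately does not use the API of dbt tests, Great Expectations, or Monte Carlo; the rule names, the row fields, and the `failed_rules` helper are hypothetical.

```python
from typing import Callable

Row = dict[str, object]
Rule = Callable[[Row], bool]

# Each rule returns True when the row passes. Both rules are illustrative only.
RULES: dict[str, Rule] = {
    "customer_id is present": lambda row: row.get("customer_id") not in (None, ""),
    "email contains @": lambda row: isinstance(row.get("email"), str) and "@" in row["email"],
}


def failed_rules(row: Row) -> list[str]:
    """Return the names of all rules the row violates, in declaration order."""
    return [name for name, rule in RULES.items() if not rule(row)]


print(failed_rules({"customer_id": "", "email": "not-an-address"}))
```

In practice the same rules would usually be declared in whichever quality tool the platform adopts, so that failures feed its alerting and lineage features.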