data-model-design

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

Star Schema as Default

以星型Schema作为默认选择

For analytics data products, default to star schema with strategic denormalization:
Fact tables contain events at the lowest useful grain:
  • One row per event (transaction, visit, measurement, interaction)
  • Foreign keys to dimension tables
  • Numeric measures (amount, count, duration)
  • Timestamps at the grain of the analysis
Dimension tables contain context:
  • Descriptive attributes for filtering and grouping
  • Human-readable labels alongside codes
  • Hierarchies for drill-down (region -> state -> city)
NEVER fully denormalize into One Big Table. Many-to-many relationships cause exponential row growth. A patient with 10 conditions and 5 medications creates 50 rows instead of 15.
ALWAYS start with the query patterns. What questions will consumers ask? Design the schema to make those queries simple. If 80% of queries filter by date and group by category, those should be the primary dimensions.
对于分析类数据产品,默认采用星型Schema并结合策略性反规范化:
事实表包含最细粒度的事件数据:
  • 每行对应一个事件(交易、访问、测量、交互)
  • 包含指向维度表的外键
  • 数值型度量指标(金额、数量、时长)
  • 符合分析粒度的时间戳
维度表包含上下文信息:
  • 用于筛选和分组的描述性属性
  • 与编码对应的可读标签
  • 支持下钻的层级结构(地区→州→城市)
绝对不要完全反规范化为单一大表。多对多关系会导致行数呈指数级增长。例如一名患者有10种病症和5种药物,会生成50行数据而非15行。
始终从查询模式入手。消费者会提出哪些问题?设计Schema时要让这些查询变得简单。如果80%的查询按日期筛选并按类别分组,那么这些字段应作为核心维度。

Slowly Changing Dimensions (SCD)

缓慢变化维度(SCD)

Type 1 - Overwrite the old value. Use when history doesn't matter (correcting a typo in a name).
Type 2 - Add a new row with effective dates. Use when you need to track what was true at a point in time (patient address at time of visit, product price at time of sale). Add
effective_start_date
,
effective_end_date
, and
is_current
flag.
Type 3 - Add a column for the previous value. Use when you only need one level of history (current_category, previous_category).
Default to Type 2 for any dimension where the business asks "what was it at the time of X?" Start with Type 1 for everything else and upgrade when the need emerges.
Type 1 - 覆盖旧值。适用于历史数据无关紧要的场景(例如修正名称中的拼写错误)。
Type 2 - 添加带有生效日期的新行。适用于需要追踪某一时刻状态的场景(患者就诊时的地址、销售时的产品价格)。需添加
effective_start_date
effective_end_date
is_current
标识字段。
Type 3 - 添加存储历史值的列。适用于仅需保留一级历史数据的场景(current_category、previous_category)。
当业务方询问“X发生时的状态是什么?”时,默认采用Type 2。其他场景先从Type 1开始,当需求出现时再升级。

Architecture Decision Records

架构决策记录(ADR)

For every non-obvious modeling decision, write a lightweight ADR:
  • Context: What situation prompted the decision?
  • Options: What alternatives were considered?
  • Decision: What was chosen and why?
  • Consequences: What are the tradeoffs?
Keep ADRs in the repo alongside the schema. Future team members will ask "why is it modeled this way?" The ADR answers before they have to ask.
对于所有非显而易见的建模决策,撰写轻量化的ADR:
  • 背景:是什么场景促使做出该决策?
  • 可选方案:考虑过哪些替代方案?
  • 决策结果:选择了什么方案,原因是什么?
  • 影响:存在哪些权衡?
将ADR与Schema一同保存在代码仓库中。未来团队成员会问“为什么要这样建模?”,ADR会在他们提问前给出答案。

Design Validation

设计验证

Before building, validate with limited data:
  1. Load a representative sample (1-5% of volume)
  2. Run the top 10 expected queries
  3. Verify results match expectations
  4. Check query performance against SLA
  5. Confirm grain is correct (no unexpected row multiplication)
在构建前,使用有限数据进行验证:
  1. 加载具有代表性的样本数据(占总量的1-5%)
  2. 运行排名前10的预期查询
  3. 验证结果是否符合预期
  4. 检查查询性能是否符合服务水平协议(SLA)
  5. 确认粒度是否正确(无意外的行倍增情况)

Source Schema Evaluation

源Schema评估

You probably won't design the OLTP schema. But you'll need to evaluate whether it's a reliable source for your data product.
Quick checklist before building on a source system:
  • Primary keys defined and enforced (not just implied by convention)
  • Foreign keys constrained at the database level (not just in the application)
  • Indexes on FK columns (missing indexes = slow extracts)
  • NOT NULL on required fields (nullable everything = garbage in)
  • Migrations are reversible (no destructive ALTER TABLE without a rollback plan)
Anti-patterns to flag immediately:
  • VARCHAR(255)
    on every string column (signals no thought about data types)
  • FLOAT
    or
    DOUBLE
    for money (use
    DECIMAL
    or integer cents)
  • Missing FK constraints ("the app handles it" means orphaned rows)
  • Dates stored as strings ("2024-01-15" in a
    VARCHAR
    breaks sorting and comparison)
See
schema-patterns.md
for common dimensional modeling patterns.
你可能不需要设计OLTP Schema,但需要评估它是否是数据产品的可靠数据源。
在基于源系统构建前的快速检查清单:
  • 主键已定义并强制执行(而非仅依赖约定)
  • 外键在数据库层面受到约束(而非仅由应用处理)
  • 外键列上存在索引(缺失索引会导致提取速度缓慢)
  • 必填字段设置为NOT NULL(所有字段可空会导致垃圾数据输入)
  • 迁移操作可回滚(无回滚计划的破坏性ALTER TABLE操作不可取)
需立即标记的反模式:
  • 所有字符串列都使用
    VARCHAR(255)
    (表明未考虑数据类型)
  • 使用
    FLOAT
    DOUBLE
    存储金额(应使用
    DECIMAL
    或整数形式的分)
  • 缺失外键约束(“由应用处理”意味着会出现孤立行)
  • 日期以字符串形式存储(
    VARCHAR
    类型的“2024-01-15”会破坏排序和比较功能)
常见维度建模模式请参见
schema-patterns.md