data-systems-architecture
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
ChineseData Systems Architecture
数据系统架构
Overview
概述
Core principle: Good data system architecture balances reliability (correct operation under faults), scalability (handling growth gracefully), and maintainability (enabling productive change over time). Every architectural decision involves trade-offs between these concerns.
This skill synthesizes knowledge from three foundational texts:
- Designing Data-Intensive Applications (Kleppmann) - distributed systems, storage engines, scaling
- The Art of PostgreSQL (Fontaine) - PostgreSQL-specific patterns, SQL as programming
- PostgreSQL Query Optimization (Dombrovskaya et al.) - execution plans, performance tuning
核心原则:优秀的数据系统架构需平衡可靠性(故障下仍能正确运行)、可扩展性(从容应对业务增长)和可维护性(支持长期高效迭代变更)。每一项架构决策都需要在这些关注点之间进行权衡。
本技能整合了三本经典著作的知识:
- 《Designing Data-Intensive Applications》(Kleppmann)- 分布式系统、存储引擎、扩展方案
- 《The Art of PostgreSQL》(Fontaine)- PostgreSQL专属模式、SQL编程
- 《PostgreSQL Query Optimization》(Dombrovskaya等)- 执行计划、性能调优
When to Use
适用场景
| Symptom | Start With |
|---|---|
| Designing a new database/schema | |
| Normalization vs denormalization decisions | |
| Need to understand OLTP vs OLAP | |
| Slow queries, index selection | |
| Planning for growth, read replicas | |
| Race conditions, deadlocks, isolation issues | |
| N+1 queries, ORM problems, application integration | |
| 场景 | 开始阅读 |
|---|---|
| 设计新数据库/模式 | |
| 规范化与反规范化决策 | |
| 需要理解OLTP与OLAP的区别 | |
| 查询缓慢、索引选择 | |
| 规划增长方案、只读副本 | |
| 竞态条件、死锁、隔离级别问题 | |
| N+1查询、ORM问题、应用集成 | |
Navigation
导航
Reference Files (Load as needed)
参考文件(按需加载)
01-foundational-principles.md - Reliability/Scalability/Maintainability, load parameters
02-data-modeling.md - Normalization, denormalization, schema design patterns
03-storage-engines.md - B-trees, LSM-trees, OLTP vs OLAP, PostgreSQL internals
04-indexing.md - Index types, compound indexes, covering indexes, maintenance
05-scaling-patterns.md - Replication, partitioning, sharding strategies
06-transactions-concurrency.md - ACID, isolation levels, MVCC, locking patterns
07-application-integration.md - ORM pitfalls, N+1, business logic placement, batch processing01-foundational-principles.md - 可靠性/可扩展性/可维护性、负载参数
02-data-modeling.md - 规范化、反规范化、模式设计模式
03-storage-engines.md - B-trees、LSM-trees、OLTP vs OLAP、PostgreSQL内部机制
04-indexing.md - 索引类型、复合索引、覆盖索引、维护
05-scaling-patterns.md - 复制、分区、分片策略
06-transactions-concurrency.md - ACID、隔离级别、MVCC、锁模式
07-application-integration.md - ORM陷阱、N+1问题、业务逻辑位置、批处理Quick Decision Framework
快速决策框架
New system design?
├─ Yes → Read 01, then 02 for data model
└─ No → What's the problem?
├─ "Queries are slow" → Read 04 (indexing) + 03 (storage patterns)
├─ "Data is inconsistent" → Read 02 (modeling) + 06 (transactions)
├─ "Can't handle the load" → Read 05 (scaling) + 03 (OLTP vs OLAP)
├─ "App makes too many queries" → Read 07 (N+1, ORM patterns)
└─ "Race conditions/deadlocks" → Read 06 (concurrency)新系统设计?
├─ 是 → 先阅读01,再阅读02了解数据模型
└─ 否 → 问题是什么?
├─ "查询缓慢" → 阅读04(索引) + 03(存储模式)
├─ "数据不一致" → 阅读02(建模) + 06(事务)
├─ "无法承载负载" → 阅读05(扩展) + 03(OLTP vs OLAP)
├─ "应用查询次数过多" → 阅读07(N+1、ORM模式)
└─ "竞态条件/死锁" → 阅读06(并发)Core Concepts (Quick Reference)
核心概念(快速参考)
The Three Pillars
三大支柱
| Concern | Definition | Key Question |
|---|---|---|
| Reliability | System works correctly under faults | What happens when things fail? |
| Scalability | Handles growth gracefully | What's 10x load look like? |
| Maintainability | Easy to operate and evolve | Can new engineers understand this? |
| 关注点 | 定义 | 关键问题 |
|---|---|---|
| 可靠性 | 系统在故障下仍能正确运行 | 出现故障时会发生什么? |
| 可扩展性 | 从容应对业务增长 | 10倍负载下系统表现如何? |
| 可维护性 | 易于运维和迭代 | 新工程师能快速理解吗? |
Data Model Selection
数据模型选择
| Model | Best For | Avoid When |
|---|---|---|
| Relational | Many-to-many relationships, joins, consistency | Highly hierarchical data, constant schema changes |
| Document | Self-contained docs, tree structures | Need for joins, many-to-many |
| Graph | Highly connected data, recursive queries | Simple CRUD, no relationship traversal |
| 模型 | 适用场景 | 避免场景 |
|---|---|---|
| 关系型 | 多对多关系、关联查询、一致性要求高 | 高度层级化数据、频繁变更模式 |
| 文档型 | 自包含文档、树形结构 | 需要关联查询、多对多关系 |
| 图型 | 高度关联数据、递归查询 | 简单CRUD操作、无需关系遍历 |
OLTP vs OLAP
OLTP vs OLAP
| Aspect | OLTP | OLAP |
|---|---|---|
| Query pattern | Point lookups, few rows | Aggregates, many rows |
| Optimization | Index everything used in WHERE | Fewer indexes, full scans OK |
| Storage | Row-oriented | Consider column-oriented |
| 维度 | OLTP | OLAP |
|---|---|---|
| 查询模式 | 点查询、少量行 | 聚合查询、大量行 |
| 优化方向 | 为WHERE子句中用到的字段建立索引 | 减少索引、允许全表扫描 |
| 存储方式 | 行式存储 | 考虑列式存储 |
Index Type Quick Reference
索引类型快速参考
| Type | Use Case | PostgreSQL |
|---|---|---|
| B-tree | Equality, range, sorting | Default, most queries |
| Hash | Equality only | Faster for exact match |
| GIN | Arrays, JSONB, full-text | |
| GiST | Geometric, range types | PostGIS, nearest-neighbor |
| BRIN | Large, naturally ordered tables | Time-series data |
| 类型 | 适用场景 | PostgreSQL支持情况 |
|---|---|---|
| B-tree | 等值查询、范围查询、排序 | 默认类型,适用于大多数查询 |
| Hash | 仅等值查询 | 精确匹配速度更快 |
| GIN | 数组、JSONB、全文检索 | 支持 |
| GiST | 几何类型、范围类型 | 支持PostGIS、最近邻查询 |
| BRIN | 大型、自然有序的表 | 时间序列数据 |
Isolation Levels
隔离级别
| Level | Prevents | PostgreSQL Default? |
|---|---|---|
| Read Committed | Dirty reads | Yes |
| Repeatable Read | + Non-repeatable reads | No |
| Serializable | All anomalies | No (uses SSI) |
| 级别 | 防止问题 | 是否为PostgreSQL默认? |
|---|---|---|
| Read Committed | 脏读 | 是 |
| Repeatable Read | + 不可重复读 | 否 |
| Serializable | 所有异常 | 否(使用SSI) |
Design Checklist
设计检查清单
Before finalizing a data architecture:
- Identified load parameters (read/write ratio, data volume, latency requirements)
- Chose appropriate data model (relational/document/graph hybrid?)
- Normalized to 3NF first, denormalized only with measured justification
- Designed indexes for actual query patterns (not hypothetical)
- Considered 10x growth scenario
- Established isolation level requirements
- Defined where business logic lives (app vs DB vs both)
- Planned for operations (backups, monitoring, migrations)
在最终确定数据架构前:
- 已明确负载参数(读写比、数据量、延迟要求)
- 选择了合适的数据模型(关系型/文档型/图型混合?)
- 先规范化到3NF,仅在有实测依据时才反规范化
- 根据实际查询模式设计索引(而非假设场景)
- 考虑了10倍增长的场景
- 确定了隔离级别要求
- 定义了业务逻辑的位置(应用/数据库/两者结合)
- 规划了运维方案(备份、监控、迁移)
References
参考资料
- Kleppmann, M. Designing Data-Intensive Applications (O'Reilly, 2017)
- Fontaine, D. The Art of PostgreSQL (2nd ed., 2020)
- Dombrovskaya, H., Novikov, B., Bailliekova, A. PostgreSQL Query Optimization (Apress, 2021)
- Kleppmann, M. 《Designing Data-Intensive Applications》(O'Reilly, 2017)
- Fontaine, D. 《The Art of PostgreSQL》(第2版, 2020)
- Dombrovskaya, H., Novikov, B., Bailliekova, A. 《PostgreSQL Query Optimization》(Apress, 2021)