nosql-expert

NoSQL Expert Patterns (Cassandra & DynamoDB)

Overview

This skill provides professional mental models and design patterns for distributed wide-column and key-value stores (specifically Apache Cassandra and Amazon DynamoDB).

Unlike SQL databases (where you model data entities) or document stores (like MongoDB), these distributed systems require you to model your queries first.

When to Use

Designing for Scale: Moving beyond simple single-node databases to distributed clusters.
Technology Selection: Evaluating or using Cassandra, ScyllaDB, or DynamoDB.
Performance Tuning: Troubleshooting "hot partitions" or high latency in existing NoSQL systems.
Microservices: Implementing "database-per-service" patterns where highly optimized reads are required.

The Mental Shift: SQL vs. Distributed NoSQL

| Feature | SQL (Relational) | Distributed NoSQL (Cassandra/DynamoDB) |
| --- | --- | --- |
| Data modeling | Model Entities + Relationships | Model Queries (Access Patterns) |
| Joins | CPU-intensive, at read time | Pre-computed (Denormalized) at write time |
| Storage cost | Expensive (minimize duplication) | Cheap (duplicate data for read speed) |
| Consistency | ACID (Strong) | BASE (Eventual) / Tunable |
| Scalability | Vertical (Bigger machine) | Horizontal (More nodes/shards) |

The Golden Rule: In SQL, you design the data model to answer any query. In NoSQL, you design the data model to answer specific queries efficiently.

Core Design Patterns

1. Query-First Modeling (Access Patterns)

You typically cannot "add a query later" without migration or creating a new table/index.

Process:

List all Entities (User, Order, Product).
List all Access Patterns ("Get User by Email", "Get Orders by User sorted by Date").
Design Table(s) specifically to serve those patterns with a single lookup.
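
The process above can be sketched as a small coverage check. This is only an illustration: the `TableDesign` class, the table names, and the `uncovered` helper are hypothetical, not part of any driver or SDK.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TableDesign:
    """One table (or index) designed to answer one access pattern."""
    name: str
    partition_key: str
    sort_key: Optional[str] = None


# Hypothetical mapping: each access pattern points at the table that serves it.
ACCESS_PATTERNS = {
    "Get User by Email": TableDesign("users_by_email", "email"),
    "Get Orders by User sorted by Date": TableDesign(
        "orders_by_user", "user_id", "order_date"
    ),
}


def uncovered(patterns, required):
    """Return the required access patterns that no table serves yet."""
    return [p for p in required if p not in patterns]


missing = uncovered(
    ACCESS_PATTERNS,
    ["Get User by Email", "Get Orders by Status"],
)
print(missing)
```

A non-empty result is the signal that a new table or index (and usually a backfill) is needed before that query can be served.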

2. The Partition Key is King

Data is distributed across physical nodes based on the Partition Key (PK).

Goal: Even distribution of data and traffic.
Anti-Pattern: Using a low-cardinality PK (e.g., `status="active"` or `gender="m"`) creates Hot Partitions, limiting throughput to a single node's capacity.
Best Practice: Use high-cardinality keys (User IDs, Device IDs, Composite Keys).
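
To make the skew concrete, here is a rough simulation. It assumes a simple hash-modulo placement as a stand-in for a real partitioner (Cassandra and DynamoDB use their own hashing and token ranges), and the node count and key formats are made up for illustration.

```python
import hashlib
from collections import Counter


def node_for(partition_key: str, num_nodes: int) -> int:
    """Map a partition key to a node index with a stable hash.

    Simplified stand-in for a real partitioner, for illustration only.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def load_per_node(keys, num_nodes):
    """Count how many requests land on each node."""
    return Counter(node_for(k, num_nodes) for k in keys)


NUM_NODES = 6  # assumed cluster size

# Low cardinality: every request uses one of only two status values.
low = load_per_node(
    [f"status={'active' if i % 10 else 'inactive'}" for i in range(10_000)],
    NUM_NODES,
)

# High cardinality: every request uses a distinct user ID.
high = load_per_node([f"user-{i}" for i in range(10_000)], NUM_NODES)

print("low-cardinality: nodes used =", len(low), "busiest =", max(low.values()))
print("high-cardinality: nodes used =", len(high), "busiest =", max(high.values()))
```

With only two distinct keys, at most two nodes can ever receive traffic, no matter how many nodes the cluster has.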

3. Clustering / Sort Keys

Within a partition, data is sorted on disk by the Clustering Key (Cassandra) or Sort Key (DynamoDB).

This allows for efficient Range Queries (e.g., `WHERE user_id=X AND date > Y`).
It effectively pre-sorts your data for specific retrieval requirements.
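
A toy in-memory model may help show why the range query is cheap: rows inside a partition are kept sorted by clustering key, so a range read is a binary search followed by a contiguous slice. The `ToyPartitionedTable` class below is a hypothetical illustration, not how either database is implemented.

```python
import bisect


class ToyPartitionedTable:
    """Toy store: each partition keeps its rows sorted by clustering key."""

    def __init__(self):
        self._keys = {}  # partition key -> sorted list of clustering keys
        self._rows = {}  # partition key -> rows, in the same order

    def insert(self, pk, ck, row):
        keys = self._keys.setdefault(pk, [])
        rows = self._rows.setdefault(pk, [])
        i = bisect.bisect_right(keys, ck)  # position that keeps keys sorted
        keys.insert(i, ck)
        rows.insert(i, row)

    def range_after(self, pk, ck_lower):
        """Rows in one partition whose clustering key is > ck_lower."""
        keys = self._keys.get(pk, [])
        start = bisect.bisect_right(keys, ck_lower)
        return self._rows.get(pk, [])[start:]


table = ToyPartitionedTable()
for day, total in [("2024-03-01", 30), ("2024-01-01", 10), ("2024-02-01", 20)]:
    table.insert("user-1", day, {"date": day, "total": total})

# Equivalent in spirit to: WHERE user_id='user-1' AND date > '2024-01-15'
recent = table.range_after("user-1", "2024-01-15")
print([row["date"] for row in recent])
```

The query never touches other partitions and never sorts at read time; the ordering cost was paid on insert.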

4. Single-Table Design (Adjacency Lists)

Primary use: DynamoDB (but concepts apply elsewhere)

Storing multiple entity types in one table to enable pre-joined reads.

| PK (Partition) | SK (Sort) | Data Fields... |
| --- | --- | --- |
| `USER#123` | `PROFILE` | `{ name: "Ian", email: "..." }` |
| `USER#123` | `ORDER#998` | `{ total: 50.00, status: "shipped" }` |
| `USER#123` | `ORDER#999` | `{ total: 12.00, status: "pending" }` |

Query: `PK="USER#123"`

Result: Fetches User Profile AND all Orders in one network request.
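
The read pattern on the table above can be imitated in plain Python. This sketch uses an in-memory list rather than a real DynamoDB call; the `query` helper and the extra `USER#456` item are hypothetical. The sort-key prefix filter mirrors what a `begins_with` key condition does in DynamoDB.

```python
ITEMS = [
    {"PK": "USER#123", "SK": "PROFILE", "name": "Ian"},
    {"PK": "USER#123", "SK": "ORDER#998", "total": 50.00, "status": "shipped"},
    {"PK": "USER#123", "SK": "ORDER#999", "total": 12.00, "status": "pending"},
    {"PK": "USER#456", "SK": "PROFILE", "name": "Ana"},  # hypothetical second user
]


def query(items, pk, sk_prefix=""):
    """Return one partition's items, optionally narrowed by sort-key prefix.

    Results come back in sort-key order, as they would from a key-ordered store.
    """
    hits = [it for it in items if it["PK"] == pk and it["SK"].startswith(sk_prefix)]
    return sorted(hits, key=lambda it: it["SK"])


everything = query(ITEMS, "USER#123")            # profile + orders together
orders_only = query(ITEMS, "USER#123", "ORDER#")  # only the order items

print([it["SK"] for it in everything])
print([it["SK"] for it in orders_only])
```

Because entity types share a partition and are distinguished by sort-key prefix, one request can return the "joined" view or just one entity type.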

5. Denormalization & Duplication

Don't be afraid to store the same data in multiple tables to serve different query patterns.

Table A: `users_by_id` (PK: uuid)
Table B: `users_by_email` (PK: email)

Trade-off: You must manage data consistency across tables (often using eventual consistency or batch writes).
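
A minimal sketch of that trade-off, using two dictionaries as stand-ins for `users_by_id` and `users_by_email`. The `create_user` and `change_email` helpers are hypothetical; in a real system the two writes would go through a batch or be reconciled asynchronously.

```python
import uuid

users_by_id = {}     # stand-in for the table keyed by uuid
users_by_email = {}  # stand-in for the table keyed by email


def create_user(name, email):
    """Write the same user to both tables."""
    user_id = str(uuid.uuid4())
    row = {"id": user_id, "name": name, "email": email}
    users_by_id[user_id] = dict(row)
    users_by_email[email] = dict(row)
    return user_id


def change_email(user_id, new_email):
    """Update every copy. The key of users_by_email changes, so the old
    entry must be deleted and a new one inserted."""
    row = users_by_id[user_id]
    old_email = row["email"]
    row["email"] = new_email
    users_by_email.pop(old_email, None)
    users_by_email[new_email] = dict(row)


uid = create_user("Ian", "ian@example.com")
change_email(uid, "ian@new.example.com")
print(sorted(users_by_email))
```

If the process fails between the two writes, the tables disagree until something repairs them, which is exactly the consistency burden the trade-off describes.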

Specific Guidance

Apache Cassandra / ScyllaDB

Primary Key Structure: `((Partition Key), Clustering Columns)`
No Joins, Limited Aggregates: CQL has no `JOIN`, and `GROUP BY` works only on primary key columns in their declared order. Pre-calculate aggregates at write time, for example in a separate counter table.
Avoid `ALLOW FILTERING`: If you see this in production, your data model is probably wrong. Unless the query is restricted to a single partition, it implies a scan across the cluster.
Writes are Cheap: Inserts and Updates are appended to the commit log and memtable, then flushed to immutable SSTables (an LSM tree). Don't worry about write volume as much as read efficiency.
Tombstones: Deletes write marker records (tombstones) that reads must skip until compaction purges them. Avoid high-velocity delete patterns (like queues) in standard tables.
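
The "pre-calculate aggregates" advice can be sketched as follows. The in-memory `events` list and `orders_per_status` counter are hypothetical stand-ins for a raw table and a separate counter table; the point is only that the aggregate is maintained on the write path.

```python
from collections import Counter

events = []                    # stand-in for the raw orders table
orders_per_status = Counter()  # stand-in for a separate counter table


def record_order(order_id, status):
    """Write the raw row and bump the pre-computed aggregate together."""
    events.append({"order_id": order_id, "status": status})
    orders_per_status[status] += 1


def count_by_status(status):
    """Single-key read; no grouping over the raw rows at read time."""
    return orders_per_status[status]


for order_id, status in [(1, "shipped"), (2, "pending"), (3, "shipped")]:
    record_order(order_id, status)

print(count_by_status("shipped"), count_by_status("pending"))
```

The read is a single-key lookup regardless of how many raw rows exist, at the cost of one extra write per event.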

AWS DynamoDB

GSI (Global Secondary Index): Use GSIs to create alternative views of your data (e.g., "Search Orders by Date" instead of by User).
- Note: GSIs are eventually consistent.
LSI (Local Secondary Index): Sorts data differently within the same partition. Must be created at table creation time.
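WCU / RCU: Understand capacity modes (provisioned vs. on-demand) and how item size drives consumed read and write capacity units. Single-table design helps optimize consumed capacity units.
TTL: Use Time-To-Live attributes to automatically expire old data. Expired items are removed in the background without consuming write capacity (a free delete), and there are no tombstones to manage. Deletion is not immediate, so filter out expired items at read time if exact expiry matters.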
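
As a rough guide to how item size drives capacity, the arithmetic below follows DynamoDB's documented unit sizes: one RCU covers a strongly consistent read of up to 4 KB (an eventually consistent read costs half), and one WCU covers a write of up to 1 KB. The function names are hypothetical, and transactional operations (which cost more) are left out.

```python
import math


def read_capacity_units(item_bytes, strongly_consistent=True):
    """RCUs consumed by reading one item of the given size."""
    units = max(1, math.ceil(item_bytes / 4096))  # reads round up to 4 KB
    return units if strongly_consistent else units / 2


def write_capacity_units(item_bytes):
    """WCUs consumed by writing one item of the given size."""
    return max(1, math.ceil(item_bytes / 1024))  # writes round up to 1 KB


def ttl_epoch_seconds(now_epoch, days):
    """Value for a TTL attribute: expiry time as epoch seconds."""
    return now_epoch + days * 86_400


print(read_capacity_units(6000), read_capacity_units(6000, strongly_consistent=False))
print(write_capacity_units(2500))
print(ttl_epoch_seconds(1_700_000_000, 30))
```

Small items are cheap on both paths, which is one reason to keep frequently written attributes out of large items.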

Expert Checklist

Before finalizing your NoSQL schema:

Access Pattern Coverage: Does every query pattern map to a specific table or index?
Cardinality Check: Does the Partition Key have enough unique values to spread traffic evenly?
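Split Partition Risk: For any single partition (e.g., a single user's orders), will it grow indefinitely? DynamoDB caps an item collection at 10 GB when the table has an LSI, and Cassandra partitions are commonly kept far smaller (on the order of 100 MB). If growth is unbounded, you need to "shard" the partition, e.g., `USER#123#2024-01`.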
Consistency Requirement: Can the application tolerate eventual consistency for this read pattern?
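
The time-bucketed key from the Split Partition Risk item can be sketched like this. Monthly buckets and the helper names are assumptions for illustration; the right bucket size depends on how fast a partition grows.

```python
from datetime import date


def bucketed_pk(user_id, day):
    """Partition key with a monthly bucket suffix, e.g. USER#123#2024-01."""
    return f"USER#{user_id}#{day:%Y-%m}"


def buckets_between(user_id, start, end):
    """All monthly partition keys a reader must query to cover start..end."""
    keys = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        keys.append(f"USER#{user_id}#{year:04d}-{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return keys


print(bucketed_pk("123", date(2024, 1, 15)))
print(buckets_between("123", date(2023, 11, 20), date(2024, 1, 5)))
```

The cost of bucketing is on the read side: a query spanning several months becomes several partition reads that the application merges.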

Common Anti-Patterns

❌ Scatter-Gather: Querying all partitions to find one item (Scan).
❌ Hot Keys: Putting all "Monday" data into one partition.
❌ Relational Modeling: Creating `Author` and `Book` tables and trying to join them in code. (Instead, embed Book summaries in Author, or duplicate Author info in Books).
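
One common way to defuse a hot key such as "Monday" is to spread its writes over a fixed number of suffixed keys and fan reads out across them. The shard count and helper names below are assumptions for illustration.

```python
import random

NUM_SHARDS = 8  # assumed; pick based on the write rate of the hot key


def sharded_key(base, rng=random):
    """Pick one of NUM_SHARDS suffixed keys for a write, e.g. Monday#3."""
    return f"{base}#{rng.randrange(NUM_SHARDS)}"


def all_shards(base):
    """Every suffixed key a reader must query to see all of the data."""
    return [f"{base}#{i}" for i in range(NUM_SHARDS)]


write_key = sharded_key("Monday", random.Random(0))
print(write_key)
print(all_shards("Monday"))
```

Reads now touch a small, known set of partitions, which is bounded fan-out rather than the unbounded scatter-gather listed above.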
