system-design

When this skill is activated, always start your first response with the 🧢 emoji.

System Design

A practical framework for designing distributed systems and architecting scalable services. This skill covers the core building blocks (load balancers, caches, databases, message queues, CDNs, and API gateways) plus the trade-off reasoning required to use them well. It is built around interview scenarios because they compress the full design process into a repeatable structure that also applies to real-world architecture decisions. Agents can use this skill to work through any system design problem, from capacity estimation through detailed component design.

When to use this skill

Trigger this skill when the user:

Asks "how would you design X?" where X is a product or service
Needs to choose between SQL and NoSQL databases
Is evaluating load balancing, sharding, or replication strategies
Asks about the CAP theorem or consistency vs availability trade-offs
Is designing a caching strategy (what to cache, where, how to invalidate)
Needs to estimate traffic, storage, or bandwidth for a system
Is preparing for a system design interview
Asks about rate limiting, API gateways, or CDN placement

Do NOT trigger this skill for:

Line-level code review or specific algorithm implementations (use a coding skill)
DevOps/infrastructure provisioning details like Terraform or Kubernetes manifests

Key principles

Start simple and justify complexity - Design the simplest system that satisfies the requirements. Introduce each new component (queue, cache, shard) only when you can name the specific constraint it solves. Complexity is a cost, not a feature.
Network partitions will happen - choose C or A - CAP theorem says distributed systems must sacrifice either consistency or availability during a partition. You cannot avoid partitions (P is not a choice). Pick CP for financial and inventory data; pick AP for feeds, caches, and preferences.
Scale horizontally, partition vertically - Stateless services scale out behind a load balancer. Data scales by separating hot from cold paths: read replicas before sharding, sharding before multi-region. Vertical scaling buys time; horizontal scaling buys headroom.
Design for failure at every layer - Every service will go down. Every disk will fill. Design fallback behavior before the happy path. Timeouts, retries with backoff, circuit breakers, and bulkheads are not optional refinements - they are table stakes.
Single responsibility for components - A component that does two things will be bad at both. Load balancers balance load. Caches serve reads. Queues decouple producers from consumers. Mixing responsibilities creates invisible coupling that makes the system fragile under load.
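
As one illustration of the "design for failure" principle, here is a minimal sketch of retries with capped exponential backoff and full jitter. The function name `retry_with_backoff` and its parameters are hypothetical, chosen for this example; a production version would usually retry only on errors known to be transient and pair this with timeouts and a circuit breaker.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.1,
    max_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    """Call `operation`, retrying on any exception with capped exponential backoff.

    `sleep` and `rng` are injectable so the behavior can be tested without waiting.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the last error to the caller.
            # Full jitter: wait a random time in [0, min(max_delay, base * 2^attempt)].
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            sleep(rng() * ceiling)
```

The jitter spreads retries from many clients over time, so a recovering dependency is not hit by a synchronized wave of requests.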

Core concepts

System design assembles six core building blocks. Each solves a specific problem.

Load balancers distribute requests across backend instances. L4 balancers route by TCP/IP; L7 balancers route by HTTP path, headers, and cookies. Use L7 for HTTP services. Algorithms: round-robin (default), least-connections (when request latency varies), consistent hashing (when you need sticky routing, e.g., cache affinity).
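
The consistent hashing option can be sketched as a hash ring with virtual nodes. This is a minimal illustration, not a description of any particular load balancer: the `HashRing` class, the `vnodes` count of 100, and the use of SHA-256 are all assumptions made for the example.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Stable 64-bit hash (Python's built-in hash() is randomized per process)."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    """Maps keys to nodes so that removing a node only remaps that node's keys."""

    def __init__(self, nodes, vnodes: int = 100):
        # Each node is placed on the ring `vnodes` times to smooth the distribution.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        if not self._ring:
            raise ValueError("ring has no nodes")
        # First ring point clockwise from the key's hash, wrapping around at the end.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]
```

With plain modulo hashing, changing the node count remaps almost every key; on the ring, keys owned by the surviving nodes stay where they were, which is what preserves cache affinity.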

Caches reduce read latency and database load. They sit in front of the database. Patterns: cache-aside (default), write-through (cache and database are updated together, so reads after a write see fresh data), write-behind (high write throughput, tolerate loss). Key concerns: TTL, invalidation strategy, and stampede prevention. Redis is the default; choose Memcached only for pure key-value caching at massive scale.

Databases are the source of truth. SQL for structured data with ACID transactions; NoSQL for scale, flexible schemas, or specific access patterns. Read replicas for read-heavy workloads. Sharding for write-heavy workloads that exceed one node.

Message queues decouple producers from consumers and absorb traffic spikes. Use for async work, fan-out events, and unreliable downstream dependencies. Always configure a dead-letter queue. SQS for AWS-native work; Kafka for high-throughput event streaming or replay.

CDNs cache static assets and terminate TLS at the edge, close to users. This reduces origin load and cuts latency for geographically distributed users. Use for images, JS/CSS, and any content with a high read-to-write ratio.

API gateways enforce cross-cutting concerns - auth, rate limiting, request logging, TLS termination - at a single entry point. Never build a custom gateway; use Kong, Envoy, or a managed provider.

Common tasks

Design a URL shortener

Clarifying questions: Read-heavy or write-heavy? Need analytics? Custom slugs? Global or single-region?

Components:

API service (stateless, horizontally scaled) behind L7 load balancer
Key generation service - pre-generate Base62 short codes in batches and store in a pool; avoids hot write path
Database - a relational DB works at moderate scale; switch to Cassandra for multi-region or >100k writes/sec
Cache (Redis) - store short_code -> long_url mappings; TTL 24 hours; cache-aside
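
One possible way to pre-generate Base62 codes is to encode a range of unique integer IDs. The sketch below assumes a counter-based scheme; the function names and the alphabet ordering (digits, lowercase, uppercase) are choices made for this example, not something the design above prescribes.

```python
import string

# 62 symbols: 0-9, a-z, A-Z. The ordering is an arbitrary choice for this sketch.
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase


def encode_base62(n: int) -> str:
    """Encode a non-negative integer as a Base62 string."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, remainder = divmod(n, 62)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


def decode_base62(code: str) -> int:
    """Inverse of encode_base62."""
    n = 0
    for ch in code:
        n = n * 62 + ALPHABET.index(ch)
    return n


def generate_batch(start_id: int, size: int) -> list[str]:
    """Pre-generate a batch of short codes from a reserved, contiguous ID range."""
    return [encode_base62(i) for i in range(start_id, start_id + size)]
```

Seven Base62 characters cover 62^7 (about 3.5 trillion) codes. Sequential codes are guessable, so a real service might shuffle or randomize IDs before encoding.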

Redirect flow: Client hits CDN -> cache hit returns 301/302 -> cache miss reads DB -> populates cache -> returns redirect.

Scale signal: 100M URLs stored, 10B reads/day -> cache hit rate must be >99% to protect the DB.
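
The arithmetic behind this scale signal can be checked with a few lines. This is a back-of-envelope sketch: the helper name `db_read_qps` is hypothetical, and it assumes traffic is spread evenly over the day (peak load will be higher).

```python
SECONDS_PER_DAY = 86_400


def db_read_qps(reads_per_day: float, cache_hit_rate: float) -> float:
    """Average read QPS that falls through the cache and reaches the database."""
    average_qps = reads_per_day / SECONDS_PER_DAY
    return average_qps * (1 - cache_hit_rate)


reads_per_day = 10_000_000_000  # 10B reads/day, from the scale signal above
print(f"average read QPS: {reads_per_day / SECONDS_PER_DAY:,.0f}")
for hit_rate in (0.90, 0.99, 0.999):
    print(f"hit rate {hit_rate:.1%}: {db_read_qps(reads_per_day, hit_rate):,.0f} DB reads/sec")
```

10B reads/day averages roughly 116k QPS. At a 90% hit rate the database still sees about 11.6k reads/sec; at 99% that drops to about 1.2k, which is why the hit rate target matters so much.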

Design a rate limiter

Algorithm choices:

Token bucket (default) - allows bursts up to bucket capacity; fills at a constant rate. Best for user-facing APIs.
Fixed window - simple counter per time window. Prone to burst at window edge.
Sliding window log - exact, but memory-intensive.
Sliding window counter - approximation using two fixed windows. Good balance.
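
A minimal single-process sketch of the token bucket algorithm follows. The `TokenBucket` class and its parameter names are illustrative assumptions; it keeps state in memory, so it models the algorithm only and not the shared, Redis-backed counter a multi-node deployment needs.

```python
import time
from typing import Callable


class TokenBucket:
    """Allows bursts up to `capacity`; refills at `refill_rate` tokens per second."""

    def __init__(
        self,
        capacity: float,
        refill_rate: float,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self._clock = clock  # Injectable so tests can control time.
        self._tokens = capacity  # Start full so an initial burst is allowed.
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and consume `cost` tokens if the request may proceed."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill lazily based on elapsed time, never exceeding capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_rate)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

Refilling lazily on each call avoids a background timer: the bucket only needs the last timestamp and the current token count, which is also what makes it cheap to store per user or per rule.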

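Storage: Redis, using INCR and EXPIRE. Each command is atomic on its own; wrap the pair in a Lua script or MULTI/EXEC so a failure between them cannot leave a counter without an expiry. A single Redis node is enough up to ~50k RPS per rule; use Redis Cluster for more.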

Placement: In the API gateway (preferred) or as middleware. Always return X-RateLimit-Remaining and Retry-After headers with 429 responses.

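Distributed concern: With multiple gateway nodes, the counter must be centralized (Redis). Each node's local counter sees only its own share of the traffic, so local counters undercount and let clients exceed the limit.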

Design a notification system

Components:

Notification API - accepts events from internal services
Router service - reads user preferences and determines channels (push, email, SMS)
Channel-specific workers (separate services) - dequeue from per-channel queues
Template service - renders notification copy
Delivery tracking - records sent/delivered/failed per notification

Queue design: One queue per channel (push-queue, email-queue, sms-queue). Isolates failure - SMS provider outage does not back up email delivery.
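
The router and per-channel queues can be sketched with in-memory deques standing in for real queues. The `Router` and `Notification` names and the shape of the preferences mapping are assumptions for this example; a real system would use a message broker and a preferences store.

```python
from collections import deque
from dataclasses import dataclass

CHANNELS = ("push", "email", "sms")


@dataclass(frozen=True)
class Notification:
    user_id: str
    template: str


class Router:
    """Reads user preferences and enqueues a notification on each chosen channel."""

    def __init__(self, preferences: dict[str, set[str]]):
        self._preferences = preferences
        # One queue per channel, so a stalled SMS worker cannot block email.
        self.queues: dict[str, deque] = {channel: deque() for channel in CHANNELS}

    def route(self, notification: Notification) -> list[str]:
        wanted = self._preferences.get(notification.user_id, set())
        targets = [channel for channel in CHANNELS if channel in wanted]
        for channel in targets:
            self.queues[channel].append(notification)
        return targets
```

Because each channel has its own queue, a backlog in one queue grows independently of the others, which is the failure isolation described above.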

Critical path vs non-critical path:

OTP and security alerts: synchronous, priority queue
Marketing and social notifications: async, best-effort, can be batched

Design a chat system

Protocol: WebSockets for real-time bidirectional messaging. Long-polling as fallback for restrictive networks.

Storage split:

Message history: Cassandra, keyed by (channel_id, timestamp). Append-only, high write throughput, easy time-range queries.
User presence and metadata: Redis (in-memory, fast reads).
User and channel info: PostgreSQL (relational, ACID).

Fanout: When a user sends a message, the server writes to the DB and then publishes to a pub/sub channel (Redis Pub/Sub or Kafka). Each recipient's connection server subscribes to relevant channels and pushes to the WebSocket.
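
The fanout flow can be sketched with an in-memory pub/sub standing in for Redis Pub/Sub or Kafka. All class and function names here are hypothetical, the `history` list stands in for the message store, and "pushing to the WebSocket" is modeled as appending to a per-user list.

```python
from collections import defaultdict


class InMemoryPubSub:
    """Stand-in for a real broker: delivers each message to every subscriber."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, channel_id, callback):
        self._subscribers[channel_id].append(callback)

    def publish(self, channel_id, message):
        for callback in self._subscribers[channel_id]:
            callback(message)


class ConnectionServer:
    """Holds user connections and pushes channel messages to them."""

    def __init__(self, pubsub):
        self._pubsub = pubsub
        self._members = defaultdict(set)    # channel_id -> users connected here
        self.delivered = defaultdict(list)  # user_id -> messages pushed to the socket

    def connect(self, user_id, channel_id):
        # Subscribe once per channel, the first time a local user joins it.
        if channel_id not in self._members:
            self._pubsub.subscribe(
                channel_id, lambda msg, ch=channel_id: self._push(ch, msg)
            )
        self._members[channel_id].add(user_id)

    def _push(self, channel_id, message):
        for user_id in self._members[channel_id]:
            self.delivered[user_id].append(message)


def send_message(history, pubsub, channel_id, sender, text):
    """Persist first, then publish, so a delivered message is always in history."""
    message = {"channel_id": channel_id, "sender": sender, "text": text}
    history.append(message)
    pubsub.publish(channel_id, message)
    return message
```

In this sketch the sender also receives their own message; a real client would typically deduplicate by message ID.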

Scale concern: Connection servers are stateful (WebSockets). Route users to the same connection server with consistent hashing. Use a service mesh for connection server discovery.

Choose between SQL vs NoSQL

Use this decision table:

| Need | Choose |
| --- | --- |
| ACID transactions across multiple entities | SQL |
| Complex joins and ad-hoc queries | SQL |
| Strict schema with referential integrity | SQL |
| Horizontal write scaling beyond single node | NoSQL (Cassandra, DynamoDB) |
| Flexible or evolving schema | NoSQL (MongoDB, DynamoDB) |
| Graph traversals | Graph DB (Neo4j) |
| Time-series data at high ingestion rate | TimescaleDB or InfluxDB |
| Key-value at very high throughput | Redis or DynamoDB |

Default: Start with PostgreSQL. It handles far more scale than most teams expect and its JSONB column covers flexible-schema needs up to moderate scale. Migrate to specialized stores when you have a measured bottleneck.

Estimate system capacity

Use the following rough constants in back-of-envelope estimates:

| Metric | Value |
| --- | --- |
| Seconds per day | 86,400 (round to ~100k for mental math) |
| Bytes per ASCII character | 1 |
| Average tweet/post size | ~300 bytes |
| Average image (compressed) | ~300 KB |
| Average video (1 min, 720p) | ~50 MB |
| QPS from 1M DAU, 10 actions/day | ~115 QPS |

Process:

Clarify scale (DAU, requests per user per day)
Derive QPS: (DAU * requests_per_day) / 86400
Derive peak QPS: average QPS * 2-3x
Derive storage: writes_per_day * record_size * retention_days
Derive bandwidth: peak QPS * average_response_size

State assumptions explicitly. Interviewers care about your reasoning, not the exact number.
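
The process above can be turned into a small calculator. This is a sketch under stated assumptions: the `estimate` function and `Estimate` fields are hypothetical names, the default peak multiplier of 3 is the upper end of the 2-3x range, and the 2,000-byte response size in the usage note is an invented input for illustration.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass
class Estimate:
    average_qps: float
    peak_qps: float
    storage_bytes: float
    peak_bandwidth_bytes_per_sec: float


def estimate(
    dau: int,
    requests_per_user_per_day: float,
    writes_per_day: float,
    record_size_bytes: float,
    retention_days: int,
    average_response_bytes: float,
    peak_multiplier: float = 3.0,
) -> Estimate:
    """Back-of-envelope capacity numbers from explicitly stated assumptions."""
    average_qps = dau * requests_per_user_per_day / SECONDS_PER_DAY
    peak_qps = average_qps * peak_multiplier
    return Estimate(
        average_qps=average_qps,
        peak_qps=peak_qps,
        storage_bytes=writes_per_day * record_size_bytes * retention_days,
        peak_bandwidth_bytes_per_sec=peak_qps * average_response_bytes,
    )
```

For example, `estimate(1_000_000, 10, 1_000_000, 300, 365, 2_000)` gives an average just under 116 QPS, a peak of about 347 QPS, and 109.5 GB of storage for one year of 300-byte records.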

Design caching strategy

Step 1 - Identify what to cache:

Expensive reads that change infrequently (user profiles, product catalog)
Computed aggregations (dashboard stats, leaderboards)
Session tokens and auth lookups

Do NOT cache: frequently mutated data, financial balances, anything requiring strong consistency.

Step 2 - Choose pattern:

Default: cache-aside with TTL
Strong read-after-write: write-through
High write throughput, loss acceptable: write-behind

Step 3 - Define invalidation:

TTL expiry for most cases
Explicit DELETE on write for cache-aside
Never try to update a cached value in-place; DELETE then let the next read repopulate

Step 4 - Prevent stampede:

Use a distributed lock (Redis SETNX) for high-traffic keys
Add jitter to TTLs (base TTL +/- 10-20%) to spread expiry
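
The four steps can be combined in one in-process sketch of cache-aside with jittered TTLs, delete-on-write invalidation, and a lock around repopulation. The `CacheAside` class is hypothetical; a dict stands in for Redis, and a single `threading.Lock` stands in for a per-key distributed lock, so this shows the control flow rather than a production cache.

```python
import random
import threading
import time


class CacheAside:
    def __init__(self, load_from_db, base_ttl=300.0, jitter=0.2,
                 clock=time.monotonic, rng=random.random):
        self._load = load_from_db
        self._base_ttl = base_ttl
        self._jitter = jitter  # 0.2 means TTLs vary by +/- 20%.
        self._clock = clock
        self._rng = rng
        self._entries = {}  # key -> (value, expires_at)
        # Simplification: one global lock instead of a per-key distributed lock.
        self._lock = threading.Lock()

    def _fresh(self, key):
        entry = self._entries.get(key)
        if entry is not None and entry[1] > self._clock():
            return entry
        return None

    def get(self, key):
        entry = self._fresh(key)
        if entry is not None:
            return entry[0]  # Cache hit.
        with self._lock:
            # Re-check: another caller may have repopulated while we waited.
            entry = self._fresh(key)
            if entry is not None:
                return entry[0]
            value = self._load(key)  # Only one caller reaches the database.
            ttl = self._base_ttl * (1 + self._jitter * (2 * self._rng() - 1))
            self._entries[key] = (value, self._clock() + ttl)
            return value

    def invalidate(self, key):
        """Call after a write: delete, and let the next read repopulate."""
        self._entries.pop(key, None)
```

The re-check inside the lock is what prevents a stampede: callers that queued behind the first miss find the fresh entry and skip the database.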

Anti-patterns / common mistakes

| Mistake | Why it's wrong | What to do instead |
| --- | --- | --- |
| Designing without clarifying requirements | You optimize for the wrong bottleneck and miss key constraints | Always spend 5 minutes on scope: scale, consistency needs, latency SLAs |
| Sharding before replication | Sharding is complex and expensive; replication + caching handles most read bottlenecks | Add read replicas and caching first; only shard when writes are the bottleneck |
| Shared database between services | Creates hidden coupling; one service's slow query can kill another | One database per service; expose data through APIs or events |
| Cache without invalidation plan | Stale reads cause data inconsistency; cache-DB drift grows silently | Define TTL and invalidation triggers before adding any cache |
| Ignoring the tail: all QPS estimates as average | p99 latency matters more than p50; a 2x peak multiplier is the minimum | Always model peak QPS (2-3x average) and design capacity for it |
| Single point of failure at every layer | Load balancer with no standby, single queue broker, one region | Identify SPOFs explicitly; add redundancy for any component whose failure kills the system |

References

For detailed frameworks and opinionated defaults, read the relevant file from the references/ folder:

references/interview-framework.md - step-by-step interview process (RESHADED), time allocation, common follow-up questions, and how to communicate trade-offs

Only load the references file when the task requires it - it is long and will consume context.
