system-architecture

System Architecture Expert

When to use this Skill

Use this Skill when:

Designing distributed systems
Writing system design documentation
Preparing for system design interviews
Creating architecture diagrams
Analyzing trade-offs between design choices
Reviewing or improving existing system designs

System Design Framework

1. Requirements Gathering (5-10 minutes)

Functional Requirements:

What are the core features?
What actions can users perform?
What are the inputs and outputs?

Non-Functional Requirements:

Scale: How many users? How much data?
Performance: Latency requirements? (p50, p95, p99)
Availability: What uptime is needed? (99.9%, 99.99%)
Consistency: Strong or eventual consistency?
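To make the availability targets concrete, the sketch below converts an uptime percentage into a yearly downtime budget. The function name `downtime_per_year_seconds` is illustrative, and the calculation assumes a 365-day year.

```python
# Convert an availability target into an allowed-downtime budget.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: ignores leap years


def downtime_per_year_seconds(availability_percent: float) -> float:
    """Return the seconds of downtime per year permitted by an uptime target."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


for target in (99.9, 99.99):
    hours = downtime_per_year_seconds(target) / 3600
    print(f"{target}% uptime allows about {hours:.2f} hours of downtime per year")
```

Each extra "nine" cuts the budget by a factor of ten, which is why the target should be agreed on before choosing a replication or failover design.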

Constraints:

Budget limitations
Technology stack constraints
Team expertise
Timeline

Example Questions:

- How many daily active users?
- What's the read:write ratio?
- What's the average data size?
- What's the peak load vs average load?
- Do we need real-time updates?
- Can we tolerate data loss?

2. Capacity Estimation (Back-of-the-envelope)

Calculate:

Traffic:
- DAU = 100M users
- Each user makes 10 requests/day
- QPS = 100M * 10 / 86400 ≈ 11,574 QPS
- Peak QPS = 2-3x average ≈ 30,000 QPS

Storage:
- 100M users * 1KB per user = 100GB
- With 3x replication = 300GB
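- Growth: depends on new data per day, not on the current total; if 300GB of new (replicated) data arrived every day, that would be 300GB * 365 days = 109.5TB/year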

Bandwidth:
- QPS * average request size
- 11,574 * 10KB = 115.74MB/s

Memory/Cache:

80-20 rule: 20% of data gets 80% of traffic
Cache = 20% of total data for hot data
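The estimates above can be reproduced with a few lines of arithmetic. This is a rough sketch using decimal units (1 GB = 1,000,000 KB); the function name `estimate`, its parameter names, and the peak factor of 2.5 (the midpoint of the 2-3x range) are assumptions made for illustration.

```python
SECONDS_PER_DAY = 86_400


def estimate(dau: int, requests_per_user: int, request_kb: float,
             record_kb: float, replication: int, peak_factor: float) -> dict:
    """Back-of-the-envelope traffic, storage, bandwidth, and cache numbers."""
    avg_qps = dau * requests_per_user / SECONDS_PER_DAY
    storage_gb = dau * record_kb / 1_000_000  # KB -> GB, decimal units
    return {
        "avg_qps": avg_qps,
        "peak_qps": avg_qps * peak_factor,
        "storage_gb": storage_gb,
        "replicated_gb": storage_gb * replication,
        "bandwidth_mb_s": avg_qps * request_kb / 1_000,  # KB/s -> MB/s
        "cache_gb": storage_gb * 0.2,  # 80-20 rule applied to one copy of the data
    }


numbers = estimate(dau=100_000_000, requests_per_user=10, request_kb=10,
                   record_kb=1, replication=3, peak_factor=2.5)
for name, value in numbers.items():
    print(f"{name}: {value:,.2f}")
```

Keeping the inputs as named parameters makes it easy to rerun the estimate when an interviewer or stakeholder changes an assumption.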

3. High-Level Design

Core Components:

Client Layer (Web, Mobile, Desktop)
API Gateway / Load Balancer
Application Servers (Business logic)
Cache Layer (Redis, Memcached)
Database (SQL, NoSQL, or both)
Message Queue (Kafka, RabbitMQ)
Object Storage (S3, GCS)
CDN (CloudFront, Akamai)

Draw Architecture:

[Clients] → [CDN]
            ↓
        [Load Balancer]
            ↓
    [Application Servers]
        ↙     ↓     ↘
   [Cache] [DB] [Queue] → [Workers]
                            ↓
                      [Object Storage]

4. Database Design

SQL vs NoSQL Decision:

Use SQL when:

ACID transactions required
Complex queries with JOINs
Structured data with relationships
Examples: PostgreSQL, MySQL

Use NoSQL when:

Massive scale (horizontal scaling)
Flexible schema
High write throughput
Examples: Cassandra, DynamoDB, MongoDB

Sharding Strategy:

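Hash-based: `user_id % num_shards` (simple, but changing num_shards remaps most keys)
Range-based: Users 1-100M on shard 1
Geographic: US users on US shard
Consistent hashing: Limits how many keys move when shards are added or removed; virtual nodes even out the distribution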
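To illustrate the difference between modulo sharding and consistent hashing, here is a minimal hash ring with virtual nodes. The class name `HashRing`, the shard labels, and the choice of SHA-256 are assumptions for this sketch; production systems usually rely on a library or the datastore's built-in partitioner.

```python
import bisect
import hashlib


class HashRing:
    """Consistent-hash ring with virtual nodes (a simplified sketch)."""

    def __init__(self, shards, vnodes=100):
        self._ring = []  # (hash, shard) points, sorted by hash
        for shard in shards:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{shard}#{i}"), shard))
        self._ring.sort()
        self._hashes = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.sha256(key.encode()).hexdigest(), 16)

    def shard_for(self, key: str) -> str:
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect_right(self._hashes, self._hash(key)) % len(self._hashes)
        return self._ring[idx][1]


old_ring = HashRing(["shard-1", "shard-2", "shard-3"])
new_ring = HashRing(["shard-1", "shard-2", "shard-3", "shard-4"])
keys = [f"user:{i}" for i in range(10_000)]
moved = sum(old_ring.shard_for(k) != new_ring.shard_for(k) for k in keys)
print(f"keys moved after adding a shard: {moved / len(keys):.0%}")
```

With plain `user_id % num_shards`, going from 3 to 4 shards remaps roughly three quarters of uniformly distributed keys; on the ring, only keys that land on the new shard move.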

Schema Design:

sql

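-- Example: URL Shortener (PostgreSQL syntax)
CREATE TABLE urls (
    id BIGSERIAL PRIMARY KEY,
    short_url VARCHAR(10) UNIQUE NOT NULL,  -- UNIQUE already creates an index
    long_url TEXT NOT NULL,
    user_id BIGINT,
    created_at TIMESTAMP DEFAULT NOW(),
    expires_at TIMESTAMP,
    click_count INT DEFAULT 0
);

-- PostgreSQL has no inline INDEX clause in CREATE TABLE;
-- secondary indexes are created separately.
CREATE INDEX idx_urls_user_id ON urls (user_id);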

5. Deep Dive Components

Caching Strategy:

Cache-Aside: App reads from cache, loads from DB on miss
Write-Through: Write to cache and DB together
Write-Behind: Write to cache, async write to DB
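As an illustration of the cache-aside strategy, the sketch below keeps an in-memory dictionary in front of a loader function. The names `CacheAside` and `load_from_db` are hypothetical, and the dictionary stands in for a real cache such as Redis; a per-entry TTL and explicit invalidation are included as one common way to limit stale reads.

```python
import time


class CacheAside:
    """Read-through-on-miss cache in front of a slower data source."""

    def __init__(self, load_from_db, ttl_seconds=60.0, clock=time.monotonic):
        self._load = load_from_db  # callable: key -> value (hypothetical DB read)
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is not None and entry[1] > self._clock():
            return entry[0]  # cache hit
        value = self._load(key)  # cache miss or expired: read from the database
        self._entries[key] = (value, self._clock() + self._ttl)
        return value

    def invalidate(self, key):
        """Call after writing to the database so the next read reloads."""
        self._entries.pop(key, None)


database = {"user:1": "Ada"}
cache = CacheAside(database.__getitem__, ttl_seconds=30)
print(cache.get("user:1"))  # first call loads from the database
print(cache.get("user:1"))  # second call is served from the cache
```

Invalidating on write rather than updating the cached value avoids racing writers leaving an older value in the cache.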

Eviction Policies:

LRU (Least Recently Used) - Most common
LFU (Least Frequently Used)
TTL (Time To Live)
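LRU eviction can be sketched with the standard library's `OrderedDict`, which remembers insertion order and can move a key to the end cheaply. The class name `LRUCache` is illustrative; this is a single-threaded sketch, not a drop-in replacement for a cache server's eviction logic.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used key."""

    def __init__(self, capacity: int):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._capacity = capacity
        self._items = OrderedDict()  # oldest use first, newest use last

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self._capacity:
            self._items.popitem(last=False)  # evict the least recently used key


cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")      # "a" is now the most recently used key
cache.put("c", 3)   # capacity exceeded, so "b" is evicted
print(cache.get("b"))
```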

Load Balancing:

Round Robin: Simple, equal distribution
Least Connections: Route to least busy server
Consistent Hashing: Minimize redistribution
Weighted: Based on server capacity
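The first two strategies are simple enough to sketch directly. The class names `RoundRobin` and `LeastConnections` and the server labels are hypothetical; a real load balancer would also handle health checks and concurrent access.

```python
import itertools


class RoundRobin:
    """Hand out servers in a fixed rotation."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(list(servers))

    def pick(self):
        return next(self._cycle)


class LeastConnections:
    """Route each request to the server with the fewest active connections."""

    def __init__(self, servers):
        self._active = {server: 0 for server in servers}

    def acquire(self):
        # Ties go to the server listed first.
        server = min(self._active, key=self._active.get)
        self._active[server] += 1
        return server

    def release(self, server):
        self._active[server] -= 1


rr = RoundRobin(["app-1", "app-2", "app-3"])
print([rr.pick() for _ in range(4)])

lc = LeastConnections(["app-1", "app-2"])
first = lc.acquire()
second = lc.acquire()
lc.release(second)
print(lc.acquire())  # the server that was just released has the fewest connections
```

Round robin assumes requests cost about the same; least connections adapts better when some requests are long-lived.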

Message Queue Patterns:

Pub/Sub: One-to-many (notifications)
Work Queue: Task distribution (job processing)
Fan-out: Broadcast to multiple queues
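The difference between pub/sub and a work queue is who receives each message. The in-process sketch below is only a model of the semantics; the class names `PubSub` and `WorkQueue` are illustrative, and a real deployment would use a broker such as Kafka or RabbitMQ.

```python
from collections import defaultdict, deque


class PubSub:
    """Every subscriber of a topic receives every published message."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, message):
        for handler in self._subscribers[topic]:
            handler(message)


class WorkQueue:
    """Each task is handed to exactly one worker."""

    def __init__(self):
        self._tasks = deque()

    def submit(self, task):
        self._tasks.append(task)

    def take(self):
        """Return the oldest pending task, or None if the queue is empty."""
        return self._tasks.popleft() if self._tasks else None


email_inbox, push_inbox = [], []
bus = PubSub()
bus.subscribe("order.created", email_inbox.append)
bus.subscribe("order.created", push_inbox.append)
bus.publish("order.created", {"order_id": 42})
print(email_inbox, push_inbox)  # both subscribers saw the same event

jobs = WorkQueue()
jobs.submit("resize-image-1")
jobs.submit("resize-image-2")
print(jobs.take(), jobs.take(), jobs.take())
```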

6. Scalability Patterns

Horizontal Scaling:

Add more servers
Use load balancers
Stateless application servers
Session stored in cache/DB

Vertical Scaling:

Add more CPU/RAM to servers
Limited by hardware
Simpler but has limits

Microservices:

Monolith:
[Single App] → [DB]

Microservices:
[User Service] → [User DB]
[Post Service] → [Post DB]
[Feed Service] → [Feed DB]

Benefits:

Independent scaling
Technology flexibility
Fault isolation

Drawbacks:

Increased complexity
Network latency
Distributed transactions

7. Reliability & Availability

Replication:

Master-Slave: One writer, multiple readers
Master-Master: Multiple writers (conflict resolution needed)
Multi-region: Geographic redundancy

Failover:

Active-Passive: Standby server takes over
Active-Active: Both servers handle traffic

Rate Limiting:

Token bucket algorithm
Leaky bucket algorithm
Fixed window counter
Sliding window log
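Of the algorithms listed, the token bucket is a common starting point because it allows short bursts while capping the long-run rate. Below is a minimal single-process sketch; the class name `TokenBucket` and the injectable `clock` parameter are assumptions for illustration, and a shared limiter would keep this state in a store such as Redis.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity` while refilling at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self._rate = rate
        self._capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)  # start with a full bucket
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self._clock()
        # Refill for the elapsed time, never exceeding the bucket capacity.
        self._tokens = min(self._capacity,
                           self._tokens + (now - self._last) * self._rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


limiter = TokenBucket(rate=5, capacity=10)
accepted = sum(limiter.allow() for _ in range(20))
print(f"accepted {accepted} of 20 back-to-back requests")
```

The capacity sets the burst size and the rate sets the sustained throughput, so the two can be tuned independently.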

Circuit Breaker:

States:
Closed → Normal operation
Open → Reject requests immediately
Half-Open → Test if service recovered
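The three states above can be sketched as a small wrapper around a remote call. The names `CircuitBreaker` and `CircuitOpenError`, the failure threshold, and the reset timeout are assumptions for this example; libraries and service meshes offer more complete implementations.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self._threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = 0.0
        self.state = "closed"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self._clock() - self._opened_at < self._reset_timeout:
                raise CircuitOpenError("circuit is open")  # reject immediately
            self.state = "half_open"  # timeout elapsed: allow one trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self.state == "half_open" or self._failures >= self._threshold:
                self.state = "open"
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self.state = "closed"
        return result


breaker = CircuitBreaker(failure_threshold=2, reset_timeout=5.0)
print(breaker.call(lambda: "ok"), breaker.state)
```

Rejecting calls while open gives the failing dependency time to recover and keeps callers from piling up on timeouts.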

8. Common System Design Patterns

Content Delivery:

Use CDN for static assets
Geo-distributed edge servers
Cache at edge locations

Data Consistency:

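Strong Consistency: Every read reflects the latest acknowledged write (typical of ACID databases)
Eventual Consistency: Replicas converge over time, so reads may briefly return stale data (typical of BASE systems)
CAP Theorem: Often summarized as "choose 2 of 3" among Consistency, Availability, and Partition Tolerance; in practice partitions will happen, so the real choice during a partition is between Consistency and Availability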

API Design:

RESTful:
GET    /api/users/{id}
POST   /api/users
PUT    /api/users/{id}
DELETE /api/users/{id}

GraphQL:
query {
  user(id: "123") {
    name
    posts {
      title
    }
  }
}

9. System Design Template

Use this structure (based on `system_design/00_template.md`):

{System Name}

1. Requirements

Functional

[List core features]

Non-Functional

Scale: [Users, QPS, Data]
Performance: [Latency requirements]
Availability: [Uptime target]

2. Capacity Estimation

Traffic: [QPS calculations]
Storage: [Data size, growth]
Bandwidth: [Network requirements]

3. API Design

[endpoint] - [description]

4. High-Level Architecture

[Diagram]

5. Database Schema

[Tables and relationships]

6. Detailed Design

Component 1

[Deep dive]

Component 2

[Deep dive]

7. Scalability

[How to scale each component]

8. Trade-offs

[Decisions and alternatives]

10. Real-World Examples

Reference case studies in `system_design/`:

Netflix: Video streaming, recommendation
Twitter: Timeline, tweet storage, trending
Uber: Real-time matching, location tracking
Instagram: Image storage, feed generation
WhatsApp: Message delivery, presence

Common Patterns:

News Feed: Fan-out on write vs fan-out on read
Rate Limiter: Token bucket with Redis
URL Shortener: Base62 encoding, hash collision
Chat System: WebSocket, message queue
Notification: Push notification service, APNs/FCM
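For the URL shortener pattern, one way to sidestep hash collisions is to Base62-encode a unique numeric ID, such as an auto-incrementing primary key, instead of hashing the long URL. The sketch below assumes that approach; the alphabet ordering and function names are illustrative choices.

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
BASE = len(ALPHABET)  # 62


def encode_base62(number: int) -> str:
    """Encode a non-negative integer ID as a short Base62 string."""
    if number < 0:
        raise ValueError("number must be non-negative")
    if number == 0:
        return ALPHABET[0]
    digits = []
    while number:
        number, remainder = divmod(number, BASE)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


def decode_base62(text: str) -> int:
    """Invert encode_base62; raises ValueError on characters outside ALPHABET."""
    number = 0
    for char in text:
        number = number * BASE + ALPHABET.index(char)
    return number


short_code = encode_base62(123_456_789)
print(short_code, decode_base62(short_code))
```

Seven Base62 characters cover 62^7 (about 3.5 trillion) distinct IDs, which is why short codes of 7 characters or fewer are common in this design.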

Interview Tips

Time Management:

Requirements: 10%
High-level design: 25%
Deep dive: 50%
Wrap up: 15%

Communication:

Think out loud
Ask clarifying questions
Discuss trade-offs
Acknowledge limitations

What interviewers look for:

Problem-solving approach
Technical depth
Trade-off analysis
Scale awareness
Communication skills

Common Mistakes to Avoid

Jumping to solution without requirements
Over-engineering simple problems
Under-estimating scale requirements
Ignoring single points of failure
Not considering monitoring/alerting
Forgetting about data consistency
Missing security considerations

Project Context

Templates in `system_design/00_template.md`
Case studies in `system_design/*.md`
Reference materials in `doc/system_design/`
Follow the established documentation pattern
