system-architecture

System Architecture Expert

When to use this Skill

Use this Skill when:
  • Designing distributed systems
  • Writing system design documentation
  • Preparing for system design interviews
  • Creating architecture diagrams
  • Analyzing trade-offs between design choices
  • Reviewing or improving existing system designs

System Design Framework

1. Requirements Gathering (5-10 minutes)

Functional Requirements:
  • What are the core features?
  • What actions can users perform?
  • What are the inputs and outputs?
Non-Functional Requirements:
  • Scale: How many users? How much data?
  • Performance: Latency requirements? (p50, p95, p99)
  • Availability: What uptime is needed? (99.9%, 99.99%)
  • Consistency: Strong or eventual consistency?
Constraints:
  • Budget limitations
  • Technology stack constraints
  • Team expertise
  • Timeline
Example Questions:
- How many daily active users?
- What's the read:write ratio?
- What's the average data size?
- What's the peak load vs average load?
- Do we need real-time updates?
- Can we tolerate any data loss?

2. Capacity Estimation (Back-of-the-envelope)

Calculate:
Traffic:
- DAU = 100M users
- Each user makes 10 requests/day
- QPS = 100M * 10 / 86400 ≈ 11,574 QPS
- Peak QPS = 2-3x average ≈ 30,000 QPS

Storage:
- 100M users * 1KB per user = 100GB
- With 3x replication = 300GB
- Growth: if 300GB of new data (after replication) is written per day, 300GB * 365 days = 109.5TB/year

Bandwidth:
- QPS * average request size
- 11,574 * 10KB = 115.74MB/s
Memory/Cache:
  • 80-20 rule: 20% of data gets 80% of traffic
  • Cache = 20% of total data for hot data
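The arithmetic above can be wrapped in a small helper so the inputs are easy to vary. This is a minimal sketch: the function name `estimate_capacity`, its parameters, and the 2.5x peak factor (a midpoint of the 2-3x range) are assumptions for illustration, and sizes use decimal units (1KB = 1,000 bytes) to match the figures above.

```python
SECONDS_PER_DAY = 86_400


def estimate_capacity(dau, requests_per_user, bytes_per_user, request_bytes,
                      replication=3, peak_factor=2.5, hot_fraction=0.2):
    """Return rough traffic, storage, bandwidth, and cache estimates."""
    avg_qps = dau * requests_per_user / SECONDS_PER_DAY
    base_storage = dau * bytes_per_user
    return {
        "avg_qps": avg_qps,
        "peak_qps": avg_qps * peak_factor,                # assumed 2.5x average
        "storage_bytes": base_storage * replication,      # 3x replication
        "bandwidth_bytes_per_s": avg_qps * request_bytes,
        "cache_bytes": base_storage * hot_fraction,       # 80-20 rule: hot 20%
    }


est = estimate_capacity(dau=100_000_000, requests_per_user=10,
                        bytes_per_user=1_000, request_bytes=10_000)
for name, value in est.items():
    print(f"{name}: {value:,.0f}")
```

Keeping the factors as named parameters makes it easy to re-run the estimate when an interviewer changes an assumption, such as the read:write ratio or the replication factor.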

3. High-Level Design

Core Components:
  1. Client Layer (Web, Mobile, Desktop)
  2. API Gateway / Load Balancer
  3. Application Servers (Business logic)
  4. Cache Layer (Redis, Memcached)
  5. Database (SQL, NoSQL, or both)
  6. Message Queue (Kafka, RabbitMQ)
  7. Object Storage (S3, GCS)
  8. CDN (CloudFront, Akamai)
Draw Architecture:
[Clients] → [CDN]
        [Load Balancer]
    [Application Servers]
        ↙     ↓     ↘
   [Cache] [DB] [Queue] → [Workers]
                      [Object Storage]

4. Database Design

SQL vs NoSQL Decision:
Use SQL when:
  • ACID transactions required
  • Complex queries with JOINs
  • Structured data with relationships
  • Examples: PostgreSQL, MySQL
Use NoSQL when:
  • Massive scale (horizontal scaling)
  • Flexible schema
  • High write throughput
  • Examples: Cassandra, DynamoDB, MongoDB
Sharding Strategy:
  • Hash-based: user_id % num_shards
  • Range-based: Users 1-100M on shard 1
  • Geographic: US users on US shard
  • Consistent hashing: For even distribution
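To make the difference between hash-based (modulo) sharding and consistent hashing concrete, the sketch below counts how many keys change shards when a fifth shard is added. The class `ConsistentHashRing`, the shard names, the use of MD5, and the 100 virtual nodes per shard are all assumptions chosen for illustration, not a production design.

```python
import bisect
import hashlib


def modulo_shard(user_id: int, num_shards: int) -> int:
    """Hash-based sharding: simple, but most keys move when num_shards changes."""
    return user_id % num_shards


def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class ConsistentHashRing:
    """Hash ring with virtual nodes (hypothetical helper for this example)."""

    def __init__(self, nodes, vnodes=100):
        # Each node is placed on the ring at `vnodes` pseudo-random points.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        # Walk clockwise to the first point at or after the key's hash.
        idx = bisect.bisect(self._points, _hash(key)) % len(self._points)
        return self._ring[idx][1]


keys = [f"user-{i}" for i in range(10_000)]
ring4 = ConsistentHashRing([f"shard-{i}" for i in range(4)])
ring5 = ConsistentHashRing([f"shard-{i}" for i in range(5)])

ring_moved = [k for k in keys if ring4.node_for(k) != ring5.node_for(k)]
modulo_moved = [i for i in range(10_000)
                if modulo_shard(i, 4) != modulo_shard(i, 5)]

print(f"modulo: {len(modulo_moved)} of 10000 keys moved")
print(f"ring:   {len(ring_moved)} of 10000 keys moved")
```

With modulo sharding, going from 4 to 5 shards relocates 80% of sequential IDs. On the ring, only keys that now fall to the new shard move, which is roughly one fifth of them.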
Schema Design:
sql
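-- Example: URL Shortener (PostgreSQL syntax)
CREATE TABLE urls (
    id BIGSERIAL PRIMARY KEY,
    short_url VARCHAR(10) UNIQUE NOT NULL,  -- UNIQUE already creates an index
    long_url TEXT NOT NULL,
    user_id BIGINT,
    created_at TIMESTAMP DEFAULT NOW(),
    expires_at TIMESTAMP,
    click_count INT DEFAULT 0
);
-- PostgreSQL has no inline INDEX clause; secondary indexes are created separately.
CREATE INDEX idx_urls_user_id ON urls (user_id);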

5. Deep Dive Components

Caching Strategy:
  • Cache-Aside: App reads from cache, loads from DB on miss
  • Write-Through: Write to cache and DB together
  • Write-Behind: Write to cache, async write to DB
Eviction Policies:
  • LRU (Least Recently Used) - Most common
  • LFU (Least Frequently Used)
  • TTL (Time To Live)
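The cache-aside strategy and LRU eviction described above can be sketched together. The `LRUCache` class, the in-memory `db` dictionary standing in for a database, and the `cache_aside_read` helper are hypothetical names for illustration; a real deployment would typically use Redis or Memcached rather than an in-process dictionary.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used entry."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None                      # miss (None is the miss sentinel)
        self._items.move_to_end(key)         # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict least recently used


db = {"a": 1, "b": 2, "c": 3}   # stand-in for the database
db_reads = []                   # records which keys hit the database


def cache_aside_read(cache: LRUCache, key):
    """Cache-aside: try the cache first, load from the DB on a miss."""
    value = cache.get(key)
    if value is None:
        db_reads.append(key)
        value = db[key]
        cache.put(key, value)
    return value


cache = LRUCache(capacity=2)
values = [cache_aside_read(cache, k) for k in ["a", "b", "a", "c", "b"]]
print(values)     # values returned to the caller
print(db_reads)   # keys that missed the cache
```

In this run the second read of "a" is served from the cache, while reading "c" evicts "b" (the least recently used key), so the final read of "b" goes back to the database.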
Load Balancing:
  • Round Robin: Simple, equal distribution
  • Least Connections: Route to least busy server
  • Consistent Hashing: Minimize redistribution
  • Weighted: Based on server capacity
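Two of the load-balancing strategies above, round robin and least connections, can be sketched in a few lines. The class names and server labels are assumptions for illustration, and the sketch ignores health checks and concurrency.

```python
import itertools


class RoundRobinBalancer:
    """Hands out servers in a fixed rotation."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        return next(self._cycle)


class LeastConnectionsBalancer:
    """Routes each request to the server with the fewest active connections."""

    def __init__(self, servers):
        self.active = {server: 0 for server in servers}

    def acquire(self):
        # Ties are broken by insertion order of the servers.
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        self.active[server] -= 1


rr = RoundRobinBalancer(["s1", "s2", "s3"])
rr_picks = [rr.pick() for _ in range(4)]

lc = LeastConnectionsBalancer(["s1", "s2", "s3"])
first = lc.acquire()
second = lc.acquire()
lc.release(first)        # the first request finishes
third = lc.acquire()     # goes to a server with zero active connections
print(rr_picks, first, second, third)
```

Round robin needs no state about the servers, which is why it is the simplest option. Least connections adapts better when request durations vary widely.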
Message Queue Patterns:
  • Pub/Sub: One-to-many (notifications)
  • Work Queue: Task distribution (job processing)
  • Fan-out: Broadcast to multiple queues

6. Scalability Patterns

Horizontal Scaling:
  • Add more servers
  • Use load balancers
  • Stateless application servers
  • Session stored in cache/DB
Vertical Scaling:
  • Add more CPU/RAM to servers
  • Limited by hardware
  • Simpler but has limits
Microservices:
Monolith:
[Single App] → [DB]

Microservices:
[User Service] → [User DB]
[Post Service] → [Post DB]
[Feed Service] → [Feed DB]
Benefits:
  • Independent scaling
  • Technology flexibility
  • Fault isolation
Drawbacks:
  • Increased complexity
  • Network latency
  • Distributed transactions

7. Reliability & Availability

Replication:
  • Master-Slave: One writer, multiple readers
  • Master-Master: Multiple writers (conflict resolution needed)
  • Multi-region: Geographic redundancy
Failover:
  • Active-Passive: Standby server takes over
  • Active-Active: Both servers handle traffic
Rate Limiting:
  • Token bucket algorithm
  • Leaky bucket algorithm
  • Fixed window counter
  • Sliding window log
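As one example from the list above, a token bucket can be sketched as follows. This is a single-process illustration with assumed names (`TokenBucket`, `allow`); the injectable `clock` parameter is an addition to make the behaviour easy to demonstrate, and a distributed limiter would keep this state in a shared store instead.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)   # start with a full bucket
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill for the elapsed time, never exceeding capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


fake_now = [0.0]                         # controllable clock for the demo
bucket = TokenBucket(rate=1.0, capacity=2, clock=lambda: fake_now[0])

burst = [bucket.allow(), bucket.allow(), bucket.allow()]
fake_now[0] = 1.0                        # one second later: one token refilled
after_refill = bucket.allow()
print(burst, after_refill)
```

The bucket admits a burst of two requests, rejects the third, and admits one more after a second of refill. A leaky bucket differs in that it smooths output to a constant rate rather than permitting bursts.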
Circuit Breaker:
States:
Closed → Normal operation
Open → Reject requests immediately
Half-Open → Test if service recovered
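The three states above can be sketched as a small state machine. The names `CircuitBreaker` and `CircuitOpenError`, the failure threshold, and the reset timeout are assumptions for illustration; production libraries usually add per-call timeouts, metrics, and thread safety.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self.state = "closed"
        self._failures = 0
        self._opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; request rejected")
            self.state = "half-open"     # timeout elapsed: allow a trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if (self.state == "half-open"
                    or self._failures >= self.failure_threshold):
                self.state = "open"      # trip (or re-trip) the breaker
                self._opened_at = self._clock()
            raise
        self._failures = 0               # success closes the circuit
        self.state = "closed"
        return result


breaker_now = [0.0]                      # controllable clock for the demo
breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0,
                         clock=lambda: breaker_now[0])


def failing_backend():
    raise RuntimeError("backend down")


for _ in range(2):
    try:
        breaker.call(failing_backend)
    except RuntimeError:
        pass
state_after_failures = breaker.state
print(state_after_failures)
```

After two consecutive failures the breaker opens and rejects calls immediately. Once the reset timeout has passed, the next call runs as a half-open trial, and a success returns the breaker to closed.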

8. Common System Design Patterns

Content Delivery:
  • Use CDN for static assets
  • Geo-distributed edge servers
  • Cache at edge locations
Data Consistency:
  • Strong Consistency: Read reflects latest write (ACID)
  • Eventual Consistency: Reads eventually reflect write (BASE)
  • CAP Theorem: Partition tolerance is unavoidable in a distributed system, so during a network partition you choose between Consistency and Availability
API Design:
RESTful:
GET    /api/users/{id}
POST   /api/users
PUT    /api/users/{id}
DELETE /api/users/{id}

GraphQL:
query {
  user(id: "123") {
    name
    posts {
      title
    }
  }
}

9. System Design Template

Use this structure (based on system_design/00_template.md):

{System Name}

1. Requirements

Functional

  • [List core features]

Non-Functional

  • Scale: [Users, QPS, Data]
  • Performance: [Latency requirements]
  • Availability: [Uptime target]

2. Capacity Estimation

  • Traffic: [QPS calculations]
  • Storage: [Data size, growth]
  • Bandwidth: [Network requirements]

3. API Design

[endpoint] - [description]

4. High-Level Architecture

[Diagram]

5. Database Schema

[Tables and relationships]

6. Detailed Design

Component 1

[Deep dive]

Component 2

[Deep dive]

7. Scalability

[How to scale each component]

8. Trade-offs

[Decisions and alternatives]

10. Real-World Examples

Reference case studies in system_design/:
  • Netflix: Video streaming, recommendation
  • Twitter: Timeline, tweet storage, trending
  • Uber: Real-time matching, location tracking
  • Instagram: Image storage, feed generation
  • WhatsApp: Message delivery, presence
Common Patterns:
  • News Feed: Fan-out on write vs fan-out on read
  • Rate Limiter: Token bucket with Redis
  • URL Shortener: Base62 encoding, hash collision
  • Chat System: WebSocket, message queue
  • Notification: Push notification service, APNs/FCM
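For the URL Shortener pattern, Base62 encoding of a numeric ID (for example, an auto-incremented primary key) could look like the sketch below. The alphabet ordering and function names are assumptions for illustration; any fixed ordering of the 62 characters works as long as encoder and decoder agree.

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
BASE = len(ALPHABET)  # 62


def base62_encode(n: int) -> str:
    """Encode a non-negative integer as a Base62 string."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, BASE)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))  # most significant digit first


def base62_decode(s: str) -> int:
    """Decode a Base62 string back to the integer it represents."""
    n = 0
    for ch in s:
        n = n * BASE + ALPHABET.index(ch)
    return n


short_code = base62_encode(125)
print(short_code, base62_decode(short_code))
```

Encoding a unique sequential ID avoids hash collisions entirely, at the cost of making codes guessable. A 7-character code covers 62^7, or about 3.5 trillion, distinct IDs.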

Interview Tips

Time Management:
  • Requirements: 10%
  • High-level design: 25%
  • Deep dive: 50%
  • Wrap up: 15%
Communication:
  • Think out loud
  • Ask clarifying questions
  • Discuss trade-offs
  • Acknowledge limitations
What interviewers look for:
  • Problem-solving approach
  • Technical depth
  • Trade-off analysis
  • Scale awareness
  • Communication skills

Common Mistakes to Avoid

  • Jumping to a solution before clarifying requirements
  • Over-engineering simple problems
  • Under-estimating scale requirements
  • Ignoring single points of failure
  • Not considering monitoring/alerting
  • Forgetting about data consistency
  • Missing security considerations

Project Context

  • Templates in system_design/00_template.md
  • Case studies in system_design/*.md
  • Reference materials in doc/system_design/
  • Follow the established documentation pattern