system-architecture
System Architecture Expert
When to use this Skill
Use this Skill when:
- Designing distributed systems
- Writing system design documentation
- Preparing for system design interviews
- Creating architecture diagrams
- Analyzing trade-offs between design choices
- Reviewing or improving existing system designs
System Design Framework
1. Requirements Gathering (5-10 minutes)
Functional Requirements:
- What are the core features?
- What actions can users perform?
- What are the inputs and outputs?
Non-Functional Requirements:
- Scale: How many users? How much data?
- Performance: Latency requirements? (p50, p95, p99)
- Availability: What uptime is needed? (99.9%, 99.99%)
- Consistency: Strong or eventual consistency?
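An availability target is easier to reason about as a downtime budget. The sketch below converts the percentages mentioned above into minutes per year; the function name `downtime_budget_minutes` is illustrative, and a 365-day year is assumed.

```python
# Convert an availability target into an allowed-downtime budget.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a non-leap year


def downtime_budget_minutes(availability_pct: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    return SECONDS_PER_YEAR * (1 - availability_pct / 100) / 60


for target in (99.9, 99.99):
    print(f"{target}% -> {downtime_budget_minutes(target):.1f} min/year")
```

Roughly, 99.9% allows about 8.8 hours of downtime per year, while 99.99% allows under an hour.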
Constraints:
- Budget limitations
- Technology stack constraints
- Team expertise
- Timeline
Example Questions:
- How many daily active users?
- What's the read:write ratio?
- What's the average data size?
- What's the peak load vs average load?
- Do we need real-time updates?
- Can we tolerate data loss?
2. Capacity Estimation (Back-of-the-envelope)
Calculate:
Traffic:
- DAU = 100M users
- Each user makes 10 requests/day
- QPS = 100M * 10 / 86400 ≈ 11,574 QPS
- Peak QPS = 2-3x average ≈ 30,000 QPS
Storage:
- 100M users * 1KB per user = 100GB
- With 3x replication = 300GB
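- Growth: the 300GB above is a one-time footprint for user records; annual growth depends on new data written per day (e.g. if 300GB were written daily, 300GB * 365 days = 109.5TB/year)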
Bandwidth:
- QPS * average request size
- 11,574 * 10KB = 115.74MB/s
Memory/Cache:
- 80-20 rule: 20% of data gets 80% of traffic
- Cache size ≈ 20% of total data (the hot set)
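The arithmetic above can be scripted so the inputs are easy to vary. This is a minimal sketch that reuses the example figures (100M DAU, 10 requests per user per day, 1KB per user, 10KB per request) with decimal units; the variable names are illustrative.

```python
# Back-of-the-envelope capacity estimate using the example figures above.
SECONDS_PER_DAY = 86_400

dau = 100_000_000          # daily active users
requests_per_user = 10     # requests per user per day
bytes_per_user = 1_000     # 1KB of stored data per user (decimal units)
avg_request_kb = 10        # average request size in KB
replication_factor = 3

avg_qps = dau * requests_per_user / SECONDS_PER_DAY
peak_qps_range = (2 * avg_qps, 3 * avg_qps)        # peak assumed at 2-3x average
storage_gb = dau * bytes_per_user / 1e9
replicated_gb = storage_gb * replication_factor
bandwidth_mb_s = avg_qps * avg_request_kb / 1_000
hot_cache_gb = 0.2 * storage_gb                    # 80-20 rule: cache the hot 20%

print(f"Average QPS:   {avg_qps:,.0f}")
print(f"Peak QPS:      {peak_qps_range[0]:,.0f} to {peak_qps_range[1]:,.0f}")
print(f"Storage:       {storage_gb:,.0f} GB ({replicated_gb:,.0f} GB replicated)")
print(f"Bandwidth:     {bandwidth_mb_s:,.2f} MB/s")
print(f"Hot-set cache: {hot_cache_gb:,.0f} GB")
```

The cache estimate here is taken against the unreplicated data set, which is an assumption; sizing it against replicated storage would overstate the hot set.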
3. High-Level Design
Core Components:
- Client Layer (Web, Mobile, Desktop)
- API Gateway / Load Balancer
- Application Servers (Business logic)
- Cache Layer (Redis, Memcached)
- Database (SQL, NoSQL, or both)
- Message Queue (Kafka, RabbitMQ)
- Object Storage (S3, GCS)
- CDN (CloudFront, Akamai)
Draw Architecture:
[Clients] → [CDN]
↓
[Load Balancer]
↓
[Application Servers]
↙ ↓ ↘
[Cache] [DB] [Queue] → [Workers]
↓
[Object Storage]
4. Database Design
SQL vs NoSQL Decision:
Use SQL when:
- ACID transactions required
- Complex queries with JOINs
- Structured data with relationships
- Examples: PostgreSQL, MySQL
Use NoSQL when:
- Massive scale (horizontal scaling)
- Flexible schema
- High write throughput
- Examples: Cassandra, DynamoDB, MongoDB
Sharding Strategy:
- Hash-based: user_id % num_shards
- Range-based: Users 1-100M on shard 1
- Geographic: US users on US shard
- Consistent hashing: For even distribution
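To see why consistent hashing is often preferred over plain modulo hashing, the sketch below counts how many keys change shards when a fifth shard is added. The ring implementation (`ConsistentHashRing`, 100 virtual nodes per shard, SHA-256 for placement) is an illustrative assumption, not a prescribed design.

```python
import bisect
import hashlib


def stable_hash(key: str) -> int:
    """Deterministic 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


def modulo_shard(user_id: int, num_shards: int) -> int:
    """Hash-based sharding as written above: user_id % num_shards."""
    return user_id % num_shards


class ConsistentHashRing:
    """Hash ring with virtual nodes; a key maps to the next point clockwise."""

    def __init__(self, shards, vnodes=100):
        self._ring = sorted(
            (stable_hash(f"{shard}#{i}"), shard)
            for shard in shards
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def shard_for(self, key: str) -> str:
        idx = bisect.bisect_right(self._points, stable_hash(key)) % len(self._ring)
        return self._ring[idx][1]


# Compare how many of 10,000 keys move when growing from 4 to 5 shards.
user_ids = range(10_000)
before = ConsistentHashRing([f"shard-{i}" for i in range(4)])
after = ConsistentHashRing([f"shard-{i}" for i in range(5)])

moved_modulo = sum(modulo_shard(u, 4) != modulo_shard(u, 5) for u in user_ids)
moved_ring = sum(
    before.shard_for(f"user:{u}") != after.shard_for(f"user:{u}") for u in user_ids
)
print(f"modulo: {moved_modulo} keys moved, ring: {moved_ring} keys moved")
```

With modulo hashing most keys are remapped on a resize, whereas the ring only hands keys to the newly added shard.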
Schema Design:
```sql
-- Example: URL Shortener (PostgreSQL syntax)
CREATE TABLE urls (
    id          BIGSERIAL PRIMARY KEY,
    short_url   VARCHAR(10) UNIQUE NOT NULL,  -- UNIQUE already creates an index
    long_url    TEXT NOT NULL,
    user_id     BIGINT,
    created_at  TIMESTAMP DEFAULT NOW(),
    expires_at  TIMESTAMP,
    click_count INT DEFAULT 0
);
-- PostgreSQL has no inline INDEX clause; secondary indexes are created separately.
CREATE INDEX idx_urls_user_id ON urls (user_id);
```
5. Deep Dive Components
Caching Strategy:
- Cache-Aside: App reads from cache, loads from DB on miss
- Write-Through: Write to cache and DB together
- Write-Behind: Write to cache, async write to DB
Eviction Policies:
- LRU (Least Recently Used) - Most common
- LFU (Least Frequently Used)
- TTL (Time To Live)
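Cache-aside combined with LRU eviction and a TTL could look roughly like the sketch below. `LRUCache`, `FakeDatabase`, and `load_user` are hypothetical names used for illustration; in practice the cache would usually be Redis or Memcached rather than an in-process dictionary.

```python
import time
from collections import OrderedDict


class LRUCache:
    """In-memory cache with LRU eviction and a per-entry TTL."""

    def __init__(self, capacity, ttl_seconds):
        self.capacity = capacity
        self.ttl = ttl_seconds
        self._items = OrderedDict()  # key -> (value, expires_at)

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._items[key]         # expired: treat as a miss
            return None
        self._items.move_to_end(key)     # mark as most recently used
        return value

    def put(self, key, value):
        self._items[key] = (value, time.monotonic() + self.ttl)
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict least recently used


class FakeDatabase:
    """Stand-in for a real database; counts reads to make misses visible."""

    def __init__(self, rows):
        self.rows = rows
        self.reads = 0

    def fetch(self, key):
        self.reads += 1
        return self.rows[key]


def load_user(cache, db, user_id):
    """Cache-aside read: try the cache, fall back to the DB, then populate."""
    value = cache.get(user_id)
    if value is None:
        value = db.fetch(user_id)
        cache.put(user_id, value)
    return value


db = FakeDatabase({"u1": "Ada", "u2": "Grace", "u3": "Edsger"})
cache = LRUCache(capacity=2, ttl_seconds=60)
load_user(cache, db, "u1")  # miss: reads the DB
load_user(cache, db, "u1")  # hit: served from the cache
load_user(cache, db, "u2")  # miss
load_user(cache, db, "u3")  # miss: evicts u1, the least recently used entry
print(f"DB reads: {db.reads}")
```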
Load Balancing:
- Round Robin: Simple, equal distribution
- Least Connections: Route to least busy server
- Consistent Hashing: Minimize redistribution
- Weighted: Based on server capacity
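Two of the strategies above, round robin and least connections, can be sketched in a few lines. The class names are illustrative, and real load balancers also handle health checks and concurrency, which this sketch omits.

```python
import itertools


class RoundRobinBalancer:
    """Hands out servers in a fixed rotation."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        return next(self._cycle)


class LeastConnectionsBalancer:
    """Routes each request to the server with the fewest active connections."""

    def __init__(self, servers):
        self.active = {server: 0 for server in servers}

    def acquire(self):
        # Ties are broken by insertion order of the servers.
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        self.active[server] -= 1


rr = RoundRobinBalancer(["a", "b", "c"])
rr_picks = [rr.pick() for _ in range(5)]

lc = LeastConnectionsBalancer(["a", "b"])
first = lc.acquire()    # both idle, so the first server is chosen
second = lc.acquire()   # "a" is busy, so "b" is chosen
lc.release(second)      # "b" finishes its request
third = lc.acquire()    # "b" is idle again and is chosen
print(rr_picks, first, second, third)
```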
Message Queue Patterns:
- Pub/Sub: One-to-many (notifications)
- Work Queue: Task distribution (job processing)
- Fan-out: Broadcast to multiple queues
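The difference between a work queue and pub/sub is who receives each message. The in-process sketch below illustrates only the delivery semantics; `WorkQueue` and `PubSub` are hypothetical classes, not the API of Kafka or RabbitMQ.

```python
from collections import defaultdict, deque


class WorkQueue:
    """Work queue: each task is delivered to exactly one consumer."""

    def __init__(self):
        self._tasks = deque()

    def publish(self, task):
        self._tasks.append(task)

    def consume(self):
        return self._tasks.popleft() if self._tasks else None


class PubSub:
    """Pub/sub: each message is delivered to every subscriber of its topic."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, message):
        for handler in self._subscribers.get(topic, []):
            handler(message)


# Work queue: two workers split the tasks between them.
jobs = WorkQueue()
jobs.publish("resize-1")
jobs.publish("resize-2")
worker_a_task = jobs.consume()
worker_b_task = jobs.consume()

# Pub/sub: both subscribers receive the same event.
email_inbox, push_inbox = [], []
bus = PubSub()
bus.subscribe("order.created", email_inbox.append)
bus.subscribe("order.created", push_inbox.append)
bus.publish("order.created", {"order_id": 42})
```

Fan-out to multiple queues can be seen as pub/sub in which each subscriber is itself a work queue.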
6. Scalability Patterns
Horizontal Scaling:
- Add more servers
- Use load balancers
- Stateless application servers
- Session stored in cache/DB
Vertical Scaling:
- Add more CPU/RAM to servers
- Limited by hardware
- Simpler but has limits
Microservices:
Monolith:
[Single App] → [DB]
Microservices:
[User Service] → [User DB]
[Post Service] → [Post DB]
[Feed Service] → [Feed DB]
Benefits:
- Independent scaling
- Technology flexibility
- Fault isolation
Drawbacks:
- Increased complexity
- Network latency
- Distributed transactions
7. Reliability & Availability
Replication:
- Master-Slave: One writer, multiple readers
- Master-Master: Multiple writers (conflict resolution needed)
- Multi-region: Geographic redundancy
Failover:
- Active-Passive: Standby server takes over
- Active-Active: Both servers handle traffic
Rate Limiting:
- Token bucket algorithm
- Leaky bucket algorithm
- Fixed window counter
- Sliding window log
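As one example of the algorithms listed, a token bucket refills at a steady rate and lets short bursts through up to its capacity. The sketch below is a single-process version with an injectable clock for testing; the class and parameter names are illustrative, and a distributed limiter would keep this state in a shared store.

```python
import time


class TokenBucket:
    """Token bucket: refills at `rate` tokens/second up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)   # start full so bursts are allowed
        self._last = clock()

    def allow(self, cost=1.0):
        """Return True and spend `cost` tokens if the request is allowed."""
        now = self._clock()
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


# Drive the bucket with a fake clock so the behavior is deterministic.
fake_now = [0.0]
bucket = TokenBucket(rate=1, capacity=2, clock=lambda: fake_now[0])
burst = [bucket.allow() for _ in range(3)]  # third request exceeds the burst size
fake_now[0] = 1.0                           # one second later: one token refilled
after_refill = bucket.allow()
print(burst, after_refill)
```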
Circuit Breaker:
States:
Closed → Normal operation
Open → Reject requests immediately
Half-Open → Test if service recovered
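The three states can be sketched as a small wrapper around a call. The thresholds, the `CircuitOpenError` exception, and the single-trial half-open policy below are assumptions made for illustration; production libraries differ in the details.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self.state = "closed"
        self._failures = 0
        self._opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit is open; request rejected")
            self.state = "half-open"     # timeout elapsed: allow one trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures, opens the circuit.
            if self.state == "half-open" or self._failures >= self.failure_threshold:
                self.state = "open"
                self._opened_at = self._clock()
            raise
        self._failures = 0               # success closes the circuit again
        self.state = "closed"
        return result


def flaky_service():
    """Hypothetical downstream call that always fails."""
    raise RuntimeError("service down")


fake_now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10, clock=lambda: fake_now[0])
for _ in range(2):
    try:
        breaker.call(flaky_service)
    except RuntimeError:
        pass
state_after_failures = breaker.state     # two failures reach the threshold
```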
8. Common System Design Patterns
Content Delivery:
- Use CDN for static assets
- Geo-distributed edge servers
- Cache at edge locations
Data Consistency:
- Strong Consistency: Read reflects latest write (ACID)
- Eventual Consistency: Reads eventually reflect write (BASE)
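- CAP Theorem: Network partitions are unavoidable in a distributed system, so during a partition you choose between Consistency and Availability (often summarized as "choose 2 of 3": Consistency, Availability, Partition Tolerance)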
API Design:
RESTful:
GET /api/users/{id}
POST /api/users
PUT /api/users/{id}
DELETE /api/users/{id}
GraphQL:
query {
  user(id: "123") {
    name
    posts {
      title
    }
  }
}
9. System Design Template
Use this structure (based on system_design/00_template.md):
```markdown
{System Name}
1. Requirements
Functional
- [List core features]
Non-Functional
- Scale: [Users, QPS, Data]
- Performance: [Latency requirements]
- Availability: [Uptime target]
2. Capacity Estimation
- Traffic: [QPS calculations]
- Storage: [Data size, growth]
- Bandwidth: [Network requirements]
3. API Design
[endpoint] - [description]
4. High-Level Architecture
[Diagram]
5. Database Schema
[Tables and relationships]
6. Detailed Design
Component 1
[Deep dive]
Component 2
[Deep dive]
7. Scalability
[How to scale each component]
8. Trade-offs
[Decisions and alternatives]
```
10. Real-World Examples
Reference case studies in system_design/:
- Netflix: Video streaming, recommendation
- Twitter: Timeline, tweet storage, trending
- Uber: Real-time matching, location tracking
- Instagram: Image storage, feed generation
- WhatsApp: Message delivery, presence
Common Patterns:
- News Feed: Fan-out on write vs fan-out on read
- Rate Limiter: Token bucket with Redis
- URL Shortener: Base62 encoding, hash collision
- Chat System: WebSocket, message queue
- Notification: Push notification service, APNs/FCM
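For the URL shortener pattern, Base62 encoding of a numeric ID is one way to produce short codes without hash collisions, since each ID maps to a unique string. The alphabet ordering and function names below are assumptions for illustration.

```python
# Base62 alphabet: digits, then uppercase, then lowercase (ordering is a choice).
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
BASE = len(ALPHABET)


def base62_encode(number: int) -> str:
    """Encode a non-negative integer (e.g. a database ID) as a Base62 string."""
    if number < 0:
        raise ValueError("number must be non-negative")
    if number == 0:
        return ALPHABET[0]
    digits = []
    while number:
        number, remainder = divmod(number, BASE)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


def base62_decode(code: str) -> int:
    """Decode a Base62 string back to the original integer."""
    number = 0
    for char in code:
        number = number * BASE + ALPHABET.index(char)
    return number


short_code = base62_encode(123_456_789)
print(short_code, base62_decode(short_code))
```

Seven Base62 characters cover 62^7 (about 3.5 trillion) IDs, which fits within the VARCHAR(10) column used in the schema example.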
Interview Tips
Time Management:
- Requirements: 10%
- High-level design: 25%
- Deep dive: 50%
- Wrap up: 15%
Communication:
- Think out loud
- Ask clarifying questions
- Discuss trade-offs
- Acknowledge limitations
What interviewers look for:
- Problem-solving approach
- Technical depth
- Trade-off analysis
- Scale awareness
- Communication skills
Common Mistakes to Avoid
- Jumping to solution without requirements
- Over-engineering simple problems
- Under-estimating scale requirements
- Ignoring single points of failure
- Not considering monitoring/alerting
- Forgetting about data consistency
- Missing security considerations
Project Context
- Templates in system_design/00_template.md
- Case studies in system_design/*.md
- Reference materials in doc/system_design/
- Follow the established documentation pattern