Scalability Playbook

Systematic approach to identifying and resolving scalability bottlenecks.

Bottleneck Analysis

Current System Profile

Traffic: 1,000 req/min
Users: 10,000 active
Data: 100GB database
Response time: p95 = 500ms

Identified Bottlenecks

1. Database Queries

Symptom: Slow page loads (2-3s)
Measurement: Query time p95 = 800ms
Impact: HIGH - affects all reads
Trigger: When p95 > 500ms

2. Single Server

Symptom: High CPU (>80%)
Measurement: Load average > 4
Impact: MEDIUM - intermittent slowdowns
Trigger: When CPU > 70%

3. No Caching

Symptom: Repeated DB queries
Measurement: Cache hit rate = 0%
Impact: MEDIUM - unnecessary load
Trigger: When query volume > 10k/min
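
The p95 figures used in the bottleneck triggers above can be computed from raw timing samples. The sketch below uses the nearest-rank method, which is one common definition (monitoring tools may interpolate and report slightly different values). The sample timings are made up for illustration and are not measurements from this system.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  return sorted[rank - 1];
}

// Hypothetical query timings in milliseconds (illustrative only).
const queryTimesMs = [120, 95, 800, 150, 110, 640, 130, 105, 98, 140];

const p95 = percentile(queryTimesMs, 95);
// Trigger from the "Database Queries" bottleneck: p95 above 500ms.
const triggered = p95 > 500;
console.log({ p95, triggered });
```

With only ten samples the p95 is simply the slowest request, so in practice the window should hold enough samples for the tail to be meaningful.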

Scaling Strategies (Ordered)

Level 1: Quick Wins (Days)

1.1 Add Database Indexes

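Problem: Slow queries
Solution:
```sql
-- Speeds up lookups of a user by email.
CREATE INDEX idx_users_email ON users(email);
-- Speeds up fetching one user's orders filtered or sorted by creation time.
CREATE INDEX idx_orders_user_created ON orders(user_id, created_at);
```
Expected Impact: 80% faster queries
Cost: $0
Effort: 1 day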

1.2 Enable Query Caching

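Problem: Repeated queries
Solution: Redis cache layer
```typescript
// Cache-aside read: try Redis first, fall back to the database on a miss.
// `redis` and `db` are assumed to be initialized clients in scope.
async function getUser(userId: string) {
  const cached = await redis.get(`user:${userId}`);
  if (cached) return JSON.parse(cached);

  const user = await db.users.findById(userId);
  // Cache for one hour (3600 s); skip caching when no user was found.
  if (user) await redis.setex(`user:${userId}`, 3600, JSON.stringify(user));
  return user;
}
```
Expected Impact: 60% reduction in DB load
Cost: $50/month
Effort: 2 days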
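
A cached read path usually needs a matching write path, otherwise readers can see stale data for up to the TTL. The sketch below shows one common approach, deleting the cache key on write. `FakeRedis`, the `users` map, and `updateEmail` are hypothetical stand-ins so the example runs without a Redis server or database. They are not part of any real client library.

```typescript
// Minimal in-memory stand-in for the three Redis commands used here
// (GET, SETEX, DEL). Assumption: a real client exposes equivalents.
class FakeRedis {
  private store = new Map<string, { value: string; expiresAt: number }>();

  async get(key: string): Promise<string | null> {
    const entry = this.store.get(key);
    if (!entry) return null;
    if (Date.now() >= entry.expiresAt) {
      this.store.delete(key);
      return null;
    }
    return entry.value;
  }

  async setex(key: string, ttlSeconds: number, value: string): Promise<void> {
    this.store.set(key, { value, expiresAt: Date.now() + ttlSeconds * 1000 });
  }

  async del(key: string): Promise<void> {
    this.store.delete(key);
  }
}

type User = { id: string; email: string };

// Hypothetical data store standing in for the users table.
const users = new Map<string, User>([
  ["42", { id: "42", email: "old@example.com" }],
]);
const cache = new FakeRedis();
let dbReads = 0; // counts cache misses that reached the "database"

async function getUser(userId: string): Promise<User | null> {
  const cached = await cache.get(`user:${userId}`);
  if (cached) return JSON.parse(cached) as User;

  dbReads++;
  const user = users.get(userId) ?? null;
  if (user) await cache.setex(`user:${userId}`, 3600, JSON.stringify(user));
  return user;
}

async function updateEmail(userId: string, email: string): Promise<void> {
  const user = users.get(userId);
  if (!user) throw new Error(`unknown user ${userId}`);
  users.set(userId, { ...user, email });
  // Drop the cached copy so the next read repopulates it from the database.
  await cache.del(`user:${userId}`);
}
```

Calling `getUser("42")` twice reaches the store once. After `updateEmail("42", ...)` the next read misses the cache and returns the new value.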

Level 2: Horizontal Scaling (Weeks)

2.1 Add Read Replicas

Problem: Read-heavy workload
Solution: Route reads to replicas
```text
Write Load: Primary DB
Read Load: 3x Read Replicas
```
Expected Impact: 3x read capacity
Cost: $300/month
Effort: 1 week
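
Routing reads to replicas can be sketched as a small router that always returns the primary for writes and rotates through the replicas for reads. `ReplicaRouter` and the node names are hypothetical. A real setup would hold connection pools rather than plain objects and would more likely rely on the driver's or proxy's own routing.

```typescript
type DbNode = { name: string };

// Routes writes to the primary and spreads reads round-robin across replicas.
class ReplicaRouter {
  private primary: DbNode;
  private replicas: DbNode[];
  private next = 0;

  constructor(primary: DbNode, replicas: DbNode[]) {
    this.primary = primary;
    this.replicas = replicas;
  }

  forWrite(): DbNode {
    return this.primary;
  }

  forRead(): DbNode {
    // With no replicas configured, fall back to the primary.
    if (this.replicas.length === 0) return this.primary;
    const node = this.replicas[this.next];
    this.next = (this.next + 1) % this.replicas.length;
    return node;
  }
}

const router = new ReplicaRouter({ name: "primary" }, [
  { name: "replica-1" },
  { name: "replica-2" },
  { name: "replica-3" },
]);

const readTargets = [1, 2, 3, 4].map(() => router.forRead().name);
console.log(readTargets, router.forWrite().name);
```

Replicas typically lag the primary slightly, so reads that must observe a write that just happened may still need to go to the primary.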

2.2 Load Balancer + Multiple Servers

Problem: Single point of failure
Solution:
```text
ALB
 ├── Server 1
 ├── Server 2
 └── Server 3
```
Expected Impact: 3x throughput
Cost: $400/month
Effort: 1 week

Level 3: Architecture Changes (Months)

3.1 CDN for Static Assets

Problem: Slow asset delivery
Solution: CloudFront CDN
Expected Impact: 90% faster asset loads
Cost: $100/month
Effort: 1 week

3.2 Async Processing

Problem: Slow sync operations
Solution: Background job queues
```typescript
// Before: Sync
await sendEmail(user);
await processPayment(order);
await updateAnalytics(event);
return response; // Waits 5+ seconds

// After: Async
await queue.add("send-email", { userId });
await queue.add("process-payment", { orderId });
await queue.add("update-analytics", { event });
return response; // Returns immediately
```
Expected Impact: 80% faster responses
Cost: $50/month (SQS)
Effort: 2 weeks
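
The enqueue-then-return pattern above can be illustrated with a toy in-memory queue. `InMemoryQueue`, `handleRequest`, and the handler are hypothetical and exist only to show that enqueueing does no work on the request path. A production queue such as SQS adds durability, retries, and separate worker processes, none of which are modeled here.

```typescript
type Job = { name: string; payload: unknown };
type Handler = (payload: unknown) => Promise<void>;

class InMemoryQueue {
  private jobs: Job[] = [];
  private handlers = new Map<string, Handler>();

  // Register the worker-side handler for a job name.
  process(name: string, handler: Handler): void {
    this.handlers.set(name, handler);
  }

  // Enqueueing only records the job; no work runs on the request path.
  async add(name: string, payload: unknown): Promise<void> {
    this.jobs.push({ name, payload });
  }

  // Worker side: run every queued job and return how many were processed.
  async drain(): Promise<number> {
    let done = 0;
    while (this.jobs.length > 0) {
      const job = this.jobs.shift()!;
      const handler = this.handlers.get(job.name);
      if (!handler) throw new Error(`no handler for ${job.name}`);
      await handler(job.payload);
      done++;
    }
    return done;
  }

  get depth(): number {
    return this.jobs.length;
  }
}

const queue = new InMemoryQueue();
const sent: unknown[] = [];
queue.process("send-email", async (payload) => {
  sent.push(payload); // stands in for the slow email call
});

async function handleRequest(userId: string): Promise<string> {
  await queue.add("send-email", { userId });
  return "accepted"; // returns before the email is sent
}
```

After `handleRequest("u1")` resolves, `queue.depth` is 1 and nothing has been sent yet. The email goes out only when a worker calls `queue.drain()`. The same property means the response can no longer report the outcome of queued work, which matters for steps such as payment processing.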

Level 4: Data Layer Optimization (Months)

4.1 Database Sharding

Problem: Single DB too large
Solution: Shard by user_id
```text
Shard 1: user_id 0-24999
Shard 2: user_id 25000-49999
Shard 3: user_id 50000-74999
Shard 4: user_id 75000-99999
```
Expected Impact: 4x capacity
Cost: $1,200/month
Effort: 2 months
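
The range layout above maps directly to a lookup function. This is a minimal sketch: the `Shard` type, the shard names, and `shardFor` are hypothetical, and the ranges are copied from the list above.

```typescript
type Shard = { name: string; min: number; max: number };

// Inclusive user_id ranges, as listed in the sharding plan.
const shards: Shard[] = [
  { name: "shard-1", min: 0, max: 24999 },
  { name: "shard-2", min: 25000, max: 49999 },
  { name: "shard-3", min: 50000, max: 74999 },
  { name: "shard-4", min: 75000, max: 99999 },
];

function shardFor(userId: number): Shard {
  const shard = shards.find((s) => userId >= s.min && userId <= s.max);
  if (!shard) {
    throw new Error(`user_id ${userId} is outside all shard ranges`);
  }
  return shard;
}

console.log(shardFor(24999).name, shardFor(25000).name);
```

Fixed ranges keep routing trivial, but any user_id of 100000 or above has no home until a new range is added, so the range table needs to grow ahead of the ID space.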

4.2 Event-Driven Architecture

Problem: Tight coupling, cascading failures
Solution: Message broker (Kafka)
```text
Service A → Kafka → Service B
          ↘        ↗ Service C
```
Expected Impact: Better isolation, resilience
Cost: $500/month
Effort: 3 months
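
The isolation benefit can be shown with a toy in-process event bus: the publisher does not call its consumers directly, and one failing consumer does not prevent delivery to the others. `EventBus` and the topic name are hypothetical. This is not Kafka and models none of its persistence, partitioning, or consumer groups.

```typescript
type Listener = (event: unknown) => void;

class EventBus {
  private topics = new Map<string, Listener[]>();

  subscribe(topic: string, listener: Listener): void {
    const listeners = this.topics.get(topic) ?? [];
    listeners.push(listener);
    this.topics.set(topic, listeners);
  }

  // Delivers to every subscriber. A failing subscriber is recorded but does
  // not stop delivery to the others or propagate back to the publisher.
  publish(topic: string, event: unknown): string[] {
    const failures: string[] = [];
    for (const listener of this.topics.get(topic) ?? []) {
      try {
        listener(event);
      } catch (err) {
        failures.push(err instanceof Error ? err.message : String(err));
      }
    }
    return failures;
  }
}

const bus = new EventBus();
const receivedByC: unknown[] = [];

// Service B is failing; Service C is healthy.
bus.subscribe("order.created", () => {
  throw new Error("service B is down");
});
bus.subscribe("order.created", (event) => {
  receivedByC.push(event);
});

const failures = bus.publish("order.created", { orderId: 1 });
console.log({ failures, delivered: receivedByC.length });
```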

Scaling Triggers

| Metric           | Current | Warning | Critical | Action                  |
| ---------------- | ------- | ------- | -------- | ----------------------- |
| CPU              | 40%     | 70%     | 85%      | Add servers             |
| Memory           | 50%     | 75%     | 90%      | Upgrade instances       |
| DB Connections   | 20      | 40      | 50       | Add read replicas       |
| Query Time (p95) | 200ms   | 500ms   | 1000ms   | Add indexes             |
| Queue Depth      | 100     | 1000    | 5000     | Add workers             |
| Error Rate       | 0.1%    | 1%      | 5%       | Investigate immediately |
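
The thresholds in the table can be encoded as data and checked automatically. In this sketch the metric keys, the `Trigger` type, and the choice to treat a value equal to a threshold as breaching it are assumptions; the table does not say whether thresholds are inclusive.

```typescript
type Level = "ok" | "warning" | "critical";
type Trigger = { metric: string; warning: number; critical: number; action: string };

// Thresholds copied from the Scaling Triggers table.
const triggers: Trigger[] = [
  { metric: "cpu_percent", warning: 70, critical: 85, action: "Add servers" },
  { metric: "memory_percent", warning: 75, critical: 90, action: "Upgrade instances" },
  { metric: "db_connections", warning: 40, critical: 50, action: "Add read replicas" },
  { metric: "query_time_p95_ms", warning: 500, critical: 1000, action: "Add indexes" },
  { metric: "queue_depth", warning: 1000, critical: 5000, action: "Add workers" },
  { metric: "error_rate_percent", warning: 1, critical: 5, action: "Investigate immediately" },
];

// Assumption: reaching a threshold counts as breaching it.
function classify(trigger: Trigger, value: number): Level {
  if (value >= trigger.critical) return "critical";
  if (value >= trigger.warning) return "warning";
  return "ok";
}

// Returns one entry per reported metric that is at warning level or worse.
function evaluate(current: Record<string, number>) {
  return triggers
    .filter((t) => t.metric in current)
    .map((t) => ({
      metric: t.metric,
      level: classify(t, current[t.metric]),
      action: t.action,
    }))
    .filter((r) => r.level !== "ok");
}

// Hypothetical readings, for illustration only.
const alerts = evaluate({ cpu_percent: 40, query_time_p95_ms: 800, queue_depth: 6000 });
console.log(alerts);
```

Here CPU at 40% produces no alert, query time at 800ms is a warning, and a queue depth of 6000 is critical.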

Phased Scaling Plan

Phase 1: Current → 10x (0-3 months)

Target: 10,000 req/min, 100K users
Actions:
  1. Add database indexes (Week 1)
  2. Implement Redis caching (Week 2)
  3. Add 3x read replicas (Week 4)
  4. Horizontal scale app servers (Week 6)
  5. CDN for static assets (Week 8)
Cost: $500 → $1,000/month

Phase 2: 10x → 100x (3-12 months)

Target: 100,000 req/min, 1M users
Actions:
  1. Database sharding (Month 4-6)
  2. Multi-region deployment (Month 6-8)
  3. Microservices extraction (Month 8-12)
  4. Event-driven architecture (Month 10-12)
Cost: $1,000 → $10,000/month

Phase 3: 100x → 1000x (12-24 months)

Target: 1M req/min, 10M users
Actions:
  1. Global CDN (Month 13)
  2. Advanced caching (L1/L2) (Month 14-15)
  3. Custom DB solutions (Month 16-18)
  4. Edge computing (Month 18-20)
Cost: $10,000 → $100,000/month
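
One way to sanity-check the phase budgets is to convert each into a cost per million requests. The sketch below assumes a 30-day month, traffic sustained at the target rate for the whole month, and that the figure to the right of each arrow is the monthly cost once the phase target is reached. None of these assumptions are stated in the plan itself.

```typescript
const MINUTES_PER_MONTH = 60 * 24 * 30; // 43,200, assuming a 30-day month

function costPerMillionRequests(reqPerMin: number, monthlyCostUsd: number): number {
  const requestsPerMonth = reqPerMin * MINUTES_PER_MONTH;
  return (monthlyCostUsd / requestsPerMonth) * 1_000_000;
}

// Traffic targets and monthly costs taken from the phase plan above.
const stages = [
  { name: "Current", reqPerMin: 1_000, monthlyCostUsd: 500 },
  { name: "Phase 1", reqPerMin: 10_000, monthlyCostUsd: 1_000 },
  { name: "Phase 2", reqPerMin: 100_000, monthlyCostUsd: 10_000 },
  { name: "Phase 3", reqPerMin: 1_000_000, monthlyCostUsd: 100_000 },
];

for (const s of stages) {
  const cost = costPerMillionRequests(s.reqPerMin, s.monthlyCostUsd);
  console.log(`${s.name}: $${cost.toFixed(2)} per million requests`);
}
```

Under these assumptions the unit cost drops from about $11.57 to about $2.31 per million requests in Phase 1 and then stays flat, because cost and traffic both grow 10x in Phases 2 and 3.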

Load Testing Plan

```bash
# Current baseline
hey -n 10000 -c 100 https://api.example.com/users

# Target 10x
hey -n 100000 -c 1000 https://api.example.com/users

# Measure:
# - Requests/sec
# - p50, p95, p99 latency
# - Error rate
# - Resource utilization
```

Cost-Benefit Analysis

| Strategy      | Cost/Month | Expected Impact    | ROI | Priority |
| ------------- | ---------- | ------------------ | --- | -------- |
| DB Indexes    | $0         | 80% faster queries | n/a | HIGH     |
| Redis Cache   | $50        | 60% less DB load   | 12x | HIGH     |
| Read Replicas | $300       | 3x capacity        | 10x | MEDIUM   |
| Load Balancer | $400       | 3x throughput      | 7x  | MEDIUM   |
| DB Sharding   | $1,200     | 4x capacity        | 3x  | LOW      |

Best Practices

  1. Measure first: Don't optimize blindly
  2. Low-hanging fruit: Start with easy wins
  3. Load test: Validate before production
  4. Monitor continuously: Set up alerts
  5. Plan ahead: Scale before hitting limits
  6. Cost-conscious: ROI-driven decisions
  7. Incremental: Small, safe changes

Output Checklist

  • Current system profile
  • Bottlenecks identified and measured
  • Scaling strategies ordered by effort
  • Triggers defined for each action
  • Phased plan (1x → 10x → 100x)
  • Cost estimates per phase
  • Load testing plan
  • Monitoring dashboard
  • Rollback procedures