/scalability: Scalable Design Enforcement
Every design, plan, and implementation MUST handle current load efficiently AND accommodate 10x growth without architectural changes. Design for the load you expect in 18 months, not the load you have today.
Why this matters: Systems that aren't designed to scale hit walls — and those walls always appear at the worst time (launch day, viral moment, enterprise customer onboarding). Retrofitting scalability is 10-100x more expensive than building it in.
When to invoke: During PLANNING (after brainstorming, before or alongside writing-plans) and during REVIEW (as part of code review criteria). This skill applies to both new code and modifications to existing code.
The Rules
Rule 1: Stateless by Default
Every service, function, and handler MUST be stateless unless there is an explicit, documented reason for state.
- No in-memory state that would break with multiple instances.
- No local file system for data that must survive restarts or be shared.
- Session state goes in a shared store (Redis, database), never in process memory.
- Caches must be external (Redis, Memcached) or, if kept in process, must have an invalidation strategy that works across multiple instances.
Test: Can you run 5 instances of this service behind a load balancer without any instance depending on another instance's memory or local disk? If not, fix it.
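The five-instance test can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: `SessionStore`, `FakeSharedStore`, and `CartService` are hypothetical names, and the fake store is an in-memory stand-in for Redis or a database so the example runs without a server.

```python
from typing import Dict, Optional, Protocol


class SessionStore(Protocol):
    """Anything that can persist sessions outside the handler's process."""

    def get(self, session_id: str) -> Optional[dict]: ...
    def put(self, session_id: str, data: dict) -> None: ...


class FakeSharedStore:
    """Stand-in for Redis or a database (assumption: one store shared by all instances)."""

    def __init__(self) -> None:
        self._data: Dict[str, dict] = {}

    def get(self, session_id: str) -> Optional[dict]:
        return self._data.get(session_id)

    def put(self, session_id: str, data: dict) -> None:
        self._data[session_id] = dict(data)


class CartService:
    """One service instance. It holds a reference to the store, never the session data."""

    def __init__(self, store: SessionStore) -> None:
        self._store = store

    def add_item(self, session_id: str, item: str) -> int:
        # Read, modify, write back. A real store would need an atomic update here.
        session = self._store.get(session_id) or {"items": []}
        session["items"] = session["items"] + [item]
        self._store.put(session_id, session)
        return len(session["items"])


store = FakeSharedStore()
instances = [CartService(store) for _ in range(5)]

# Consecutive requests from the same user land on different instances.
instances[0].add_item("s1", "book")
count = instances[3].add_item("s1", "pen")
print(count)
```

Because no instance keeps the cart in its own memory, the second request sees the item added by the first even though a different instance served it.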
Rule 2: Efficient Data Access
Every database query and data access pattern MUST be designed for scale:
| Pattern | Requirement |
|---|---|
| Queries | Must use indexes. No full table scans on tables that will grow. |
| Pagination | Required for any list endpoint. No unbounded queries. |
| N+1 queries | Forbidden. Use joins, batch loading, or dataloader patterns. |
| Write amplification | Minimize. Don't update entire records when one field changes. |
| Connection pooling | Required. Never open/close connections per request. |
| Read replicas | Design for eventual consistency where appropriate. |
Test: Run EXPLAIN on every query. If it says "full table scan" on a table with >10K rows, add an index.
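As one way to run that check, the sketch below uses SQLite's `EXPLAIN QUERY PLAN` through Python's standard `sqlite3` module. The `users` table, the `idx_users_email` index, and the `query_plan` helper are hypothetical, and other databases word their plan output differently, so treat the string checks as SQLite-specific.

```python
import sqlite3


def query_plan(conn: sqlite3.Connection, sql: str, params: tuple = ()) -> list:
    """Return the human-readable detail column of SQLite's query plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # Each plan row has four columns; the last one is the description.
    return [row[3] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")

sql = "SELECT id, name FROM users WHERE email = ?"

# Without an index, SQLite reports a SCAN of the whole table.
before = query_plan(conn, sql, ("a@example.com",))

# After adding an index on the filtered column, the plan becomes a SEARCH.
conn.execute("CREATE INDEX idx_users_email ON users (email)")
after = query_plan(conn, sql, ("a@example.com",))

print(before)
print(after)
```

A check like this can run in a test suite, so a query that falls back to a scan fails the build instead of surfacing in production.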
Rule 3: Async Where Possible
Any operation that doesn't need an immediate response MUST be asynchronous:
- Email/SMS sending — queue it.
- Report generation — queue it, notify on completion.
- External API calls — if the user doesn't need the result immediately, queue it.
- Data processing — stream or batch, never block the request.
- File uploads — accept, acknowledge, process asynchronously.
Synchronous is acceptable for: auth checks, data reads <100ms, input validation.
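The "queue it" pattern can be sketched with the standard library. This is an illustration under assumptions: `handle_signup` and `send_email` are hypothetical, and the in-process `queue.Queue` stands in for a durable broker, which a real multi-instance deployment would need so that jobs survive restarts.

```python
import queue
import threading

# Bounded queue: when it is full the handler rejects instead of growing without limit.
email_queue: queue.Queue = queue.Queue(maxsize=1000)
sent = []


def send_email(message: dict) -> None:
    """Hypothetical slow call to an email provider."""
    sent.append(message["to"])


def worker() -> None:
    """Drain the queue until a None sentinel arrives."""
    while True:
        message = email_queue.get()
        try:
            if message is None:
                return
            send_email(message)
        finally:
            email_queue.task_done()


def handle_signup(email: str) -> dict:
    """Request handler: enqueue the work and acknowledge immediately."""
    try:
        email_queue.put_nowait({"to": email, "template": "welcome"})
    except queue.Full:
        return {"status": 503, "detail": "try again later"}
    return {"status": 202, "detail": "queued"}


thread = threading.Thread(target=worker, daemon=True)
thread.start()

response = handle_signup("user@example.com")
email_queue.join()      # wait for the worker, only so the demo is deterministic
email_queue.put(None)   # stop the worker
thread.join(timeout=1)
print(response, sent)
```

The handler never waits on the email provider; it returns as soon as the job is accepted or refused.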
Rule 4: Caching Strategy
Every read-heavy path MUST have a caching strategy:
| Cache layer | TTL | Use when |
|---|---|---|
| HTTP cache (CDN, browser) | Minutes to hours | Static assets, API responses that change infrequently |
| Application cache (Redis) | Seconds to minutes | Computed results, session data, frequent queries |
| Database query cache | Seconds | Identical queries hitting the DB frequently |
| No cache | — | Write paths, real-time data, personalized content |
Every cache MUST have:
- A defined TTL (no infinite caches).
- An invalidation strategy (time-based, event-based, or both).
- A cache-miss path that works correctly (no assumption that cache is always warm).
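The three requirements above (TTL, invalidation, a working miss path) can be shown in one small class. `TTLCache` and `load_profile` are hypothetical names, and the in-process dictionary is only a stand-in; with an external cache such as Redis the same shape applies, with the TTL set on the stored key.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Cache where every entry expires and every miss falls through to a loader."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get_or_load(self, key: str, loader: Callable[[], Any]) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]              # warm hit
        value = loader()                 # miss or expired: the slow path must still work
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key: str) -> None:
        """Event-based invalidation, called when the underlying data changes."""
        self._entries.pop(key, None)


# A controllable clock makes expiry visible without sleeping.
fake_now = [0.0]
calls = []


def load_profile() -> dict:
    calls.append(1)
    return {"name": "Ada"}


cache = TTLCache(ttl_seconds=30, clock=lambda: fake_now[0])
cache.get_or_load("user:1", load_profile)   # miss: loads
cache.get_or_load("user:1", load_profile)   # hit: no load
fake_now[0] = 31.0
cache.get_or_load("user:1", load_profile)   # expired: loads again
cache.invalidate("user:1")
cache.get_or_load("user:1", load_profile)   # invalidated: loads again
print(len(calls))
```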
Rule 5: Resource Limits
Every resource consumer MUST have explicit limits:
| Resource | Limit | What happens at limit |
|---|---|---|
| HTTP request body | Max size (e.g., 10MB) | 413 Payload Too Large |
| Query results | Max rows (e.g., 1000) | Pagination required |
| Batch operations | Max batch size (e.g., 100) | Split into chunks |
| Concurrent connections | Pool size (e.g., 20) | Queue or reject |
| Background jobs | Max concurrent (e.g., 10) | Queue with backpressure |
| File uploads | Max size + count | Reject with clear error |
No unbounded anything. Every loop, query, queue, and buffer has a maximum.
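Two rows of the table, the request body limit and the batch size limit, might look like this in code. The constants reuse the example values from the table; `PayloadTooLarge`, `check_body_size`, and `chunked` are hypothetical helpers, not part of any framework.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")

MAX_BODY_BYTES = 10 * 1024 * 1024   # 10MB, the example value from the table
MAX_BATCH_SIZE = 100                # example value from the table


class PayloadTooLarge(Exception):
    """Maps to HTTP 413 in whatever framework wraps this check."""

    status_code = 413


def check_body_size(body: bytes, limit: int = MAX_BODY_BYTES) -> bytes:
    """Reject oversized bodies before any parsing happens."""
    if len(body) > limit:
        raise PayloadTooLarge(f"body is {len(body)} bytes, limit is {limit}")
    return body


def chunked(items: Sequence[T], size: int = MAX_BATCH_SIZE) -> Iterator[List[T]]:
    """Split a large batch into pieces no bigger than the limit."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


batches = list(chunked(list(range(250))))
print([len(batch) for batch in batches])
```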
Rule 6: Horizontal Scaling Design
Architecture MUST support horizontal scaling:
- No singleton dependencies — no "there can be only one instance" of any service.
- Idempotent operations — safe to retry, safe to run in parallel.
- Distributed locking only when absolutely necessary (and with TTL).
- Event-driven over request-driven for inter-service communication.
- Partitionable data — design schemas so data can be sharded by tenant, region, or time.
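Idempotency is often implemented with a caller-supplied key, which is the approach sketched here. `PaymentLedger` is hypothetical and keeps results in a dictionary for brevity; a real service would persist them in a shared store and rely on an atomic insert-if-absent (for example a unique constraint) so that parallel retries cannot both pass the check.

```python
from typing import Dict


class PaymentLedger:
    """Records each charge once per idempotency key, no matter how often it is retried."""

    def __init__(self) -> None:
        self._results: Dict[str, dict] = {}
        self.charges_made = 0

    def charge(self, idempotency_key: str, account: str, amount_cents: int) -> dict:
        previous = self._results.get(idempotency_key)
        if previous is not None:
            # Retry of a request already handled: return the stored result unchanged.
            return previous
        self.charges_made += 1
        result = {
            "account": account,
            "amount_cents": amount_cents,
            "charge_no": self.charges_made,
        }
        self._results[idempotency_key] = result
        return result


ledger = PaymentLedger()
first = ledger.charge("req-42", "acct-1", 1999)
retry = ledger.charge("req-42", "acct-1", 1999)   # e.g. the client timed out and retried
other = ledger.charge("req-43", "acct-1", 500)
print(ledger.charges_made)
```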
Rule 7: Performance Budgets
Every user-facing operation MUST have a performance budget:
| Operation type | Budget |
|---|---|
| API response (P95) | <200ms |
| Page load (LCP) | <2.5s |
| Database query | <50ms |
| Background job start | <1s from event |
| Search | <500ms |
If an operation exceeds its budget, it MUST be optimized before shipping. "It works" is not the same as "it scales."
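A budget only helps if something checks it. The sketch below computes a nearest-rank percentile over latency samples and compares it with the 200ms API budget from the table; `percentile`, `within_budget`, and the sample data are illustrative assumptions, and a real setup would usually read these numbers from its metrics system.

```python
from typing import Sequence

API_P95_BUDGET_MS = 200.0   # from the budget table above


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100   # integer ceiling of pct/100 * n
    return ordered[max(rank, 1) - 1]


def within_budget(samples_ms: Sequence[float], budget_ms: float, pct: int = 95) -> bool:
    """True when the chosen percentile is strictly under the budget."""
    return percentile(samples_ms, pct) < budget_ms


# 20 samples: one slow outlier does not move P95, two of them do.
healthy = [120.0] * 18 + [180.0, 950.0]
degraded = [120.0] * 17 + [180.0, 950.0, 950.0]

print(percentile(healthy, 95), within_budget(healthy, API_P95_BUDGET_MS))
print(percentile(degraded, 95), within_budget(degraded, API_P95_BUDGET_MS))
```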
Applying This Skill
During Planning (brainstorming / writing-plans)
Before finalizing any design or plan, run the Scalability Checklist:
- All services are stateless (or state is externalized with justification)
- All database queries use indexes and pagination where appropriate
- Long-running operations are asynchronous
- Read-heavy paths have a caching strategy with TTL and invalidation
- All resource consumers have explicit limits
- Architecture supports horizontal scaling (no singletons, idempotent operations)
- Performance budgets are defined for user-facing operations
If any item fails: redesign before proceeding to implementation.
During Implementation (executing-plans)
As you write code:
- Run EXPLAIN on new queries. Add indexes proactively.
- Add pagination to every list endpoint from day one.
- Set explicit timeouts on every external call (HTTP, DB, cache).
- Add resource limits to every input (body size, array length, string length).
- Use connection pooling for every external resource.
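Pagination from day one and limits on every input can be combined in a single list function. The sketch uses keyset (cursor) pagination over SQLite; the `orders` table, `list_orders`, and the page-size constants are hypothetical, with the maximum of 1000 rows taken from the example in Rule 5.

```python
import sqlite3
from typing import Optional

DEFAULT_PAGE_SIZE = 50
MAX_PAGE_SIZE = 1000   # example cap from the resource limits table


def list_orders(conn: sqlite3.Connection, after_id: int = 0, limit: int = DEFAULT_PAGE_SIZE) -> dict:
    """Return one page of orders plus a cursor for the next page."""
    # Clamp the caller's limit so no request can ask for an unbounded result.
    limit = max(1, min(limit, MAX_PAGE_SIZE))
    # Fetch one extra row to learn whether another page exists.
    rows = conn.execute(
        "SELECT id, total_cents FROM orders WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, limit + 1),
    ).fetchall()
    page = rows[:limit]
    next_cursor: Optional[int] = page[-1][0] if len(rows) > limit else None
    return {"items": page, "next_cursor": next_cursor}


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total_cents INTEGER)")
conn.executemany(
    "INSERT INTO orders (total_cents) VALUES (?)",
    [(100 * n,) for n in range(1, 8)],   # seven orders with ids 1..7
)

first = list_orders(conn, limit=3)
second = list_orders(conn, after_id=first["next_cursor"], limit=3)
third = list_orders(conn, after_id=second["next_cursor"], limit=3)
print(first["next_cursor"], second["next_cursor"], third["next_cursor"])
```

Filtering on `id > ?` uses the primary key index, so the cost of a page does not grow with how deep into the list the client has paged, which is the usual reason to prefer a cursor over `OFFSET`.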
During Review (code-review / receiving-code-review)
Verify these as part of every code review:
- No unbounded queries or loops
- No in-process state that breaks with multiple instances
- Proper caching with TTL and invalidation
- Async processing for non-immediate operations
- Resource limits on all inputs
- Performance budgets documented and met
When Modifying Existing Code
If existing code violates these rules:
- You are NOT required to fix all scalability issues in unrelated code.
- You ARE required to not make scalability worse.
- If adding a new query to an endpoint, ensure it's indexed and paginated.
- If adding a new external dependency, ensure it has timeouts and connection pooling.
Anti-Patterns
| Pattern | Problem | Fix |
|---|---|---|
| In-memory sessions | Breaks with multiple instances | External session store |
| Unbounded queries | Memory explosion at scale | Pagination + limits |
| Synchronous emails | Request blocked for seconds | Queue + async worker |
| No connection pooling | Connection exhaustion under load | Pool with limits |
| Cache without TTL | Stale data forever | TTL + invalidation strategy |
| SELECT * | Transfers unnecessary data | Select only needed columns |
| Fat payloads | Network bottleneck | Paginate, compress, or stream |
Rationalization Prevention
| Excuse | Reality |
|---|---|
| "We only have 100 users" | You'll have 10,000 before you know it. Design now. |
| "We can optimize later" | Optimization is cheap. Redesigning architecture is not. |
| "Premature optimization" | Scalability design ≠ micro-optimization. These are architectural. |
| "It's fast enough on my machine" | Your machine has 1 user. Production has thousands. |
| "We'll add caching when we need it" | By then you'll need it urgently. Design the strategy now. |
| "This is just an internal tool" | Internal tools scale with the company. Design accordingly. |