performance-optimization
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
ChinesePerformance Optimization Skill
性能优化技能
Apply these principles when optimizing code for performance. Focus on the critical 3% where performance truly matters - a 12% improvement is never marginal in engineering.
在为代码做性能优化时遵循以下原则。聚焦真正影响性能的关键3%代码——在工程领域,12%的性能提升绝非微不足道。
Core Philosophy
核心理念
- Measure First: Never optimize without profiling data
- Estimate Costs: Back-of-envelope calculations before implementation
- Avoid Work: The fastest code is code that doesn't run
- Reduce Allocations: Memory allocation is often the hidden bottleneck
- Cache Locality: Memory access patterns dominate modern performance
- 先测量:绝不未获取profiling数据就进行优化
- 估算成本:在实现前先做粗略的成本计算
- 避免冗余操作:最快的代码是无需运行的代码
- 减少内存分配:内存分配往往是隐藏的性能瓶颈
- 缓存局部性:内存访问模式决定了现代系统的性能表现
Reference Latency Numbers
参考延迟数值
Use these for back-of-envelope calculations:
| Operation | Latency |
|---|---|
| L1 cache reference | 0.5 ns |
| Branch mispredict | 5 ns |
| L2 cache reference | 7 ns |
| Mutex lock/unlock | 25 ns |
| Main memory reference | 100 ns |
| Compress 1KB (Snappy) | 3 us |
| SSD random read (4KB) | 20 us |
| Round trip in datacenter | 50 us |
| Disk seek | 5 ms |
以下数值可用于粗略计算:
| 操作 | 延迟 |
|---|---|
| L1 cache读取 | 0.5 ns |
| 分支预测失败 | 5 ns |
| L2 cache读取 | 7 ns |
| 互斥锁加锁/解锁 | 25 ns |
| 主内存读取 | 100 ns |
| 压缩1KB数据(Snappy) | 3 us |
| SSD随机读取(4KB) | 20 us |
| 数据中心内往返 | 50 us |
| 磁盘寻道 | 5 ms |
Optimization Techniques (Priority Order)
优化技术(优先级排序)
1. Algorithmic Improvements (Highest Impact)
1. 算法优化(影响最大)
Always check algorithm complexity first:
O(N^2) → O(N log N) = 1000x faster for N=1M
O(N) → O(1) = unbounded improvementCommon patterns:
- Replace nested loops with hash lookups
- Use sorted-list intersection → hash table lookups
- Process in reverse post-order to eliminate per-element checks
- Replace interval trees (O(log N)) with hash maps (O(1)) when ranges aren't needed
始终先检查算法复杂度:
O(N^2) → O(N log N) = 当N=1M时,速度提升1000倍
O(N) → O(1) = 性能提升无上限常见优化模式:
- 用哈希查找替换嵌套循环
- 将有序列表交集操作改为哈希表查找
- 采用后序逆序处理以消除逐元素检查
- 当无需范围查询时,用哈希表(O(1))替换区间树(O(log N))
2. Reduce Memory Allocations
2. 减少内存分配
Allocation is expensive (~25-100ns + GC pressure)
python
undefined内存分配成本高昂(约25-100ns + GC压力)
python
undefinedBAD: Allocates on every call
不佳:每次调用都会分配内存
def process(items):
result = [] # New allocation
for item in items:
result.append(transform(item))
return result
def process(items):
result = [] # 新分配内存
for item in items:
result.append(transform(item))
return result
GOOD: Pre-allocate or reuse
推荐:预分配或复用内存
def process(items, out=None):
if out is None:
out = [None] * len(items)
for i, item in enumerate(items):
out[i] = transform(item)
return out
**Techniques:**
- Pre-size containers with `reserve()` or known capacity
- Hoist temporary containers outside loops
- Reuse buffers across iterations (clear instead of recreate)
- Move instead of copy large structures
- Use stack allocation for bounded-lifetime objectsdef process(items, out=None):
if out is None:
out = [None] * len(items)
for i, item in enumerate(items):
out[i] = transform(item)
return out
**优化技巧:**
- 用`reserve()`或已知容量预先设置容器大小
- 将临时容器移到循环外部
- 跨迭代复用缓冲区(清空而非重建)
- 移动而非复制大型结构
- 为有限生命周期的对象使用栈分配3. Compact Data Structures
3. 紧凑数据结构
Minimize memory footprint and cache lines touched:
rust
// BAD: 24 bytes due to padding
struct Item {
flag: bool, // 1 byte + 7 padding
value: i64, // 8 bytes
count: i32, // 4 bytes + 4 padding
}
// GOOD: 16 bytes with reordering
struct Item {
value: i64, // 8 bytes
count: i32, // 4 bytes
flag: bool, // 1 byte + 3 padding
}Techniques:
- Reorder struct fields by size (largest first)
- Use smaller numeric types when range permits (u8 vs i32)
- Replace 64-bit pointers with 32-bit indices
- Keep hot fields together, move cold fields to separate struct
- Use bit vectors for small-domain sets
- Flatten nested maps: →
Map<A, Map<B, C>>Map<(A,B), C>
最小化内存占用和缓存行命中次数:
rust
// 不佳:因内存对齐占用24字节
struct Item {
flag: bool, // 1字节 + 7字节填充
value: i64, // 8字节
count: i32, // 4字节 + 4字节填充
}
// 推荐:重新排序后仅占用16字节
struct Item {
value: i64, // 8字节
count: i32, // 4字节
flag: bool, // 1字节 + 3字节填充
}优化技巧:
- 按字段大小从大到小重新排序结构体字段
- 在范围允许时使用更小的数值类型(如u8替代i32)
- 用32位索引替换64位指针
- 将热点字段放在一起,冷字段移至单独结构体
- 对小值域集合使用位向量
- 扁平化嵌套映射:→
Map<A, Map<B, C>>Map<(A,B), C>
4. Fast Paths for Common Cases
4. 常见场景快速路径
Optimize the common case without hurting rare cases:
python
undefined优化常见场景且不影响罕见场景:
python
undefinedBAD: Always takes slow path
不佳:始终走慢路径
def parse_varint(data):
return generic_varint_parser(data)
def parse_varint(data):
return generic_varint_parser(data)
GOOD: Fast path for common 1-byte case
推荐:为常见的单字节场景提供快速路径
def parse_varint(data):
if data[0] < 128: # Single byte - 90% of cases
return data[0], 1
return generic_varint_parser(data) # Rare multi-byte
**Techniques:**
- Handle common dimensions inline (1-D, 2-D tensors)
- Check for empty/trivial inputs early
- Specialize for common sizes (small strings, few elements)
- Process trailing elements separately to avoid slow generic codedef parse_varint(data):
if data[0] < 128: # 单字节场景占90%
return data[0], 1
return generic_varint_parser(data) # 罕见的多字节场景
**优化技巧:**
- 直接处理常见维度(如1D、2D张量)
- 提前检查空输入或简单输入
- 针对常见大小做专门优化(如短字符串、少量元素)
- 单独处理末尾元素以避免慢通用代码5. Precompute Expensive Information
5. 预计算昂贵信息
Trade memory for compute when beneficial:
python
undefined在收益大于成本时用内存换取计算资源:
python
undefinedBAD: Recomputes on every access
不佳:每次访问都重新计算
def is_vowel(char):
return char.lower() in 'aeiou'
def is_vowel(char):
return char.lower() in 'aeiou'
GOOD: Lookup table
推荐:使用查找表
VOWEL_TABLE = [c.lower() in 'aeiou' for c in (chr(i) for i in range(256))]
def is_vowel(char):
return VOWEL_TABLE[ord(char)]
**Techniques:**
- Precompute flags/properties at construction time
- Build lookup tables for character classification
- Cache expensive computed properties
- Validate at boundaries once, not repeatedly insideVOWEL_TABLE = [c.lower() in 'aeiou' for c in (chr(i) for i in range(256))]
def is_vowel(char):
return VOWEL_TABLE[ord(char)]
**优化技巧:**
- 在构造阶段预计算标志/属性
- 为字符分类构建查找表
- 缓存开销大的计算属性
- 在边界处验证一次,而非在内部重复验证6. Bulk/Batch APIs
6. 批量API调用
Amortize fixed costs across multiple operations:
python
undefined将固定成本分摊到多个操作上:
python
undefinedBAD: N round trips
不佳:N次往返
for item in items:
result = db.lookup(item)
for item in items:
result = db.lookup(item)
GOOD: 1 round trip
推荐:1次往返
results = db.lookup_many(items)
**Design APIs that support:**
- Batch lookups instead of individual fetches
- Vectorized operations over loops
- Streaming interfaces for large datasetsresults = db.lookup_many(items)
**设计支持以下特性的API:**
- 批量查找替代单次获取
- 用向量化操作替代循环
- 为大型数据集提供流式接口7. Avoid Unnecessary Work
7. 避免冗余操作
python
undefinedpython
undefinedBAD: Always computes expensive value
不佳:始终计算昂贵的值
def process(data, config):
expensive = compute_expensive(data) # Always runs
if config.needs_expensive:
use(expensive)
def process(data, config):
expensive = compute_expensive(data) # 始终执行
if config.needs_expensive:
use(expensive)
GOOD: Defer until needed
推荐:延迟计算直到需要时
def process(data, config):
if config.needs_expensive:
expensive = compute_expensive(data) # Only when needed
use(expensive)
**Techniques:**
- Lazy evaluation for expensive operations
- Short-circuit evaluation in conditions
- Move loop-invariant code outside loops
- Specialize instead of using general-purpose libraries in hot pathsdef process(data, config):
if config.needs_expensive:
expensive = compute_expensive(data) # 仅在需要时执行
use(expensive)
**优化技巧:**
- 对昂贵操作使用延迟求值
- 条件判断中使用短路求值
- 将循环不变代码移到循环外部
- 在热点路径中使用专门实现而非通用库8. Help the Compiler/Runtime
8. 协助编译器/运行时优化
Lower-level optimizations when profiling shows need:
rust
// Avoid function call overhead in hot loops
#[inline(always)]
fn hot_function(x: i32) -> i32 { x * 2 }
// Copy to local variable for better alias analysis
fn process(data: &mut [i32], factor: &i32) {
let f = *factor; // Compiler knows this won't change
for x in data {
*x *= f;
}
}Techniques:
- Use raw pointers/indices instead of iterators in critical loops
- Hand-unroll very hot loops (4+ iterations)
- Move slow-path code to separate non-inlined functions
- Avoid abstractions that hide costs in hot paths
当profiling显示需要时进行底层优化:
rust
// 避免热点循环中的函数调用开销
#[inline(always)]
fn hot_function(x: i32) -> i32 { x * 2 }
// 复制到局部变量以提升别名分析效果
fn process(data: &mut [i32], factor: &i32) {
let f = *factor; // 编译器知道该值不会改变
for x in data {
*x *= f;
}
}优化技巧:
- 在关键循环中使用原始指针/索引而非迭代器
- 手动展开热点循环(4次及以上迭代)
- 将慢路径代码移至单独的非内联函数
- 避免在热点路径中使用隐藏成本的抽象
9. Reduce Synchronization
9. 减少同步操作
Minimize lock contention and atomic operations:
- Default to thread-compatible (external sync) not thread-safe
- Sample statistics (1-in-32) instead of tracking everything
- Batch updates to shared state
- Use thread-local storage for per-thread data
最小化锁竞争和原子操作:
- 默认使用线程兼容(外部同步)而非线程安全实现
- 采用抽样统计(如每32次操作抽样1次)而非跟踪所有操作
- 批量更新共享状态
- 为线程本地数据使用线程本地存储(TLS)
Profiling Workflow
性能分析工作流
- Identify hotspots: Use profiler (pprof, perf, py-spy)
- Measure baseline: Write benchmark before optimizing
- Estimate improvement: Calculate expected gain
- Implement change: Focus on one optimization at a time
- Verify improvement: Run benchmark, confirm gain
- Check for regressions: Ensure no other code paths slowed down
- 定位热点:使用性能分析工具(pprof, perf, py-spy)
- 测量基准:在优化前编写benchmark
- 估算提升效果:计算预期性能增益
- 实现优化:每次只聚焦一项优化
- 验证提升:运行benchmark,确认性能提升
- 检查退化:确保其他代码路径未变慢
When Profile is Flat (No Clear Hotspots)
当性能分析结果无明显热点时
- Pursue multiple small 1-2% improvements collectively
- Look for restructuring opportunities higher in call stack
- Profile allocations specifically (often hidden cost)
- Gather hardware performance counters (cache misses, branch mispredicts)
- Consider if the code is already well-optimized
- 同时进行多个1-2%的小幅优化,累积收益
- 寻找调用栈上层的重构机会
- 专门分析内存分配情况(往往是隐藏成本)
- 收集硬件性能计数器数据(缓存未命中、分支预测失败等)
- 评估代码是否已处于最优状态
Anti-Patterns to Avoid
需避免的反模式
- Premature optimization in non-critical code
- Optimizing without measurement - changes based on intuition
- Micro-optimizing while ignoring algorithmic complexity
- Copying when moving would suffice
- Growing containers one element at a time (quadratic)
- Allocating in loops when reuse is possible
- String formatting in hot paths (use pre-built templates)
- Regex when simple string matching suffices
- 过早优化:在非关键代码中进行优化
- 无测量优化:仅凭直觉修改代码
- 忽略算法复杂度的微优化:只做微优化却忽略算法层面的问题
- 可移动时却复制:本可移动对象却进行复制操作
- 逐个元素扩展容器:导致二次方时间复杂度
- 循环内分配内存:本可复用内存却重复分配
- 热点路径中进行字符串格式化:应使用预构建模板
- 能用简单字符串匹配却用正则表达式:过度使用Regex
Estimation Template
估算模板
Before optimizing, estimate:
Operation cost: ___ ns/us/ms
Frequency: ___ times per second/request
Total time: cost × frequency = ___
Improvement target: ___% reduction
Expected new time: ___
Is this worth it? [ ] Yes [ ] No优化前,先进行估算:
操作成本: ___ ns/us/ms
执行频率: ___ 次/秒/请求
总耗时: 成本 × 频率 = ___
优化目标: ___% 耗时减少
预期新耗时: ___
是否值得优化? [ ] 是 [ ] 否