
Performance Optimization Skill

Apply these principles when optimizing code for performance. Focus on the critical 3% of code where performance truly matters; in engineering, a 12% improvement is never marginal.

Core Philosophy


  1. Measure First: Never optimize without profiling data
  2. Estimate Costs: Back-of-envelope calculations before implementation
  3. Avoid Work: The fastest code is code that doesn't run
  4. Reduce Allocations: Memory allocation is often the hidden bottleneck
  5. Cache Locality: Memory access patterns dominate modern performance

Reference Latency Numbers


Use these for back-of-envelope calculations:

| Operation | Latency |
|---|---|
| L1 cache reference | 0.5 ns |
| Branch mispredict | 5 ns |
| L2 cache reference | 7 ns |
| Mutex lock/unlock | 25 ns |
| Main memory reference | 100 ns |
| Compress 1 KB (Snappy) | 3 us |
| SSD random read (4 KB) | 20 us |
| Round trip in datacenter | 50 us |
| Disk seek | 5 ms |

Optimization Techniques (Priority Order)


1. Algorithmic Improvements (Highest Impact)


Always check algorithm complexity first:
O(N^2) → O(N log N)  = ~50,000x faster for N=1M (10^12 vs ~2×10^7 operations)
O(N)   → O(1)        = unbounded improvement
Common patterns:
  • Replace nested loops with hash lookups
  • Replace sorted-list intersection with hash-table lookups
  • Process in reverse post-order to eliminate per-element checks
  • Replace interval trees (O(log N)) with hash maps (O(1)) when range queries aren't needed
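The first pattern above can be sketched in Python (function names are illustrative, not from the original):

```python
# Sketch: replace an O(N*M) nested-loop intersection with O(N+M) hash lookups.
def common_ids_slow(xs, ys):
    # Nested loops: every x is compared against every y
    return [x for x in xs for y in ys if x == y]   # O(N*M)

def common_ids_fast(xs, ys):
    ys_set = set(ys)                               # O(M) build, O(1) lookups
    return [x for x in xs if x in ys_set]          # O(N)

assert common_ids_slow([1, 2, 3], [2, 3, 4]) == common_ids_fast([1, 2, 3], [2, 3, 4]) == [2, 3]
```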

2. Reduce Memory Allocations


Allocation is expensive (~25-100 ns per allocation, plus GC pressure):

```python
# BAD: allocates on every call
def process(items):
    result = []  # new allocation
    for item in items:
        result.append(transform(item))
    return result

# GOOD: pre-allocate, or let the caller pass a reusable buffer
def process(items, out=None):
    if out is None:
        out = [None] * len(items)
    for i, item in enumerate(items):
        out[i] = transform(item)
    return out
```

**Techniques:**
- Pre-size containers with `reserve()` or known capacity
- Hoist temporary containers outside loops
- Reuse buffers across iterations (clear instead of recreate)
- Move instead of copy large structures
- Use stack allocation for bounded-lifetime objects
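The buffer-reuse point can be sketched as follows (`batches_sum` is an illustrative name, not from the original):

```python
# Sketch: hoist a temporary buffer out of the loop and clear it per
# iteration, instead of allocating a fresh list for every batch.
def batches_sum(batches):
    buf = []                  # hoisted: one allocation for all batches
    totals = []
    for batch in batches:
        buf.clear()           # reuse: clear instead of recreate
        for x in batch:
            buf.append(x * 2)
        totals.append(sum(buf))
    return totals
```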

3. Compact Data Structures


Minimize memory footprint and the number of cache lines touched:

```rust
// BAD: 24 bytes due to padding (repr(C) preserves declared field order;
// by default Rust may reorder fields itself)
#[repr(C)]
struct Item {
    flag: bool,      // 1 byte + 7 padding
    value: i64,      // 8 bytes
    count: i32,      // 4 bytes + 4 padding
}

// GOOD: 16 bytes with reordering (largest fields first)
#[repr(C)]
struct ItemReordered {
    value: i64,      // 8 bytes
    count: i32,      // 4 bytes
    flag: bool,      // 1 byte + 3 padding
}
```
Techniques:
  • Reorder struct fields by size (largest first)
  • Use smaller numeric types when range permits (u8 vs i32)
  • Replace 64-bit pointers with 32-bit indices
  • Keep hot fields together, move cold fields to separate struct
  • Use bit vectors for small-domain sets
  • Flatten nested maps:
    Map<A, Map<B, C>> → Map<(A,B), C>
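The map-flattening pattern can be sketched in Python with tuple keys (the data here is illustrative):

```python
# Sketch: flatten Map<A, Map<B, C>> into Map<(A, B), C>, so each access
# is a single hash lookup instead of two, with no per-A inner dict.
nested = {"a": {1: "x", 2: "y"}, "b": {1: "z"}}

flat = {(a, b): c for a, inner in nested.items() for b, c in inner.items()}

assert flat[("a", 2)] == nested["a"][2] == "y"
```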

4. Fast Paths for Common Cases


Optimize the common case without hurting rare cases:

```python
# BAD: always takes the slow, generic path
def parse_varint(data):
    return generic_varint_parser(data)

# GOOD: fast path for the common 1-byte case
def parse_varint(data):
    if data[0] < 128:  # single byte - 90% of cases
        return data[0], 1
    return generic_varint_parser(data)  # rare multi-byte case
```

**Techniques:**
- Handle common dimensions inline (1-D, 2-D tensors)
- Check for empty/trivial inputs early
- Specialize for common sizes (small strings, few elements)
- Process trailing elements separately to avoid slow generic code

5. Precompute Expensive Information


Trade memory for compute when beneficial:

```python
# BAD: recomputes on every access
def is_vowel(char):
    return char.lower() in 'aeiou'

# GOOD: precomputed lookup table, indexed by character code
VOWEL_TABLE = [chr(i).lower() in 'aeiou' for i in range(256)]

def is_vowel(char):
    return VOWEL_TABLE[ord(char)]
```

**Techniques:**
- Precompute flags/properties at construction time
- Build lookup tables for character classification
- Cache expensive computed properties
- Validate at boundaries once, not repeatedly inside
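The "cache expensive computed properties" point can be sketched with the standard library's `functools.cached_property` (the `Document` class is an illustrative example, not from the original):

```python
from functools import cached_property

class Document:
    def __init__(self, text):
        self.text = text

    @cached_property
    def word_count(self):
        # Computed on first access, then stored on the instance;
        # subsequent accesses are a plain attribute read.
        return len(self.text.split())
```

Note that `cached_property` is only safe when the underlying data is effectively immutable after construction.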

6. Bulk/Batch APIs


Amortize fixed costs across multiple operations:

```python
# BAD: N round trips
for item in items:
    result = db.lookup(item)

# GOOD: 1 round trip
results = db.lookup_many(items)
```

**Design APIs that support:**
- Batch lookups instead of individual fetches
- Vectorized operations over loops
- Streaming interfaces for large datasets
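The amortization effect can be made concrete with a toy in-memory store (`FakeDB` and its fixed per-call cost are hypothetical, purely for illustration):

```python
# Sketch: a fixed per-call cost is paid once per batch, not once per item.
class FakeDB:
    ROUND_TRIP_COST = 1  # fixed cost per call, arbitrary units

    def __init__(self, table):
        self.table = table
        self.cost = 0

    def lookup(self, key):
        self.cost += self.ROUND_TRIP_COST      # one round trip per item
        return self.table[key]

    def lookup_many(self, keys):
        self.cost += self.ROUND_TRIP_COST      # one round trip for the batch
        return [self.table[k] for k in keys]

db = FakeDB({"a": 1, "b": 2, "c": 3})
_ = [db.lookup(k) for k in "abc"]              # pays 3 round trips
per_item_cost = db.cost

db.cost = 0
_ = db.lookup_many(list("abc"))                # pays 1 round trip
assert db.cost < per_item_cost
```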

7. Avoid Unnecessary Work


```python
# BAD: always computes the expensive value
def process(data, config):
    expensive = compute_expensive(data)  # always runs
    if config.needs_expensive:
        use(expensive)

# GOOD: defer until needed
def process(data, config):
    if config.needs_expensive:
        expensive = compute_expensive(data)  # only when needed
        use(expensive)
```

**Techniques:**
- Lazy evaluation for expensive operations
- Short-circuit evaluation in conditions
- Move loop-invariant code outside loops
- Specialize instead of using general-purpose libraries in hot paths
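The loop-invariant point can be sketched as follows (function names and the string-parsing workload are illustrative):

```python
# Sketch: hoist loop-invariant work out of the loop.
def scale_bad(values, factor_str):
    out = []
    for v in values:
        out.append(v * float(factor_str))  # re-parses the string every iteration
    return out

def scale_good(values, factor_str):
    factor = float(factor_str)             # parsed once, outside the loop
    return [v * factor for v in values]

assert scale_bad([1, 2], "2.5") == scale_good([1, 2], "2.5") == [2.5, 5.0]
```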

8. Help the Compiler/Runtime


Apply lower-level optimizations when profiling shows the need:

```rust
// Avoid function call overhead in hot loops
#[inline(always)]
fn hot_function(x: i32) -> i32 { x * 2 }

// Copy to a local variable for better alias analysis
fn process(data: &mut [i32], factor: &i32) {
    let f = *factor;  // compiler knows this won't change
    for x in data {
        *x *= f;
    }
}
```
Techniques:
  • Use raw pointers/indices instead of iterators in critical loops
  • Hand-unroll very hot loops (4+ iterations)
  • Move slow-path code to separate non-inlined functions
  • Avoid abstractions that hide costs in hot paths

9. Reduce Synchronization


Minimize lock contention and atomic operations:
  • Default to thread-compatible (external sync) not thread-safe
  • Sample statistics (1-in-32) instead of tracking everything
  • Batch updates to shared state
  • Use thread-local storage for per-thread data
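A minimal Python sketch of the last two points, batching updates to shared state via per-thread locals (the counter workload is illustrative):

```python
# Sketch: each thread accumulates into a local variable and merges once,
# instead of locking a shared counter on every increment.
import threading

total = 0
total_lock = threading.Lock()

def worker(n):
    global total
    local_count = 0          # thread-local: no synchronization needed
    for _ in range(n):
        local_count += 1     # hot loop touches only local state
    with total_lock:         # one lock acquisition per thread, not per item
        total += local_count

threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert total == 4000
```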

Profiling Workflow


  1. Identify hotspots: Use profiler (pprof, perf, py-spy)
  2. Measure baseline: Write benchmark before optimizing
  3. Estimate improvement: Calculate expected gain
  4. Implement change: Focus on one optimization at a time
  5. Verify improvement: Run benchmark, confirm gain
  6. Check for regressions: Ensure no other code paths slowed down
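Step 2 can be sketched with the standard-library `timeit` module (`candidate` is a stand-in workload):

```python
# Sketch: record a baseline before optimizing, so the later "verify
# improvement" step has a number to compare against.
import timeit

def candidate():
    return sum(i * i for i in range(1000))

baseline_s = timeit.timeit(candidate, number=1000)  # seconds for 1000 calls
per_call_us = baseline_s / 1000 * 1e6               # average microseconds/call
```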

When Profile is Flat (No Clear Hotspots)


  • Pursue multiple small 1-2% improvements collectively
  • Look for restructuring opportunities higher in call stack
  • Profile allocations specifically (often hidden cost)
  • Gather hardware performance counters (cache misses, branch mispredicts)
  • Consider if the code is already well-optimized
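The "profile allocations specifically" bullet can be sketched with Python's standard `tracemalloc` module (the workload is a stand-in):

```python
# Sketch: snapshot allocations around a workload and inspect the
# largest allocation sites by line number.
import tracemalloc

tracemalloc.start()
data = [str(i) * 10 for i in range(1000)]    # workload under inspection
snapshot = tracemalloc.take_snapshot()
tracemalloc.stop()

top = snapshot.statistics("lineno")[:3]      # biggest allocation sites
assert top and top[0].size > 0
```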

Anti-Patterns to Avoid


  1. Premature optimization in non-critical code
  2. Optimizing without measurement - changes based on intuition
  3. Micro-optimizing while ignoring algorithmic complexity
  4. Copying when moving would suffice
  5. Growing containers one element at a time (quadratic)
  6. Allocating in loops when reuse is possible
  7. String formatting in hot paths (use pre-built templates)
  8. Regex when simple string matching suffices

Estimation Template


Before optimizing, estimate:
Operation cost: ___ ns/us/ms
Frequency: ___ times per second/request
Total time: cost × frequency = ___
Improvement target: ___% reduction
Expected new time: ___
Is this worth it? [ ] Yes [ ] No
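A worked instance of the template, with hypothetical numbers:

```python
# Hypothetical numbers, purely to exercise the template:
cost_ns = 100                          # operation cost: one main-memory reference
frequency = 1_000_000                  # times per second
total_ns = cost_ns * frequency         # 100,000,000 ns = 100 ms per second
reduction = 0.50                       # improvement target: 50%
new_total_ms = total_ns * (1 - reduction) / 1e6
assert new_total_ms == 50.0            # frees 50 ms of every second: worth it
```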