
Performance Optimization Skill

Apply these principles when optimizing code for performance. Focus on the critical 3% of code where performance truly matters; in engineering, a 12% improvement is never marginal.

Core Philosophy


  1. Measure First: Never optimize without profiling data
  2. Estimate Costs: Back-of-envelope calculations before implementation
  3. Avoid Work: The fastest code is code that doesn't run
  4. Reduce Allocations: Memory allocation is often the hidden bottleneck
  5. Cache Locality: Memory access patterns dominate modern performance

Reference Latency Numbers


Use these for back-of-envelope calculations:

| Operation | Latency |
|---|---|
| L1 cache reference | 0.5 ns |
| Branch mispredict | 5 ns |
| L2 cache reference | 7 ns |
| Mutex lock/unlock | 25 ns |
| Main memory reference | 100 ns |
| Compress 1 KB (Snappy) | 3 us |
| SSD random read (4 KB) | 20 us |
| Round trip in datacenter | 50 us |
| Disk seek | 5 ms |

Optimization Techniques (Priority Order)


1. Algorithmic Improvements (Highest Impact)


Always check algorithm complexity first:
O(N^2) → O(N log N)  = ~50,000x faster for N=1M (10^12 vs ~2×10^7 operations)
O(N)   → O(1)        = unbounded improvement
Common patterns:
  • Replace nested loops with hash lookups
  • Replace sorted-list intersection with hash-table lookups
  • Process in reverse post-order to eliminate per-element checks
  • Replace interval trees (O(log N)) with hash maps (O(1)) when range queries aren't needed
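The first pattern above can be sketched in Python (function names are illustrative, not from the original):

```python
# Sketch: replace an O(N*M) nested-loop intersection with O(N+M) hash lookups.
def common_ids_slow(xs, ys):
    # Nested loops: every x is compared against every y
    return [x for x in xs for y in ys if x == y]   # O(N*M)

def common_ids_fast(xs, ys):
    ys_set = set(ys)                               # O(M) build, O(1) lookups
    return [x for x in xs if x in ys_set]          # O(N)

assert common_ids_slow([1, 2, 3], [2, 3, 4]) == common_ids_fast([1, 2, 3], [2, 3, 4]) == [2, 3]
```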

2. Reduce Memory Allocations


Allocation is expensive (~25-100 ns per allocation, plus GC pressure):

```python
# BAD: allocates on every call
def process(items):
    result = []  # new allocation
    for item in items:
        result.append(transform(item))
    return result

# GOOD: pre-allocate, or let the caller pass a reusable buffer
def process(items, out=None):
    if out is None:
        out = [None] * len(items)
    for i, item in enumerate(items):
        out[i] = transform(item)
    return out
```

**Techniques:**
- Pre-size containers with `reserve()` or known capacity
- Hoist temporary containers outside loops
- Reuse buffers across iterations (clear instead of recreate)
- Move instead of copy large structures
- Use stack allocation for bounded-lifetime objects
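The buffer-reuse point can be sketched as follows (`batches_sum` is an illustrative name, not from the original):

```python
# Sketch: hoist a temporary buffer out of the loop and clear it per
# iteration, instead of allocating a fresh list for every batch.
def batches_sum(batches):
    buf = []                  # hoisted: one allocation for all batches
    totals = []
    for batch in batches:
        buf.clear()           # reuse: clear instead of recreate
        for x in batch:
            buf.append(x * 2)
        totals.append(sum(buf))
    return totals
```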

3. Compact Data Structures


Minimize memory footprint and the number of cache lines touched:

```rust
// BAD: 24 bytes due to padding (repr(C) preserves declared field order;
// by default Rust may reorder fields itself)
#[repr(C)]
struct Item {
    flag: bool,      // 1 byte + 7 padding
    value: i64,      // 8 bytes
    count: i32,      // 4 bytes + 4 padding
}

// GOOD: 16 bytes with reordering (largest fields first)
#[repr(C)]
struct ItemReordered {
    value: i64,      // 8 bytes
    count: i32,      // 4 bytes
    flag: bool,      // 1 byte + 3 padding
}
```
Techniques:
  • Reorder struct fields by size (largest first)
  • Use smaller numeric types when range permits (u8 vs i32)
  • Replace 64-bit pointers with 32-bit indices
  • Keep hot fields together, move cold fields to separate struct
  • Use bit vectors for small-domain sets
  • Flatten nested maps:
    Map<A, Map<B, C>> → Map<(A,B), C>
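The map-flattening pattern can be sketched in Python with tuple keys (the data here is illustrative):

```python
# Sketch: flatten Map<A, Map<B, C>> into Map<(A, B), C>, so each access
# is a single hash lookup instead of two, with no per-A inner dict.
nested = {"a": {1: "x", 2: "y"}, "b": {1: "z"}}

flat = {(a, b): c for a, inner in nested.items() for b, c in inner.items()}

assert flat[("a", 2)] == nested["a"][2] == "y"
```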

4. Fast Paths for Common Cases


Optimize the common case without hurting rare cases:

```python
# BAD: always takes the slow, generic path
def parse_varint(data):
    return generic_varint_parser(data)

# GOOD: fast path for the common 1-byte case
def parse_varint(data):
    if data[0] < 128:  # single byte - 90% of cases
        return data[0], 1
    return generic_varint_parser(data)  # rare multi-byte case
```

**Techniques:**
- Handle common dimensions inline (1-D, 2-D tensors)
- Check for empty/trivial inputs early
- Specialize for common sizes (small strings, few elements)
- Process trailing elements separately to avoid slow generic code

5. Precompute Expensive Information


Trade memory for compute when beneficial:

```python
# BAD: recomputes on every access
def is_vowel(char):
    return char.lower() in 'aeiou'

# GOOD: precomputed lookup table, indexed by character code
VOWEL_TABLE = [chr(i).lower() in 'aeiou' for i in range(256)]

def is_vowel(char):
    return VOWEL_TABLE[ord(char)]
```

**Techniques:**
- Precompute flags/properties at construction time
- Build lookup tables for character classification
- Cache expensive computed properties
- Validate at boundaries once, not repeatedly inside
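The "cache expensive computed properties" point can be sketched with the standard library's `functools.cached_property` (the `Document` class is an illustrative example, not from the original):

```python
from functools import cached_property

class Document:
    def __init__(self, text):
        self.text = text

    @cached_property
    def word_count(self):
        # Computed on first access, then stored on the instance;
        # subsequent accesses are a plain attribute read.
        return len(self.text.split())
```

Note that `cached_property` is only safe when the underlying data is effectively immutable after construction.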

6. Bulk/Batch APIs


Amortize fixed costs across multiple operations:

```python
# BAD: N round trips
for item in items:
    result = db.lookup(item)

# GOOD: 1 round trip
results = db.lookup_many(items)
```

**Design APIs that support:**
- Batch lookups instead of individual fetches
- Vectorized operations over loops
- Streaming interfaces for large datasets
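The amortization effect can be made concrete with a toy in-memory store (`FakeDB` and its fixed per-call cost are hypothetical, purely for illustration):

```python
# Sketch: a fixed per-call cost is paid once per batch, not once per item.
class FakeDB:
    ROUND_TRIP_COST = 1  # fixed cost per call, arbitrary units

    def __init__(self, table):
        self.table = table
        self.cost = 0

    def lookup(self, key):
        self.cost += self.ROUND_TRIP_COST      # one round trip per item
        return self.table[key]

    def lookup_many(self, keys):
        self.cost += self.ROUND_TRIP_COST      # one round trip for the batch
        return [self.table[k] for k in keys]

db = FakeDB({"a": 1, "b": 2, "c": 3})
_ = [db.lookup(k) for k in "abc"]              # pays 3 round trips
per_item_cost = db.cost

db.cost = 0
_ = db.lookup_many(list("abc"))                # pays 1 round trip
assert db.cost < per_item_cost
```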

7. Avoid Unnecessary Work


```python
# BAD: always computes the expensive value
def process(data, config):
    expensive = compute_expensive(data)  # always runs
    if config.needs_expensive:
        use(expensive)

# GOOD: defer until needed
def process(data, config):
    if config.needs_expensive:
        expensive = compute_expensive(data)  # only when needed
        use(expensive)
```

**Techniques:**
- Lazy evaluation for expensive operations
- Short-circuit evaluation in conditions
- Move loop-invariant code outside loops
- Specialize instead of using general-purpose libraries in hot paths
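The loop-invariant point can be sketched as follows (function names and the string-parsing workload are illustrative):

```python
# Sketch: hoist loop-invariant work out of the loop.
def scale_bad(values, factor_str):
    out = []
    for v in values:
        out.append(v * float(factor_str))  # re-parses the string every iteration
    return out

def scale_good(values, factor_str):
    factor = float(factor_str)             # parsed once, outside the loop
    return [v * factor for v in values]

assert scale_bad([1, 2], "2.5") == scale_good([1, 2], "2.5") == [2.5, 5.0]
```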

8. Help the Compiler/Runtime


Apply lower-level optimizations when profiling shows the need:

```rust
// Avoid function call overhead in hot loops
#[inline(always)]
fn hot_function(x: i32) -> i32 { x * 2 }

// Copy to a local variable for better alias analysis
fn process(data: &mut [i32], factor: &i32) {
    let f = *factor;  // compiler knows this won't change
    for x in data {
        *x *= f;
    }
}
```
Techniques:
  • Use raw pointers/indices instead of iterators in critical loops
  • Hand-unroll very hot loops (4+ iterations)
  • Move slow-path code to separate non-inlined functions
  • Avoid abstractions that hide costs in hot paths

9. Reduce Synchronization


Minimize lock contention and atomic operations:
  • Default to thread-compatible (external sync) not thread-safe
  • Sample statistics (1-in-32) instead of tracking everything
  • Batch updates to shared state
  • Use thread-local storage for per-thread data
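A minimal Python sketch of the last two points, batching updates to shared state via per-thread locals (the counter workload is illustrative):

```python
# Sketch: each thread accumulates into a local variable and merges once,
# instead of locking a shared counter on every increment.
import threading

total = 0
total_lock = threading.Lock()

def worker(n):
    global total
    local_count = 0          # thread-local: no synchronization needed
    for _ in range(n):
        local_count += 1     # hot loop touches only local state
    with total_lock:         # one lock acquisition per thread, not per item
        total += local_count

threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert total == 4000
```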

Profiling Workflow


  1. Identify hotspots: Use profiler (pprof, perf, py-spy)
  2. Measure baseline: Write benchmark before optimizing
  3. Estimate improvement: Calculate expected gain
  4. Implement change: Focus on one optimization at a time
  5. Verify improvement: Run benchmark, confirm gain
  6. Check for regressions: Ensure no other code paths slowed down
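Step 2 can be sketched with the standard-library `timeit` module (`candidate` is a stand-in workload):

```python
# Sketch: record a baseline before optimizing, so the later "verify
# improvement" step has a number to compare against.
import timeit

def candidate():
    return sum(i * i for i in range(1000))

baseline_s = timeit.timeit(candidate, number=1000)  # seconds for 1000 calls
per_call_us = baseline_s / 1000 * 1e6               # average microseconds/call
```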

When Profile is Flat (No Clear Hotspots)


  • Pursue multiple small 1-2% improvements collectively
  • Look for restructuring opportunities higher in call stack
  • Profile allocations specifically (often hidden cost)
  • Gather hardware performance counters (cache misses, branch mispredicts)
  • Consider if the code is already well-optimized
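The "profile allocations specifically" bullet can be sketched with Python's standard `tracemalloc` module (the workload is a stand-in):

```python
# Sketch: snapshot allocations around a workload and inspect the
# largest allocation sites by line number.
import tracemalloc

tracemalloc.start()
data = [str(i) * 10 for i in range(1000)]    # workload under inspection
snapshot = tracemalloc.take_snapshot()
tracemalloc.stop()

top = snapshot.statistics("lineno")[:3]      # biggest allocation sites
assert top and top[0].size > 0
```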

Anti-Patterns to Avoid


  1. Premature optimization in non-critical code
  2. Optimizing without measurement - changes based on intuition
  3. Micro-optimizing while ignoring algorithmic complexity
  4. Copying when moving would suffice
  5. Growing containers one element at a time (quadratic)
  6. Allocating in loops when reuse is possible
  7. String formatting in hot paths (use pre-built templates)
  8. Regex when simple string matching suffices

Estimation Template


Before optimizing, estimate:
Operation cost: ___ ns/us/ms
Frequency: ___ times per second/request
Total time: cost × frequency = ___
Improvement target: ___% reduction
Expected new time: ___
Is this worth it? [ ] Yes [ ] No
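A worked instance of the template, with hypothetical numbers:

```python
# Hypothetical numbers, purely to exercise the template:
cost_ns = 100                          # operation cost: one main-memory reference
frequency = 1_000_000                  # times per second
total_ns = cost_ns * frequency         # 100,000,000 ns = 100 ms per second
reduction = 0.50                       # improvement target: 50%
new_total_ms = total_ns * (1 - reduction) / 1e6
assert new_total_ms == 50.0            # frees 50 ms of every second: worth it
```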