python-performance-optimization


Python Performance Optimization

Expert guidance for profiling, optimizing, and accelerating Python applications through systematic analysis, algorithmic improvements, efficient data structures, and acceleration techniques.

When to Use This Skill

  • Code runs too slowly for production requirements
  • High CPU usage or memory consumption issues
  • Need to reduce API response times or batch processing duration
  • Application fails to scale under load
  • Optimizing data processing pipelines or scientific computing
  • Reducing cloud infrastructure costs through efficiency gains
  • Profile-guided optimization after measuring performance bottlenecks

Core Concepts

The Golden Rule: Never optimize without profiling first. Typically, 80% of execution time is spent in 20% of the code.
Optimization Hierarchy (in priority order):
  1. Algorithm complexity - O(n²) → O(n log n) yields gains that grow with input size
  2. Data structure choice - List → Set for lookups (up to ~10,000x faster on large collections)
  3. Language features - Comprehensions, built-ins, generators
  4. Caching - Memoization for repeated calculations
  5. Compiled extensions - NumPy, Numba, Cython for hot paths
  6. Parallelism - Multiprocessing for CPU-bound work
Key Principle: Algorithmic improvements almost always beat micro-optimizations.

Quick Reference

Load detailed guides for specific optimization areas:
| Task | Load reference |
| --- | --- |
| Profile code and find bottlenecks | skills/python-performance-optimization/references/profiling.md |
| Algorithm and data structure optimization | skills/python-performance-optimization/references/algorithms.md |
| Memory optimization and generators | skills/python-performance-optimization/references/memory.md |
| String concatenation and file I/O | skills/python-performance-optimization/references/string-io.md |
| NumPy, Numba, Cython, multiprocessing | skills/python-performance-optimization/references/acceleration.md |

Optimization Workflow

Phase 1: Measure

  1. Profile with cProfile - Identify slow functions
  2. Line profile hot paths - Find exact slow lines
  3. Memory profile - Check for memory bottlenecks
  4. Benchmark baseline - Record current performance
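The first two steps can be sketched with the standard library's `cProfile` and `pstats` modules; `slow_sum` below is a made-up stand-in for whatever hot path your application has:

```python
import cProfile
import io
import pstats

def slow_sum(n):
    # Deliberately naive Python-level loop so it shows up in the profile
    total = 0
    for i in range(n):
        total += i * i
    return total

profiler = cProfile.Profile()
profiler.enable()
slow_sum(100_000)
profiler.disable()

# Sort by cumulative time and print the top offenders
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

For line-level detail on the functions this surfaces, the third-party `line_profiler` package is a common next step.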

Phase 2: Analyze

  1. Check algorithm complexity - Is it O(n²) or worse?
  2. Evaluate data structures - Are you using lists for lookups?
  3. Identify repeated work - Can results be cached?
  4. Find I/O bottlenecks - Database queries, file operations
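Step 2 can be checked empirically with the standard-library `timeit` module; a minimal sketch (the collection size and repeat count are arbitrary illustrations):

```python
import timeit

data_list = list(range(100_000))
data_set = set(data_list)
missing = -1  # worst case for the list: every element is scanned

# Membership test on a list is O(n); on a set it is O(1)
list_time = timeit.timeit(lambda: missing in data_list, number=100)
set_time = timeit.timeit(lambda: missing in data_set, number=100)

print(f"list: {list_time:.4f}s, set: {set_time:.6f}s")
```

If the gap is large here, the fix belongs in Phase 3, step 2.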

Phase 3: Optimize

  1. Improve algorithms first - Biggest impact
  2. Use appropriate data structures - Set/dict for O(1) lookups
  3. Apply caching - `@lru_cache` for expensive functions
  4. Use generators - For large datasets
  5. Leverage NumPy/Numba - For numerical code
  6. Parallelize - Multiprocessing for CPU-bound tasks

Phase 4: Validate

  1. Re-profile - Verify improvements
  2. Benchmark - Measure speedup quantitatively
  3. Test correctness - Ensure optimizations didn't break functionality
  4. Document - Explain why optimization was needed
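Steps 2 and 3 can be combined in one small sketch; `slow_sum` and `fast_sum` are hypothetical before/after versions of an optimized function:

```python
import timeit

def slow_sum(n):
    # Original: Python-level loop
    total = 0
    for i in range(n):
        total += i
    return total

def fast_sum(n):
    # Optimized: closed-form formula for 0 + 1 + ... + (n - 1)
    return n * (n - 1) // 2

# Test correctness first: the optimization must preserve behavior
assert slow_sum(10_000) == fast_sum(10_000)

# Then benchmark to measure the speedup quantitatively
slow = timeit.timeit(lambda: slow_sum(10_000), number=200)
fast = timeit.timeit(lambda: fast_sum(10_000), number=200)
print(f"speedup: {slow / fast:.0f}x")
```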

Common Optimization Patterns

Pattern 1: Replace List with Set for Lookups

```python
large_list = list(range(100_000))
large_set = set(large_list)
item = 99_999

# Slow: O(n) lookup - scans the list element by element
if item in large_list:  # Bad
    pass

# Fast: O(1) lookup - a single hash probe
if item in large_set:  # Good
    pass
```

Pattern 2: Use Comprehensions

```python
n = 1_000_000

# Slower: explicit loop with repeated append calls
result = []
for i in range(n):
    result.append(i * 2)

# Faster (~35% speedup): list comprehension
result = [i * 2 for i in range(n)]
```

Pattern 3: Cache Expensive Calculations

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def expensive_function(n):
    # Result cached automatically
    return complex_calculation(n)
```

Pattern 4: Use Generators for Large Data

```python
# Memory inefficient - loads the entire file into a list
def read_file(path):
    return [line for line in open(path)]

# Memory efficient - streams line by line via a generator
def read_file(path):
    for line in open(path):
        yield line.strip()
```

Pattern 5: Vectorize with NumPy

```python
# Pure Python: ~500ms
result = sum(i**2 for i in range(1_000_000))

# NumPy: ~5ms (about 100x faster)
import numpy as np
result = np.sum(np.arange(1_000_000) ** 2)
```

Common Mistakes to Avoid

  1. Optimizing before profiling - You'll optimize the wrong code
  2. Using lists for membership tests - Use sets/dicts instead
  3. String concatenation in loops - Use `"".join()` or `StringIO`
  4. Loading entire files into memory - Use generators
  5. N+1 database queries - Use JOINs or batch queries
  6. Ignoring built-in functions - They're C-optimized and fast
  7. Premature optimization - Focus on algorithmic improvements first
  8. Not benchmarking - Always measure improvements quantitatively
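Mistake 3 can be sketched as follows (sizes are arbitrary): repeated `+=` on a string copies the accumulated result each time, while `"".join()` builds the output in a single pass:

```python
parts = [str(i) for i in range(10_000)]

# Quadratic-time pattern: each += may copy the growing string
s = ""
for p in parts:
    s += p

# Linear-time pattern: join builds the result in one pass
joined = "".join(parts)

assert s == joined
```

(CPython sometimes optimizes `+=` on strings in place, but the `join` form is the portable, guaranteed-linear idiom.)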

Decision Tree

Start here: Profile with cProfile to find bottlenecks
Hot path is algorithm?
  • Yes → Check complexity, improve algorithm, use better data structures
  • No → Continue
Hot path is computation?
  • Numerical loops → NumPy or Numba
  • CPU-bound → Multiprocessing
  • Already fast enough → Done
Hot path is memory?
  • Large data → Generators, streaming
  • Many objects → `__slots__`, object pooling
  • Caching needed → `@lru_cache` or custom cache
Hot path is I/O?
  • Database → Batch queries, indexes, connection pooling
  • Files → Buffering, streaming
  • Network → Async I/O, request batching
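The `__slots__` branch of the tree can be sketched like this: a slotted class stores attributes in fixed slots instead of a per-instance `__dict__`, which cuts per-object memory when many small objects are created (the `Point` classes are illustrative):

```python
import sys

class PointDict:
    def __init__(self, x, y):
        self.x, self.y = x, y

class PointSlots:
    __slots__ = ("x", "y")  # fixed slots replace the per-instance __dict__
    def __init__(self, x, y):
        self.x, self.y = x, y

# The slotted instance has no attribute dict - that dict is where the memory goes
assert not hasattr(PointSlots(1, 2), "__dict__")
assert hasattr(PointDict(1, 2), "__dict__")
print(sys.getsizeof(PointDict(1, 2).__dict__), "bytes just for the attribute dict")
```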

Best Practices

  1. Profile before optimizing - Measure to find real bottlenecks
  2. Optimize algorithms first - O(n²) → O(n) beats micro-optimizations
  3. Use appropriate data structures - Set/dict for lookups, not lists
  4. Leverage built-ins - C-implemented built-ins are faster than pure Python
  5. Avoid premature optimization - Optimize hot paths identified by profiling
  6. Use generators for large data - Reduce memory usage with lazy evaluation
  7. Batch operations - Minimize overhead from syscalls and network requests
  8. Cache expensive computations - Use `@lru_cache` or custom caching
  9. Consider NumPy/Numba - Vectorization and JIT for numerical code
  10. Parallelize CPU-bound work - Use multiprocessing to utilize all cores

Resources
