Python Performance Profiling
When NOT to Use This Skill
- Java/JVM profiling - use a JVM-specific profiling skill (JFR, GC tuning)
- Node.js profiling - use a Node.js-specific profiling skill (V8 profiler)
- NumPy/Pandas optimization - Use library-specific profiling tools and vectorization guides
- Database query optimization - Use database-specific profiling tools
- Web server performance - Use application-level profiling (Django Debug Toolbar, Flask-DebugToolbar)
Deep Knowledge: Use `mcp__documentation__fetch_docs` with the appropriate `technology:` value for comprehensive profiling guides, optimization techniques, and best practices.
cProfile (CPU Profiling)
```bash
# Profile entire script
python -m cProfile -o output.prof script.py

# Sort by cumulative time
python -m cProfile -s cumtime script.py

# Sort by total time in function
python -m cProfile -s tottime script.py

# Analyze a saved profile interactively
python -m pstats output.prof
```
```python
import pstats

# Load and analyze a saved profile
stats = pstats.Stats('output.prof')
stats.strip_dirs()
stats.sort_stats('cumulative')
stats.print_stats(20)  # Top 20 functions

# Filter by module
stats.print_stats('mymodule')

# Who calls / is called by a function
stats.print_callers('slow_function')
stats.print_callees('main')
```
Programmatic Profiling
```python
import cProfile
import pstats
from io import StringIO

def profile_function(func, *args, **kwargs):
    profiler = cProfile.Profile()
    profiler.enable()
    result = func(*args, **kwargs)
    profiler.disable()

    # Analyze
    stream = StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats('cumulative')
    stats.print_stats(10)
    print(stream.getvalue())
    return result
```
Context manager

```python
import cProfile
from contextlib import contextmanager

@contextmanager
def profile_block(name='profile'):
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        yield
    finally:
        profiler.disable()
        profiler.dump_stats(f'{name}.prof')

# Usage:
# with profile_block('hot_path'):
#     ...  # code to profile
```
tracemalloc (Built-in)
```python
import tracemalloc

# Start tracking
tracemalloc.start()

# ... your code here ...

snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics('lineno')

print("Top 10 memory allocations:")
for stat in top_stats[:10]:
    print(stat)
```
Compare snapshots

```python
snapshot1 = tracemalloc.take_snapshot()

# ... code under investigation ...

snapshot2 = tracemalloc.take_snapshot()
diff = snapshot2.compare_to(snapshot1, 'lineno')
for stat in diff[:10]:
    print(stat)

# Stop tracking
tracemalloc.stop()
```
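Putting the pieces together, a minimal self-contained run might look like this (the ~1 MB list of small `bytes` objects is just a stand-in for a suspected leak):

```python
import tracemalloc

tracemalloc.start()
before = tracemalloc.take_snapshot()

# Stand-in for the code under investigation: allocate ~1 MB of small objects
leak = [bytes(1024) for _ in range(1000)]

after = tracemalloc.take_snapshot()
diff = after.compare_to(before, 'lineno')
for stat in diff[:3]:
    print(stat)  # The allocation line above should dominate the diff

current, peak = tracemalloc.get_traced_memory()
print(f"current={current} B, peak={peak} B")
tracemalloc.stop()
```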
memory_profiler (Line-by-line)
Install: pip install memory_profiler
```python
from memory_profiler import profile

@profile
def my_function():
    a = [1] * 1_000_000
    b = [2] * 2_000_000
    del b
    return a
```
```bash
# Line-by-line output for @profile-decorated functions
python -m memory_profiler script.py

# Record memory usage over time, then plot it
mprof run script.py
mprof plot
```
objgraph (Object References)
Install: pip install objgraph
```python
import objgraph

# Most common types
objgraph.show_most_common_types(limit=20)

# Growth since last call
objgraph.show_growth()

# Find reference chains (memory leak detection)
objgraph.show_backrefs([leaked_object], filename='refs.png')
```
Line Profiler
Install: pip install line_profiler

Decorate functions to profile:

```python
# kernprof injects `profile` into builtins, so no import is needed
@profile
def slow_function():
    total = 0
    for i in range(1000000):
        total += i
    return total
```

Run with: kernprof -l -v script.py
High-Resolution Timing
```python
import time

# Monotonic clock (best for measuring durations)
start = time.perf_counter()
result = do_work()
duration = time.perf_counter() - start
print(f"Duration: {duration:.4f}s")
```
```python
# Nanosecond precision (Python 3.7+)
start = time.perf_counter_ns()
result = do_work()
duration_ns = time.perf_counter_ns() - start
print(f"Duration: {duration_ns}ns")
```
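These calls are easy to package into a reusable decorator; a sketch (the `timed` name is our choice, not a standard-library API):

```python
import functools
import time

def timed(func):
    """Print the wall-clock duration of each call using a monotonic clock."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            duration = time.perf_counter() - start
            print(f"{func.__name__}: {duration:.4f}s")
    return wrapper

@timed
def busy():
    return sum(i * i for i in range(100_000))

busy()
```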
```python
import timeit

# Time small code snippets
duration = timeit.timeit('sum(range(1000))', number=10000)
print(f"Average: {duration / 10000:.6f}s")
```
Compare implementations

```python
setup = "data = list(range(10000))"
time1 = timeit.timeit('sum(data)', setup, number=1000)
time2 = timeit.timeit('sum(x for x in data)', setup, number=1000)
print(f"sum(): {time1:.4f}s, generator: {time2:.4f}s")
```
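On a noisy machine a single timeit run can mislead; `timeit.repeat` combined with `min()` gives a more stable lower bound, since the fastest run is the least-interfered-with measurement:

```python
import timeit

# Run the benchmark 5 times and keep the fastest run
times = timeit.repeat('sum(range(1000))', number=10_000, repeat=5)
best = min(times)
print(f"Best of 5: {best / 10_000:.8f}s per call")
```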
Common Bottleneck Patterns
❌ Bad: Concatenating lists in loop

```python
result = []
for item in items:
    result = result + [process(item)]  # O(n²)
```

✅ Good: Use append

```python
result = []
for item in items:
    result.append(process(item))  # O(n)
```

✅ Better: List comprehension

```python
result = [process(item) for item in items]
```
❌ Bad: Checking membership in list

```python
if item in large_list:  # O(n)
    pass
```

✅ Good: Use set for membership

```python
large_set = set(large_list)
if item in large_set:  # O(1)
    pass
```
❌ Bad: String concatenation in loop

```python
result = ""
for s in strings:
    result += s  # Creates a new string each time
```

✅ Good: Use join

```python
result = "".join(strings)
```
❌ Bad: Format in loop

```python
for item in items:
    log(f"Processing {item}")
```

✅ Good: Lazy formatting

```python
import logging

for item in items:
    logging.debug("Processing %s", item)  # Only formats if the record is emitted
```
Dictionary Operations
❌ Bad: Repeated key lookup

```python
if key in d:
    value = d[key]
    process(value)
```

✅ Good: Use get or setdefault

```python
value = d.get(key)
if value is not None:
    process(value)
```
❌ Bad: Checking then setting

```python
if key not in d:
    d[key] = []
d[key].append(value)
```

✅ Good: Use defaultdict

```python
from collections import defaultdict

d = defaultdict(list)
d[key].append(value)
```
Generator vs List
❌ Bad: Creating large intermediate lists

```python
result = sum([x * 2 for x in range(10_000_000)])  # Builds the full list in memory
```

✅ Good: Use generator

```python
result = sum(x * 2 for x in range(10_000_000))  # Lazy evaluation
```
Process large files

```python
# ❌ Bad: reads the whole file into memory
data = open('large.csv').readlines()
for line in data:
    process(line)

# ✅ Good: stream line by line
with open('large.csv') as f:
    for line in f:
        process(line)
```
NumPy Optimization
❌ Bad: Python loops over arrays

```python
result = []
for i in range(len(arr)):
    result.append(arr[i] * 2)
```

✅ Good: Vectorized operations

```python
result = arr * 2  # Single vectorized (SIMD-capable) operation
```
❌ Bad: Creating many temporary arrays

```python
result = (arr1 + arr2) * arr3 / arr4  # Allocates intermediate temporaries
```

✅ Good: In-place operations when possible

```python
result = arr1.copy()
result += arr2
result *= arr3
result /= arr4
```

Use appropriate dtypes

```python
arr = np.array(data, dtype=np.float32)  # Half the memory of float64
```
Async Optimization
❌ Bad: Sequential async

```python
import asyncio
import aiohttp

async def fetch_all_sequential(urls):
    results = []
    async with aiohttp.ClientSession() as session:
        for url in urls:
            async with session.get(url) as resp:
                results.append(await resp.text())
    return results
```
✅ Good: Concurrent async

```python
async def fetch_all_concurrent(urls):
    async with aiohttp.ClientSession() as session:
        tasks = [session.get(url) for url in urls]
        responses = await asyncio.gather(*tasks)
        return [await r.text() for r in responses]
```
✅ Better: With concurrency limit

```python
from asyncio import Semaphore

async def fetch_with_limit(urls, limit=10):
    semaphore = Semaphore(limit)
    # One shared session so connections are pooled across requests
    async with aiohttp.ClientSession() as session:

        async def fetch_one(url):
            async with semaphore:
                async with session.get(url) as resp:
                    return await resp.text()

        return await asyncio.gather(*(fetch_one(url) for url in urls))
```
Multiprocessing

CPU-bound work

```python
from multiprocessing import Pool, cpu_count

def cpu_intensive(x):
    return sum(i * i for i in range(x))

with Pool(cpu_count()) as pool:
    results = pool.map(cpu_intensive, range(100))
```
Using ProcessPoolExecutor

```python
from concurrent.futures import ProcessPoolExecutor

with ProcessPoolExecutor() as executor:
    results = list(executor.map(cpu_intensive, range(100)))
```
Shared memory (Python 3.8+)

```python
from multiprocessing import shared_memory
import numpy as np

# Create a shared copy of an existing NumPy array `arr`
shm = shared_memory.SharedMemory(create=True, size=arr.nbytes)
shared_arr = np.ndarray(arr.shape, dtype=arr.dtype, buffer=shm.buf)
shared_arr[:] = arr[:]

# Clean up: every process calls shm.close(); the creator also calls shm.unlink()
```
Profiling Checklist
| Check | Tool | Command |
|---|---|---|
| CPU hotspots | cProfile | `python -m cProfile script.py` |
| Line-by-line | line_profiler | `kernprof -l -v script.py` |
| Memory usage | tracemalloc | `tracemalloc.start()` in code |
| Memory per line | memory_profiler | `@profile` decorator |
| Object references | objgraph | `objgraph.show_most_common_types()` |
| Quick benchmarks | timeit | `python -m timeit` |
py-spy (Sampling Profiler)
Install: pip install py-spy

```bash
# Record a flame graph of a new process
py-spy record -o profile.svg -- python script.py

# Top-like live view of a running process
py-spy top --pid <PID>

# Dump the current call stacks of a running process
py-spy dump --pid <PID>

# Profile subprocesses as well
py-spy record --subprocesses -o profile.svg -- python script.py
```
Production Optimization
Use __slots__ for memory efficiency

```python
class Point:
    __slots__ = ['x', 'y']

    def __init__(self, x, y):
        self.x = x
        self.y = y
```
Use lru_cache for memoization

```python
from functools import lru_cache

@lru_cache(maxsize=1000)
def expensive_computation(x):
    return x ** 2
```
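`lru_cache` also exposes hit/miss counters via `cache_info()`, a quick way to confirm the cache is actually being used:

```python
from functools import lru_cache

@lru_cache(maxsize=1000)
def expensive_computation(x):
    return x ** 2

expensive_computation(3)      # First call: a miss, result is computed
expensive_computation(3)      # Second call: served from the cache

info = expensive_computation.cache_info()
print(info)  # hits=1, misses=1 after the two calls above
```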
Use dataclasses with slots (Python 3.10+)

```python
from dataclasses import dataclass

@dataclass(slots=True)
class Point:
    x: float
    y: float
```
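The savings come from dropping the per-instance `__dict__`; a quick way to see the difference (the class names here are illustrative):

```python
class PlainPoint:
    def __init__(self, x, y):
        self.x = x
        self.y = y

class SlottedPoint:
    __slots__ = ('x', 'y')

    def __init__(self, x, y):
        self.x = x
        self.y = y

plain = PlainPoint(1.0, 2.0)
slotted = SlottedPoint(1.0, 2.0)

print(hasattr(plain, '__dict__'))    # True — attribute dict allocated per instance
print(hasattr(slotted, '__dict__'))  # False — attributes live in fixed slots
```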
| Anti-Pattern | Why It's Wrong | Correct Approach |
|---|---|---|
| Using `+=` to concatenate strings in a loop | O(n²) time complexity | Use `"".join()` or a list comprehension |
| List comprehension when a generator suffices | Unnecessary memory allocation | Use a generator expression for one-time iteration |
| `range(len(seq))` when `enumerate` is needed | Manual index tracking, error-prone | Use `enumerate` for index and value |
| Checking membership in a list | O(n) lookup | Use a `set` for O(1) membership testing |
| Global variables everywhere | Hard to profile, side effects | Pass parameters, return values |
| Not using NumPy for numerical work | Orders of magnitude slower | Vectorize with NumPy for array operations |
| Premature optimization | Wasted effort, harder to maintain | Profile first, optimize bottlenecks |
| Using `from module import *` | Namespace pollution, slower imports | Import specific names |
| `append` in a loop when the size is known | Multiple reallocations | Use a list comprehension or pre-allocate |
| Not using `__slots__` for many instances | Higher memory usage | Use `__slots__` for classes with many instances |
Quick Troubleshooting
| Issue | Diagnosis | Solution |
|---|---|---|
| Slow loops over large data | Python loops are slow | Vectorize with NumPy, use list comprehensions |
| High memory usage | Creating large intermediate objects | Use generators, process in chunks |
| GIL contention | Multi-threading doesn't speed up CPU work | Use `multiprocessing` for CPU-bound tasks |
| Slow imports | Large modules with side effects | Lazy import, reduce module-level code |
| Memory leak | Objects not being garbage collected | Check for circular references, inspect with objgraph |
| Recursion too deep | `RecursionError` raised | Increase the limit with `sys.setrecursionlimit()` or refactor to iteration |
| Slow dictionary operations | Hash collisions | Ensure keys are hashable and well-distributed |
| High CPU in profiler | C extensions not showing | Use a sampling profiler like py-spy |
| Out of memory with large file | Loading the entire file | Open the file and iterate line by line |
| Slow JSON parsing | Large JSON file | Use a streaming parser (ijson) or pandas |
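For the recursion row, "refactor to iteration" often just means replacing the call stack with an explicit loop; a sketch using factorial as a stand-in for any deep recursion:

```python
import sys

def factorial_recursive(n):
    # Fails with RecursionError once n approaches the interpreter's limit
    return 1 if n <= 1 else n * factorial_recursive(n - 1)

def factorial_iterative(n):
    # Constant stack depth regardless of n
    result = 1
    for i in range(2, n + 1):
        result *= i
    return result

limit = sys.getrecursionlimit()  # Typically 1000 by default
print(factorial_iterative(5))    # 120
```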
- FastAPI
- Django
- NumPy/Pandas