Python Performance Profiling
When NOT to Use This Skill
- Java/JVM profiling - use a JVM-specific profiling skill (JFR, GC tuning)
- Node.js profiling - use a Node.js-specific profiling skill (V8 profiler)
- NumPy/Pandas optimization - Use library-specific profiling tools and vectorization guides
- Database query optimization - Use database-specific profiling tools
- Web server performance - Use application-level profiling (Django Debug Toolbar, Flask-DebugToolbar)
Deep Knowledge: Use `mcp__documentation__fetch_docs` with the appropriate `technology:` value for comprehensive profiling guides, optimization techniques, and best practices.
cProfile (CPU Profiling)
```bash
# Profile entire script
python -m cProfile -o output.prof script.py

# Sort by cumulative time
python -m cProfile -s cumtime script.py

# Sort by total time in function
python -m cProfile -s tottime script.py

# Analyze a saved profile interactively
python -m pstats output.prof
```
```python
import pstats

# Load and analyze a saved profile
stats = pstats.Stats('output.prof')
stats.strip_dirs()
stats.sort_stats('cumulative')
stats.print_stats(20)  # Top 20 functions

# Filter by module
stats.print_stats('mymodule')

# Who calls / is called by a function
stats.print_callers('slow_function')
stats.print_callees('main')
```
Programmatic Profiling
```python
import cProfile
import pstats
from io import StringIO

def profile_function(func, *args, **kwargs):
    profiler = cProfile.Profile()
    profiler.enable()
    result = func(*args, **kwargs)
    profiler.disable()

    # Analyze
    stream = StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats('cumulative')
    stats.print_stats(10)
    print(stream.getvalue())
    return result
```
Context manager

```python
import cProfile
from contextlib import contextmanager

@contextmanager
def profile_block(name='profile'):
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        yield
    finally:
        profiler.disable()
        profiler.dump_stats(f'{name}.prof')

# Usage:
# with profile_block('hot_path'):
#     ...  # code to profile
```
tracemalloc (Built-in)
```python
import tracemalloc

# Start tracking
tracemalloc.start()

# ... your code here ...

snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics('lineno')

print("Top 10 memory allocations:")
for stat in top_stats[:10]:
    print(stat)
```
Compare snapshots

```python
snapshot1 = tracemalloc.take_snapshot()

# ... code under investigation ...

snapshot2 = tracemalloc.take_snapshot()
diff = snapshot2.compare_to(snapshot1, 'lineno')
for stat in diff[:10]:
    print(stat)

# Stop tracking
tracemalloc.stop()
```
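Putting the pieces together, a minimal self-contained run might look like this (the ~1 MB list of small `bytes` objects is just a stand-in for a suspected leak):

```python
import tracemalloc

tracemalloc.start()
before = tracemalloc.take_snapshot()

# Stand-in for the code under investigation: allocate ~1 MB of small objects
leak = [bytes(1024) for _ in range(1000)]

after = tracemalloc.take_snapshot()
diff = after.compare_to(before, 'lineno')
for stat in diff[:3]:
    print(stat)  # The allocation line above should dominate the diff

current, peak = tracemalloc.get_traced_memory()
print(f"current={current} B, peak={peak} B")
tracemalloc.stop()
```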
memory_profiler (Line-by-line)
Install: pip install memory_profiler
```python
from memory_profiler import profile

@profile
def my_function():
    a = [1] * 1_000_000
    b = [2] * 2_000_000
    del b
    return a
```
```bash
# Line-by-line output for @profile-decorated functions
python -m memory_profiler script.py

# Record memory usage over time, then plot it
mprof run script.py
mprof plot
```
objgraph (Object References)
Install: pip install objgraph
```python
import objgraph

# Most common types
objgraph.show_most_common_types(limit=20)

# Growth since last call
objgraph.show_growth()

# Find reference chains (memory leak detection)
objgraph.show_backrefs([leaked_object], filename='refs.png')
```
Line Profiler
Install: pip install line_profiler

Decorate functions to profile:

```python
# kernprof injects `profile` into builtins, so no import is needed
@profile
def slow_function():
    total = 0
    for i in range(1000000):
        total += i
    return total
```

Run with: kernprof -l -v script.py
High-Resolution Timing
```python
import time

# Monotonic clock (best for measuring durations)
start = time.perf_counter()
result = do_work()
duration = time.perf_counter() - start
print(f"Duration: {duration:.4f}s")
```
```python
# Nanosecond precision (Python 3.7+)
start = time.perf_counter_ns()
result = do_work()
duration_ns = time.perf_counter_ns() - start
print(f"Duration: {duration_ns}ns")
```
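These calls are easy to package into a reusable decorator; a sketch (the `timed` name is our choice, not a standard-library API):

```python
import functools
import time

def timed(func):
    """Print the wall-clock duration of each call using a monotonic clock."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            duration = time.perf_counter() - start
            print(f"{func.__name__}: {duration:.4f}s")
    return wrapper

@timed
def busy():
    return sum(i * i for i in range(100_000))

busy()
```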
```python
import timeit

# Time small code snippets
duration = timeit.timeit('sum(range(1000))', number=10000)
print(f"Average: {duration / 10000:.6f}s")
```
Compare implementations

```python
setup = "data = list(range(10000))"
time1 = timeit.timeit('sum(data)', setup, number=1000)
time2 = timeit.timeit('sum(x for x in data)', setup, number=1000)
print(f"sum(): {time1:.4f}s, generator: {time2:.4f}s")
```
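On a noisy machine a single timeit run can mislead; `timeit.repeat` combined with `min()` gives a more stable lower bound, since the fastest run is the least-interfered-with measurement:

```python
import timeit

# Run the benchmark 5 times and keep the fastest run
times = timeit.repeat('sum(range(1000))', number=10_000, repeat=5)
best = min(times)
print(f"Best of 5: {best / 10_000:.8f}s per call")
```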
Common Bottleneck Patterns
❌ Bad: Concatenating lists in loop

```python
result = []
for item in items:
    result = result + [process(item)]  # O(n²)
```

✅ Good: Use append

```python
result = []
for item in items:
    result.append(process(item))  # O(n)
```

✅ Better: List comprehension

```python
result = [process(item) for item in items]
```
❌ Bad: Checking membership in list

```python
if item in large_list:  # O(n)
    pass
```

✅ Good: Use set for membership

```python
large_set = set(large_list)
if item in large_set:  # O(1)
    pass
```
❌ Bad: String concatenation in loop

```python
result = ""
for s in strings:
    result += s  # Creates a new string each time
```

✅ Good: Use join

```python
result = "".join(strings)
```
❌ Bad: Format in loop

```python
for item in items:
    log(f"Processing {item}")
```

✅ Good: Lazy formatting

```python
import logging

for item in items:
    logging.debug("Processing %s", item)  # Only formats if the record is emitted
```
Dictionary Operations
❌ Bad: Repeated key lookup

```python
if key in d:
    value = d[key]
    process(value)
```

✅ Good: Use get or setdefault

```python
value = d.get(key)
if value is not None:
    process(value)
```
❌ Bad: Checking then setting

```python
if key not in d:
    d[key] = []
d[key].append(value)
```

✅ Good: Use defaultdict

```python
from collections import defaultdict

d = defaultdict(list)
d[key].append(value)
```
Generator vs List
❌ Bad: Creating large intermediate lists

```python
result = sum([x * 2 for x in range(10_000_000)])  # Builds the full list in memory
```

✅ Good: Use generator

```python
result = sum(x * 2 for x in range(10_000_000))  # Lazy evaluation
```
Process large files

```python
# ❌ Bad: reads the whole file into memory
data = open('large.csv').readlines()
for line in data:
    process(line)

# ✅ Good: stream line by line
with open('large.csv') as f:
    for line in f:
        process(line)
```
NumPy Optimization
❌ Bad: Python loops over arrays

```python
result = []
for i in range(len(arr)):
    result.append(arr[i] * 2)
```

✅ Good: Vectorized operations

```python
result = arr * 2  # Single vectorized (SIMD-capable) operation
```
❌ Bad: Creating many temporary arrays

```python
result = (arr1 + arr2) * arr3 / arr4  # Allocates intermediate temporaries
```

✅ Good: In-place operations when possible

```python
result = arr1.copy()
result += arr2
result *= arr3
result /= arr4
```

Use appropriate dtypes

```python
arr = np.array(data, dtype=np.float32)  # Half the memory of float64
```
Async Optimization
❌ Bad: Sequential async

```python
import asyncio
import aiohttp

async def fetch_all_sequential(urls):
    results = []
    async with aiohttp.ClientSession() as session:
        for url in urls:
            async with session.get(url) as resp:
                results.append(await resp.text())
    return results
```
✅ Good: Concurrent async

```python
async def fetch_all_concurrent(urls):
    async with aiohttp.ClientSession() as session:
        tasks = [session.get(url) for url in urls]
        responses = await asyncio.gather(*tasks)
        return [await r.text() for r in responses]
```
✅ Better: With concurrency limit

```python
from asyncio import Semaphore

async def fetch_with_limit(urls, limit=10):
    semaphore = Semaphore(limit)
    # One shared session so connections are pooled across requests
    async with aiohttp.ClientSession() as session:

        async def fetch_one(url):
            async with semaphore:
                async with session.get(url) as resp:
                    return await resp.text()

        return await asyncio.gather(*(fetch_one(url) for url in urls))
```
Multiprocessing

CPU-bound work

```python
from multiprocessing import Pool, cpu_count

def cpu_intensive(x):
    return sum(i * i for i in range(x))

with Pool(cpu_count()) as pool:
    results = pool.map(cpu_intensive, range(100))
```
Using ProcessPoolExecutor

```python
from concurrent.futures import ProcessPoolExecutor

with ProcessPoolExecutor() as executor:
    results = list(executor.map(cpu_intensive, range(100)))
```
Shared memory (Python 3.8+)

```python
from multiprocessing import shared_memory
import numpy as np

# Create a shared copy of an existing NumPy array `arr`
shm = shared_memory.SharedMemory(create=True, size=arr.nbytes)
shared_arr = np.ndarray(arr.shape, dtype=arr.dtype, buffer=shm.buf)
shared_arr[:] = arr[:]

# Clean up: every process calls shm.close(); the creator also calls shm.unlink()
```
Profiling Checklist
| Check | Tool | Command |
|---|---|---|
| CPU hotspots | cProfile | `python -m cProfile script.py` |
| Line-by-line | line_profiler | `kernprof -l -v script.py` |
| Memory usage | tracemalloc | `tracemalloc.start()` in code |
| Memory per line | memory_profiler | `@profile` decorator |
| Object references | objgraph | `objgraph.show_most_common_types()` |
| Quick benchmarks | timeit | `python -m timeit` |
py-spy (Sampling Profiler)
Install: pip install py-spy

```bash
# Record a flame graph of a new process
py-spy record -o profile.svg -- python script.py

# Top-like live view of a running process
py-spy top --pid <PID>

# Dump the current call stacks of a running process
py-spy dump --pid <PID>

# Profile subprocesses as well
py-spy record --subprocesses -o profile.svg -- python script.py
```
Production Optimization
Use __slots__ for memory efficiency

```python
class Point:
    __slots__ = ['x', 'y']

    def __init__(self, x, y):
        self.x = x
        self.y = y
```
Use lru_cache for memoization

```python
from functools import lru_cache

@lru_cache(maxsize=1000)
def expensive_computation(x):
    return x ** 2
```
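`lru_cache` also exposes hit/miss counters via `cache_info()`, a quick way to confirm the cache is actually being used:

```python
from functools import lru_cache

@lru_cache(maxsize=1000)
def expensive_computation(x):
    return x ** 2

expensive_computation(3)      # First call: a miss, result is computed
expensive_computation(3)      # Second call: served from the cache

info = expensive_computation.cache_info()
print(info)  # hits=1, misses=1 after the two calls above
```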
Use dataclasses with slots (Python 3.10+)

```python
from dataclasses import dataclass

@dataclass(slots=True)
class Point:
    x: float
    y: float
```
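The savings come from dropping the per-instance `__dict__`; a quick way to see the difference (the class names here are illustrative):

```python
class PlainPoint:
    def __init__(self, x, y):
        self.x = x
        self.y = y

class SlottedPoint:
    __slots__ = ('x', 'y')

    def __init__(self, x, y):
        self.x = x
        self.y = y

plain = PlainPoint(1.0, 2.0)
slotted = SlottedPoint(1.0, 2.0)

print(hasattr(plain, '__dict__'))    # True — attribute dict allocated per instance
print(hasattr(slotted, '__dict__'))  # False — attributes live in fixed slots
```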
| Anti-Pattern | Why It's Wrong | Correct Approach |
|---|---|---|
| Using `+=` to concatenate strings in a loop | O(n²) time complexity | Use `"".join()` or a list comprehension |
| List comprehension when a generator suffices | Unnecessary memory allocation | Use a generator expression for one-time iteration |
| `range(len(seq))` when `enumerate` is needed | Manual index tracking, error-prone | Use `enumerate` for index and value |
| Checking membership in a list | O(n) lookup | Use a `set` for O(1) membership testing |
| Global variables everywhere | Hard to profile, side effects | Pass parameters, return values |
| Not using NumPy for numerical work | Orders of magnitude slower | Vectorize with NumPy for array operations |
| Premature optimization | Wasted effort, harder to maintain | Profile first, optimize bottlenecks |
| Using `from module import *` | Namespace pollution, slower imports | Import specific names |
| `append` in a loop when the size is known | Multiple reallocations | Use a list comprehension or pre-allocate |
| Not using `__slots__` for many instances | Higher memory usage | Use `__slots__` for classes with many instances |
Quick Troubleshooting
| Issue | Diagnosis | Solution |
|---|---|---|
| Slow loops over large data | Python loops are slow | Vectorize with NumPy, use list comprehensions |
| High memory usage | Creating large intermediate objects | Use generators, process in chunks |
| GIL contention | Multi-threading doesn't speed up CPU work | Use `multiprocessing` for CPU-bound tasks |
| Slow imports | Large modules with side effects | Lazy import, reduce module-level code |
| Memory leak | Objects not being garbage collected | Check for circular references, inspect with objgraph |
| Recursion too deep | `RecursionError` raised | Increase the limit with `sys.setrecursionlimit()` or refactor to iteration |
| Slow dictionary operations | Hash collisions | Ensure keys are hashable and well-distributed |
| High CPU in profiler | C extensions not showing | Use a sampling profiler like py-spy |
| Out of memory with large file | Loading the entire file | Open the file and iterate line by line |
| Slow JSON parsing | Large JSON file | Use a streaming parser (ijson) or pandas |
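For the recursion row, "refactor to iteration" often just means replacing the call stack with an explicit loop; a sketch using factorial as a stand-in for any deep recursion:

```python
import sys

def factorial_recursive(n):
    # Fails with RecursionError once n approaches the interpreter's limit
    return 1 if n <= 1 else n * factorial_recursive(n - 1)

def factorial_iterative(n):
    # Constant stack depth regardless of n
    result = 1
    for i in range(2, n + 1):
        result *= i
    return result

limit = sys.getrecursionlimit()  # Typically 1000 by default
print(factorial_iterative(5))    # 120
```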
- FastAPI
- Django
- NumPy/Pandas