# Python Performance Profiling

## When NOT to Use This Skill

- Java/JVM profiling - Use the `java-profiling` skill for JFR and GC tuning
- Node.js profiling - Use the `nodejs-profiling` skill for the V8 profiler
- NumPy/Pandas optimization - Use library-specific profiling tools and vectorization guides
- Database query optimization - Use database-specific profiling tools
- Web server performance - Use application-level profiling (Django Debug Toolbar, Flask-DebugToolbar)

Deep Knowledge: Use `mcp__documentation__fetch_docs` with technology `python` for comprehensive profiling guides, optimization techniques, and best practices.

## cProfile (CPU Profiling)

### Command Line Usage

```bash
# Profile entire script
python -m cProfile -o output.prof script.py

# Sort by cumulative time
python -m cProfile -s cumtime script.py

# Sort by total time in function
python -m cProfile -s tottime script.py

# Analyze saved profile
python -m pstats output.prof
```

### pstats Analysis

```python
import pstats

# Load and analyze profile
stats = pstats.Stats('output.prof')
stats.strip_dirs()
stats.sort_stats('cumulative')
stats.print_stats(20)  # Top 20 functions

# Filter by module
stats.print_stats('mymodule')

# Show callers
stats.print_callers('slow_function')

# Show callees
stats.print_callees('main')
```

### Programmatic Profiling

```python
import cProfile
import pstats
from io import StringIO

def profile_function(func, *args, **kwargs):
    profiler = cProfile.Profile()
    profiler.enable()

    result = func(*args, **kwargs)

    profiler.disable()

    # Analyze
    stream = StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats('cumulative')
    stats.print_stats(10)
    print(stream.getvalue())

    return result

# Context manager
from contextlib import contextmanager

@contextmanager
def profile_block(name='profile'):
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        yield
    finally:
        profiler.disable()
        profiler.dump_stats(f'{name}.prof')
```

## Memory Profiling

### tracemalloc (Built-in)

```python
import tracemalloc

# Start tracking
tracemalloc.start()

# Your code here
result = process_data()

# Get snapshot
snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics('lineno')

print("Top 10 memory allocations:")
for stat in top_stats[:10]:
    print(stat)

# Compare snapshots
snapshot1 = tracemalloc.take_snapshot()
# ... code ...
snapshot2 = tracemalloc.take_snapshot()
diff = snapshot2.compare_to(snapshot1, 'lineno')
for stat in diff[:10]:
    print(stat)

# Stop tracking
tracemalloc.stop()
```

### memory_profiler (Line-by-line)

```python
# Install: pip install memory_profiler
from memory_profiler import profile

@profile
def my_function():
    a = [1] * 1_000_000
    b = [2] * 2_000_000
    del b
    return a

# Command line usage:
#   python -m memory_profiler script.py
#
# Record memory over time, then plot:
#   mprof run script.py
#   mprof plot
```

### objgraph (Object References)

```python
# Install: pip install objgraph
import objgraph

# Most common types
objgraph.show_most_common_types(limit=20)

# Growth since last call
objgraph.show_growth()

# Find reference chain (memory leak detection);
# `leaked_object` is whatever object you suspect is leaking
objgraph.show_backrefs([leaked_object], filename='refs.png')
```

## Line Profiler

```python
# Install: pip install line_profiler

# Decorate functions to profile (`@profile` is injected by kernprof at runtime)
@profile
def slow_function():
    total = 0
    for i in range(1000000):
        total += i
    return total

# Run with: kernprof -l -v script.py
```

## High-Resolution Timing

### time Module

```python
import time

# Monotonic clock (best for measuring durations)
start = time.perf_counter()
result = do_work()
duration = time.perf_counter() - start
print(f"Duration: {duration:.4f}s")

# Nanosecond precision (Python 3.7+)
start = time.perf_counter_ns()
result = do_work()
duration_ns = time.perf_counter_ns() - start
print(f"Duration: {duration_ns}ns")
```

### timeit Module

```python
import timeit

# Time small code snippets
duration = timeit.timeit('sum(range(1000))', number=10000)
print(f"Average: {duration / 10000:.6f}s")

# Compare implementations
setup = "data = list(range(10000))"
time1 = timeit.timeit('sum(data)', setup, number=1000)
time2 = timeit.timeit('sum(x for x in data)', setup, number=1000)
print(f"sum(): {time1:.4f}s, generator: {time2:.4f}s")
```

## Common Bottleneck Patterns

### List Operations

```python
# ❌ Bad: Concatenating lists in loop
result = []
for item in items:
    result = result + [process(item)]  # O(n²)

# ✅ Good: Use append
result = []
for item in items:
    result.append(process(item))  # O(n)

# ✅ Better: List comprehension
result = [process(item) for item in items]

# ❌ Bad: Checking membership in list
if item in large_list:  # O(n)
    pass

# ✅ Good: Use set for membership
large_set = set(large_list)
if item in large_set:  # O(1)
    pass
```
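The O(n²) vs O(n) gap above is easy to verify with `timeit`; a minimal sketch (the 1000-element size and repeat count are arbitrary choices):

```python
import timeit

# Rebuilding the list via `+` copies it on every iteration (quadratic)
concat = timeit.timeit(
    "r = []\nfor i in range(1000): r = r + [i]", number=50)

# `append` mutates in place (amortized O(1) per element)
append = timeit.timeit(
    "r = []\nfor i in range(1000): r.append(i)", number=50)

print(f"concat: {concat:.4f}s, append: {append:.4f}s")
```

Even at this small size, `append` wins clearly; the gap widens as the list grows.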

### String Operations

```python
# ❌ Bad: String concatenation in loop
result = ""
for s in strings:
    result += s  # Creates new string each time

# ✅ Good: Use join
result = "".join(strings)

# ❌ Bad: Eager formatting in loop
for item in items:
    log(f"Processing {item}")

# ✅ Good: Lazy formatting
import logging
for item in items:
    logging.debug("Processing %s", item)  # Only formats if needed
```

### Dictionary Operations

```python
# ❌ Bad: Repeated key lookup
if key in d:
    value = d[key]
    process(value)

# ✅ Good: Use get or setdefault
value = d.get(key)
if value is not None:
    process(value)

# ❌ Bad: Checking then setting
if key not in d:
    d[key] = []
d[key].append(value)

# ✅ Good: Use defaultdict
from collections import defaultdict
d = defaultdict(list)
d[key].append(value)
```

### Generator vs List

```python
# ❌ Bad: Creating large intermediate lists
result = sum([x * 2 for x in range(10_000_000)])  # Builds the full list in memory

# ✅ Good: Use generator
result = sum(x * 2 for x in range(10_000_000))  # Lazy evaluation

# Process large files
# ❌ Bad
data = open('large.csv').readlines()  # All in memory
for line in data:
    process(line)

# ✅ Good
with open('large.csv') as f:  # Stream line by line
    for line in f:
        process(line)
```
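The memory difference is directly visible with `sys.getsizeof`: a list comprehension materializes every element, while a generator expression is a small fixed-size object regardless of how many items it will yield. A quick check (exact sizes are CPython-specific):

```python
import sys

list_comp = [x * 2 for x in range(100_000)]
gen_expr = (x * 2 for x in range(100_000))

# The list stores 100,000 element pointers; the generator stores none
print(sys.getsizeof(list_comp))  # hundreds of KB
print(sys.getsizeof(gen_expr))   # a few hundred bytes
```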

## NumPy Optimization

```python
import numpy as np

# ❌ Bad: Python loops over arrays
result = []
for i in range(len(arr)):
    result.append(arr[i] * 2)

# ✅ Good: Vectorized operations
result = arr * 2  # SIMD operations

# ❌ Bad: Creating many temporary arrays
result = (arr1 + arr2) * arr3 / arr4  # 3 temporaries

# ✅ Good: In-place operations when possible
result = arr1.copy()
result += arr2
result *= arr3
result /= arr4

# Use appropriate dtypes
arr = np.array(data, dtype=np.float32)  # Half the memory of float64
```

## Async Optimization

```python
import asyncio
import aiohttp

# ❌ Bad: Sequential async
async def fetch_all_sequential(urls):
    results = []
    async with aiohttp.ClientSession() as session:
        for url in urls:
            async with session.get(url) as resp:
                results.append(await resp.text())
    return results

# ✅ Good: Concurrent async
async def fetch_all_concurrent(urls):
    async with aiohttp.ClientSession() as session:
        async def fetch(url):
            async with session.get(url) as resp:
                return await resp.text()
        return await asyncio.gather(*(fetch(url) for url in urls))

# ✅ Better: With a concurrency limit
from asyncio import Semaphore

async def fetch_with_limit(urls, limit=10):
    semaphore = Semaphore(limit)
    # Reuse one session; open at most `limit` requests at a time
    async with aiohttp.ClientSession() as session:
        async def fetch_one(url):
            async with semaphore:
                async with session.get(url) as resp:
                    return await resp.text()
        return await asyncio.gather(*(fetch_one(url) for url in urls))
```

## Multiprocessing

```python
from multiprocessing import Pool, cpu_count
from concurrent.futures import ProcessPoolExecutor

# CPU-bound work
def cpu_intensive(x):
    return sum(i * i for i in range(x))

# Using Pool (guard with `if __name__ == '__main__':` on spawn-based platforms)
with Pool(cpu_count()) as pool:
    results = pool.map(cpu_intensive, range(100))

# Using ProcessPoolExecutor
with ProcessPoolExecutor() as executor:
    results = list(executor.map(cpu_intensive, range(100)))

# Shared memory (Python 3.8+); `arr` is an existing NumPy array
from multiprocessing import shared_memory
import numpy as np

# Create shared array
shm = shared_memory.SharedMemory(create=True, size=arr.nbytes)
shared_arr = np.ndarray(arr.shape, dtype=arr.dtype, buffer=shm.buf)
shared_arr[:] = arr[:]
```

## Profiling Checklist

| Check | Tool | Command |
|-------|------|---------|
| CPU hotspots | cProfile | `python -m cProfile script.py` |
| Line-by-line | line_profiler | `kernprof -l -v script.py` |
| Memory usage | tracemalloc | `tracemalloc.start()` |
| Memory per line | memory_profiler | `@profile` decorator |
| Object references | objgraph | `objgraph.show_growth()` |
| Quick benchmarks | timeit | `timeit.timeit()` |
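The first two rows can be exercised without leaving Python: profile a function in-process and read the report back through `pstats` (the toy `hot` function here is illustrative):

```python
import cProfile
import pstats
from io import StringIO

def hot():
    return sum(i * i for i in range(50_000))

profiler = cProfile.Profile()
profiler.enable()
hot()
profiler.disable()

# Render the report into a string instead of stdout
stream = StringIO()
pstats.Stats(profiler, stream=stream).sort_stats('tottime').print_stats(5)
report = stream.getvalue()
print('hot' in report)  # the hot function shows up in the top entries
```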

## py-spy (Sampling Profiler)

```bash
# Install: pip install py-spy

# Record profile (flame graph)
py-spy record -o profile.svg -- python script.py

# Top-like view of running process
py-spy top --pid <pid>

# Dump current stack
py-spy dump --pid <pid>

# Profile subprocesses
py-spy record --subprocesses -o profile.svg -- python script.py
```

## Production Optimization

```python
# Use __slots__ for memory efficiency
class Point:
    __slots__ = ['x', 'y']

    def __init__(self, x, y):
        self.x = x
        self.y = y

# Use lru_cache for memoization
from functools import lru_cache

@lru_cache(maxsize=1000)
def expensive_computation(x):
    return x ** 2

# Use dataclasses with slots (Python 3.10+)
from dataclasses import dataclass

@dataclass(slots=True)
class Point:
    x: float
    y: float
```
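`lru_cache` also reports whether the cache is earning its keep: `cache_info()` returns hit/miss counters, worth checking before trusting that memoization helps. A small verification:

```python
from functools import lru_cache

@lru_cache(maxsize=1000)
def expensive_computation(x):
    return x ** 2

# First call with a given argument misses; repeats hit the cache
for _ in range(3):
    expensive_computation(4)

info = expensive_computation.cache_info()
print(info.hits, info.misses)  # → 2 1
```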

## Anti-Patterns

| Anti-Pattern | Why It's Wrong | Correct Approach |
|--------------|----------------|------------------|
| Using `+` to concatenate strings in a loop | O(n²) time complexity | Use `''.join()` or a list comprehension |
| List comprehension when a generator suffices | Unnecessary memory allocation | Use a generator expression for one-time iteration |
| `range()` when `enumerate()` is needed | Manual index tracking, error-prone | Use `enumerate()` for index and value |
| Checking membership in a list | O(n) lookup | Use a `set` for O(1) membership testing |
| `global` variables everywhere | Hard to profile, side effects | Pass parameters, return values |
| Not using NumPy for numerical work | Orders of magnitude slower | Vectorize with NumPy for array operations |
| Premature optimization | Wasted effort, harder to maintain | Profile first, optimize bottlenecks |
| Using `import *` | Namespace pollution, slower imports | Import specific names |
| `.append()` in a loop when size is known | Multiple reallocations | Pre-allocate with a list comprehension or `[None] * size` |
| Not using `__slots__` for many instances | Higher memory usage | Use `__slots__` for classes with many instances |
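The list-membership row above is straightforward to measure; a sketch with a worst-case target at the end of the list (sizes and repeat counts are arbitrary):

```python
import timeit

setup = "data = list(range(100_000)); s = set(data); target = 99_999"

# Linear scan through the whole list on every lookup
list_time = timeit.timeit("target in data", setup=setup, number=100)

# Single hash probe per lookup
set_time = timeit.timeit("target in s", setup=setup, number=100)

print(f"list: {list_time:.5f}s, set: {set_time:.5f}s")
```

The set wins by several orders of magnitude here; remember the one-time O(n) cost of building the set only pays off across repeated lookups.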

## Quick Troubleshooting

| Issue | Diagnosis | Solution |
|-------|-----------|----------|
| Slow loops over large data | Python loops are slow | Vectorize with NumPy, use list comprehensions |
| High memory usage | Creating large intermediate objects | Use generators, process in chunks |
| GIL contention | Multi-threading doesn't speed up CPU work | Use `multiprocessing` for CPU-bound tasks |
| Slow imports | Large modules with side effects | Lazy import, reduce module-level code |
| Memory leak | Objects not being garbage collected | Check for circular references, use `weakref` |
| `RecursionError` | Recursion too deep | Increase limit with `sys.setrecursionlimit()` or refactor to iteration |
| Slow dictionary operations | Hash collisions | Ensure keys are hashable and well-distributed |
| High CPU but profiler shows nothing | Deterministic profilers can't see inside C extensions | Use a sampling profiler like `py-spy` |
| Out of memory with a large file | Loading the entire file | Use `with open()` and iterate line by line |
| Slow JSON parsing | Large JSON file | Use a streaming parser (ijson) or pandas |
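For the memory-leak row, `weakref` plus the cycle collector gives a quick way to confirm that objects caught in a reference cycle are reclaimed by `gc.collect()`, not by reference counting alone. A minimal sketch (the `Node` class is illustrative):

```python
import gc
import weakref

class Node:
    def __init__(self):
        self.ref = None

# Build a reference cycle: a -> b -> a
a, b = Node(), Node()
a.ref, b.ref = b, a

probe = weakref.ref(a)
del a, b
# Refcounts inside the cycle never reach zero, so `del` alone may not free
# the objects; the generational collector is what actually reclaims them
gc.collect()
print(probe() is None)  # → True
```

The same `weakref.ref` probe works in production code to check whether a suspect object has really been freed.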

## Related Skills

- FastAPI
- Django
- NumPy/Pandas