numba

Numba - High-Performance Python with JIT

Numba is a just-in-time (JIT) compiler that makes numerical Python code run fast. You enable it by applying decorators to your functions, which tell Numba to compile them to machine code. It is particularly effective for code that involves heavy numerical loops and NumPy array manipulations.

When to Use

  • NumPy's built-in vectorization isn't enough for your specific algorithm.
  • You have complex nested loops that are slow in standard Python.
  • You need to write custom "ufuncs" (universal functions) that operate element-wise on arrays.
  • You run high-performance physical simulations (Monte Carlo, N-body, grid-based solvers).
  • You want to accelerate code for execution on NVIDIA GPUs (CUDA).
  • You want parallelized code that uses all CPU cores without the overhead of multiprocessing.

Reference Documentation

Official docs: https://numba.pydata.org/numba-doc/latest/index.html
User Guide: https://numba.pydata.org/numba-doc/latest/user/index.html
Search patterns: @njit, @vectorize, prange, cuda.jit, numba.typed

Core Principles

nopython Mode (@njit)

This is the "gold standard" for Numba. In this mode, Numba compiles the function without going through the Python C-API, resulting in maximum speed. If it can't compile the function (e.g., because of unsupported Python objects), it raises an error instead of silently running slow code.

Just-In-Time (JIT) Compilation

Compilation happens the first time you call the function with a given combination of argument types. The generated machine code is then kept in memory and reused for subsequent calls with the same types.

Array-Oriented

Numba is designed to work with NumPy arrays. It understands their memory layout and can generate highly optimized loops over them.

Quick Reference

Installation

bash
pip install numba

Standard Imports

python
import numpy as np
from numba import njit, prange, vectorize, guvectorize, cuda

Basic Pattern - Accelerating a Loop

python
import numpy as np
from numba import njit

# 1. Apply the @njit decorator (alias for @jit(nopython=True))
@njit
def sum_array(arr):
    res = 0.0
    # Standard Python loop that would be slow now runs at close to C speed
    for i in range(arr.shape[0]):
        res += arr[i]
    return res

# 2. Execute
data = np.random.random(1_000_000)
result = sum_array(data)  # First call compiles, then runs

Critical Rules

✅ DO

  • Prefer @njit - Always use nopython=True (or its alias @njit). It ensures your code is compiled to machine code rather than run through the interpreter.
  • Use NumPy Arrays - Numba is optimized for NumPy. Avoid standard Python lists inside jitted functions.
  • Enable Parallelism - Use @njit(parallel=True) and prange instead of range for automatic multi-threading.
  • Cache Compiled Code - Use @njit(cache=True) to avoid recompilation every time you restart your script.
  • Warm Up - Remember that the first call is slow due to compilation. In timing benchmarks, always run the function once before measuring.
  • Specify Types (Optional) - You can provide signatures (e.g., (float64[:],)) to speed up the very first call, but Numba usually infers them well.

❌ DON'T

  • Don't use Python Objects - Strings, dictionaries, and custom classes are slow or unsupported in nopython mode. Use numba.typed for specialized containers if needed.
  • Don't JIT small functions - The overhead of calling a jitted function from Python can outweigh the gains for trivial operations.
  • Don't use unsupported libraries - You cannot use pandas, matplotlib, or requests inside an @njit function.
  • Don't modify global state - Jitted functions should be as "pure" as possible for stability.

Anti-Patterns (NEVER)

python
from numba import njit
import pandas as pd

# ❌ BAD: Using pandas inside @njit (unsupported)
@njit
def bad_func(df):
    return df['col'].sum()  # Raises a TypingError when called: Numba cannot type a DataFrame

# ✅ GOOD: Pass NumPy arrays instead
@njit
def good_func(arr):
    return arr.sum()

# ❌ BAD: Using @jit without nopython=True
from numba import jit

@jit
def slow_func(x):
    # On Numba releases before 0.59 this might fall back to "object mode" (slow)
    return x + 1

# ✅ GOOD: Always ensure nopython mode
@njit
def fast_func(x):
    return x + 1

# ❌ BAD: Manual loops in Python to call a JIT function
# for i in range(1000):
#     process_element(arr[i])  # Pays the Python-to-JIT call overhead 1000 times

# ✅ GOOD: Move the loop INSIDE the @njit function
# (process_element is assumed to be another @njit function)
@njit
def process_all(arr):
    for i in range(arr.shape[0]):
        process_element(arr[i])

Parallelism and Vectorization

Automatic Multi-threading

python
from numba import njit, prange

@njit(parallel=True)
def parallel_sum(A):
    # Use prange for the loop that should be parallelized.
    # Numba treats `s += ...` as a reduction and combines per-thread partial sums.
    s = 0.0
    for i in prange(A.shape[0]):
        s += A[i]
    return s

Creating Fast ufuncs (@vectorize)

python
import numpy as np
from numba import vectorize

# This creates a NumPy ufunc that supports broadcasting
@vectorize(['float64(float64, float64)'], target='parallel')
def fast_add(x, y):
    return x + y

# Now you can use it on massive arrays
arr1 = np.random.random(1_000_000)
arr2 = np.random.random(1_000_000)
res = fast_add(arr1, arr2)

Working with Structs and Types

numba.typed for Non-Array Data

python
from numba.typed import List, Dict
from numba import njit

@njit
def use_typed_list():
    l = List()
    l.append(1.0)
    return l

GPU Acceleration (numba.cuda)

Writing CUDA Kernels

python
from numba import cuda

@cuda.jit
def my_kernel(io_array):
    # Calculate thread indices
    pos = cuda.grid(1)
    if pos < io_array.size:
        io_array[pos] *= 2

# Usage
import numpy as np

data = np.ones(256)
threadsperblock = 32
blockspergrid = (data.size + (threadsperblock - 1)) // threadsperblock
# Launch configuration goes in square brackets, kernel arguments in parentheses
my_kernel[blockspergrid, threadsperblock](data)

Practical Workflows

1. Fast Monte Carlo Simulation

python
import random
from numba import njit, prange

@njit(parallel=True)
def monte_carlo_pi(nsamples):
    acc = 0
    for i in prange(nsamples):
        x = random.random()
        y = random.random()
        # Count points that land inside the unit quarter circle
        if (x**2 + y**2) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples

2. Custom Image Filter (Thresholding)

python
import numpy as np
from numba import njit

@njit
def apply_threshold(image, threshold):
    M, N = image.shape
    result = np.zeros_like(image)
    # Pixels above the threshold become 255; all others stay 0
    for i in range(M):
        for j in range(N):
            if image[i, j] > threshold:
                result[i, j] = 255
    return result

3. Solving a Physics Grid (Laplace Equation)

python
from numba import njit

@njit
def solve_laplace(u, niters):
    M, N = u.shape
    for n in range(niters):
        # In-place sweep over interior points: each update already sees
        # the new values of previously visited neighbors
        for i in range(1, M-1):
            for j in range(1, N-1):
                u[i, j] = 0.25 * (u[i+1, j] + u[i-1, j] + u[i, j+1] + u[i, j-1])
    return u

Performance Optimization

The inspect_types() method

Use this to see which types Numba inferred for each variable: native types such as float64 mean the code was fully optimized, while pyobject marks values that fell back to expensive Python objects.
python
fast_func.inspect_types()  # Prints the source annotated with inferred types; call fast_func once first so a compiled version exists

Avoid Array Allocation in Loops

Pre-allocate arrays outside the @njit function or pass them as arguments to avoid memory management overhead.
python
# ✅ GOOD:
@njit
def compute_into(out_arr, in_arr):
    for i in range(in_arr.shape[0]):
        out_arr[i] = in_arr[i] * 2

Common Pitfalls and Solutions

The "Global Variable" problem

Numba treats global variables as compile-time constants: it captures their value when the function is compiled, which happens on the first call.
python
from numba import njit

# ❌ Problem: Changing a global variable won't affect the jitted function
K = 10

@njit
def f(x):
    return x + K

f(1)    # First call compiles f and freezes K = 10; returns 11
K = 20
f(1)    # Result is still 11!

# ✅ Solution: Pass constants as arguments
@njit
def g(x, k):
    return x + k

g(1, 20)  # Returns 21

Object Mode Fallback

If Numba warns that compilation is falling back to object mode, your code will be slow. On releases before 0.59, this could happen when @jit was used without nopython=True.
python
# ✅ Solution: Force nopython mode
@njit  # If this raises an error, fix the code instead of removing @njit
def compute(x):  # placeholder function for illustration
    return x + 1

Random Seed in Parallel

Using np.random with parallel=True requires care to ensure independent streams for each thread. Numba handles this for you: its implementations of random.random() and np.random.random() are thread-safe, and each thread is seeded independently and automatically.

Best Practices

  1. Always use @njit - Never use @jit without nopython=True
  2. Pre-allocate arrays - Avoid creating arrays inside hot loops
  3. Use prange for parallelism - Enable automatic multi-threading with parallel=True and prange
  4. Cache compiled functions - Use cache=True to avoid recompilation
  5. Warm up functions - Call jitted functions once before benchmarking
  6. Pass NumPy arrays - Convert Python lists to NumPy arrays before calling jitted functions
  7. Avoid Python objects - Use numba.typed.List and numba.typed.Dict if you need containers
  8. Check compilation mode - Use inspect_types() to verify nopython mode
  9. Handle first-call overhead - Remember the first call compiles the function
  10. Use appropriate signatures - Optional but can speed up first compilation

Numba is the bridge that allows Python to compete with C++ and Fortran in the high-performance computing arena. It removes the "Python tax" from your loops, enabling rapid prototyping without sacrificing execution speed.