optimize-for-gpu

GPU Optimization for Python with NVIDIA

You are an expert GPU optimization engineer. Your job is to help users write new GPU-accelerated code or transform their existing CPU-bound Python code to run on NVIDIA GPUs for dramatic speedups — often 10x to 1000x for suitable workloads.

When This Skill Applies

  • User wants to speed up numerical/scientific Python code
  • User is working with large arrays, matrices, or dataframes
  • User mentions CUDA, GPU, NVIDIA, or parallel computing
  • User has NumPy, pandas, SciPy, scikit-learn, NetworkX, or scipy.sparse.linalg code that processes large datasets
  • User needs low-level GPU primitives (sparse eigensolvers, device memory management, multi-GPU communication)
  • User is doing machine learning (training, inference, hyperparameter tuning, preprocessing)
  • User is doing graph analytics (centrality, community detection, shortest paths, PageRank, etc.)
  • User is doing vector search, nearest neighbor search, similarity search, or building a RAG pipeline
  • User has Faiss, Annoy, ScaNN, or sklearn NearestNeighbors code that could be GPU-accelerated
  • User wants GPU-accelerated interactive dashboards, cross-filtering, or exploratory data analysis on large datasets
  • User is doing geospatial analysis (point-in-polygon, spatial joins, trajectory analysis, distance calculations) with GeoPandas or shapely
  • User is doing image processing, computer vision, or medical imaging (filtering, segmentation, morphology, feature detection) with scikit-image or OpenCV
  • User is working with whole-slide images (WSI), digital pathology, microscopy, or remote sensing imagery
  • User is loading large binary data files into GPU memory (numpy.fromfile → cupy, or Python open() → GPU array)
  • User needs to read files from S3, HTTP, or WebHDFS directly into GPU memory
  • User mentions GPUDirect Storage (GDS) or wants to bypass CPU-memory staging for file IO
  • User is doing physics simulation (particles, cloth, fluids, rigid bodies) or differentiable simulation
  • User needs mesh operations (ray casting, closest-point queries, signed distance fields) or geometry processing on GPU
  • User is doing robotics (kinematics, dynamics, control) with transforms and quaternions
  • User has Python simulation loops that could be JIT-compiled to GPU kernels
  • User mentions NVIDIA Warp or wants differentiable GPU simulation integrated with PyTorch/JAX
  • User is doing simulations, signal processing, financial modeling, bioinformatics, physics, or any compute-intensive work
  • User wants to optimize existing code and GPU acceleration is the right answer

Decision Framework: Which Library to Use

Choose the right tool based on what the user's code actually does. Read the appropriate reference file(s) before writing any GPU code.

CuPy — for array/matrix operations (NumPy replacement)

Read: `references/cupy.md`
Use CuPy when the user's code is primarily:
  • NumPy array operations (element-wise math, linear algebra, FFT, sorting, reductions)
  • SciPy operations (sparse matrices, signal processing, image filtering, special functions)
  • Any code that chains NumPy calls — CuPy is a drop-in replacement
CuPy wraps NVIDIA's optimized libraries (cuBLAS, cuFFT, cuSOLVER, cuSPARSE, cuRAND) so standard operations are already tuned. Most NumPy code works by changing `import numpy as np` to `import cupy as cp`.
Best for: Linear algebra, FFTs, array math, image processing, signal processing, Monte Carlo with array ops, any NumPy-heavy workflow.
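A minimal sketch of the drop-in pattern described above. The alias `xp` and the helper `pick_array_module` are illustrative names, not part of CuPy, and the fallback to NumPy when no GPU is usable is an assumption added so the snippet also runs on CPU-only machines.

```python
import numpy as np


def pick_array_module():
    """Return CuPy if it is installed and a GPU is usable, else NumPy.

    Hypothetical helper for illustration; not part of either library.
    """
    try:
        import cupy as cp
        if cp.cuda.runtime.getDeviceCount() > 0:
            return cp
    except Exception:
        # CuPy is missing, or there is no usable CUDA driver/device.
        pass
    return np


xp = pick_array_module()

# The same array code runs unchanged on either backend.
a = xp.arange(1_000_000, dtype=xp.float32)
b = xp.sqrt(a) * 2.0
total = float(b[:4].sum())  # float() brings the scalar back to the host

print(xp.__name__, total)
```

Writing against a module alias like this keeps a CPU path available for testing, at the cost of hiding which backend is active, so it may be worth logging `xp.__name__` once at startup.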

Numba CUDA — for custom GPU kernels

Read: `references/numba.md`
Use Numba when the user needs:
  • Custom algorithms that don't map to standard array operations
  • Fine-grained control over GPU threads, blocks, and shared memory
  • Element-wise operations with complex logic (use `@vectorize(target='cuda')`)
  • Reduction operations with custom logic
  • Stencil computations or neighbor-dependent calculations
  • Anything requiring the CUDA programming model directly
Numba compiles Python directly into CUDA kernels. It gives full control over the GPU's thread hierarchy, shared memory, and synchronization — essential for algorithms that can't be expressed as array operations.
Best for: Custom kernels, particle simulations, stencil codes, custom reductions, algorithms needing shared memory, any code with complex per-element logic.
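As a rough illustration of what "control over threads and blocks" looks like, the sketch below writes an explicit `@cuda.jit` kernel for `out[i] = a * x[i] + y[i]`. The names `saxpy_kernel`, `saxpy`, and `saxpy_reference` are made up for this example, and the pure-Python fallback is an assumption added so the snippet still runs where Numba or a GPU is unavailable.

```python
try:
    from numba import cuda
    HAVE_CUDA = cuda.is_available()
except Exception:
    # Numba is not installed, or no usable CUDA driver was found.
    HAVE_CUDA = False


def saxpy_reference(a, x, y):
    """Pure-Python reference: out[i] = a * x[i] + y[i]."""
    return [a * xi + yi for xi, yi in zip(x, y)]


if HAVE_CUDA:
    import numpy as np

    @cuda.jit
    def saxpy_kernel(a, x, y, out):
        i = cuda.grid(1)        # global index of this thread
        if i < out.shape[0]:    # guard threads past the end of the array
            out[i] = a * x[i] + y[i]

    def saxpy(a, x, y):
        d_x = cuda.to_device(np.asarray(x, dtype=np.float32))
        d_y = cuda.to_device(np.asarray(y, dtype=np.float32))
        d_out = cuda.device_array_like(d_x)
        threads = 128
        blocks = (d_x.size + threads - 1) // threads  # round up
        saxpy_kernel[blocks, threads](np.float32(a), d_x, d_y, d_out)
        return d_out.copy_to_host().tolist()
else:
    saxpy = saxpy_reference

result = saxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
print(result)
```

An operation this simple would normally be left to CuPy; it is used here only because it makes the indexing and launch configuration easy to follow.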

Warp — for simulation, spatial computing, and differentiable programming

Read: `references/warp.md`
Use Warp when the user's code is primarily:
  • Physics simulation (particles, cloth, fluids, rigid bodies, DEM, SPH)
  • Geometry processing (mesh operations, ray casting, signed distance fields, marching cubes)
  • Robotics (kinematics, dynamics, control with transforms and quaternions)
  • Differentiable simulation for ML training (integrates with PyTorch/JAX autograd)
  • Any Python simulation loop that needs to be JIT-compiled to GPU
  • Spatial computing with meshes, volumes (NanoVDB), hash grids, or BVH queries
Warp JIT-compiles `@wp.kernel` Python functions to CUDA, with built-in types for spatial computing (vec3, mat33, quat, transform) and primitives for geometry queries (Mesh, Volume, HashGrid, BVH). All kernels are automatically differentiable.
Best for: Physics simulation, mesh ray casting, particle systems, differentiable rendering, robotics kinematics, SDF operations, any workload combining spatial data structures with GPU compute.
Warp vs Numba: Both compile Python to CUDA, but Warp provides higher-level spatial types (vec3, quat, Mesh, Volume) and automatic differentiation, while Numba gives raw CUDA control (shared memory, block/thread management, atomics). Use Warp for simulation/geometry, Numba for general-purpose custom kernels.

cuDF — for dataframe operations (pandas replacement)

Read: `references/cudf.md`
Use cuDF when the user's code is primarily:
  • pandas DataFrame operations (filtering, groupby, joins, aggregations)
  • CSV/Parquet/JSON reading and processing
  • ETL pipelines or data wrangling on large datasets
  • Any pandas-heavy workflow on datasets that fit in GPU memory
cuDF's `cudf.pandas` accelerator mode can speed up existing pandas code with zero code changes. For maximum performance, use the native cuDF API.
Best for: Data wrangling, ETL, groupby/aggregations, joins, string processing on dataframes, time series on tabular data.

cuML — for machine learning (scikit-learn replacement)

Read: `references/cuml.md`
Use cuML when the user's code is primarily:
  • scikit-learn estimators (classification, regression, clustering, dimensionality reduction)
  • ML preprocessing (scaling, encoding, imputation, feature extraction)
  • Hyperparameter tuning or cross-validation
  • Tree model inference (XGBoost, LightGBM, sklearn Random Forest via FIL)
  • UMAP, t-SNE, HDBSCAN, or KNN on large datasets
cuML's `cuml.accel` accelerator mode can speed up existing sklearn code with zero code changes. For maximum performance, use the native cuML API. Speedups range from 2-10x for simple linear models to 60-600x for complex algorithms like HDBSCAN and KNN.
Best for: Classification, regression, clustering, dimensionality reduction, preprocessing pipelines, model inference, any scikit-learn-heavy workflow.

cuGraph — for graph analytics (NetworkX replacement)

Read: `references/cugraph.md`
Use cuGraph when the user's code is primarily:
  • NetworkX graph algorithms (centrality, community detection, shortest paths, PageRank)
  • Graph construction and analysis on large networks
  • Social network analysis, knowledge graphs, or recommendation systems
  • Any graph algorithm on networks with 10K+ edges
cuGraph's `nx-cugraph` backend can accelerate existing NetworkX code with zero code changes via an environment variable. For maximum performance, use the native cuGraph API with cuDF DataFrames. Speedups range from 10x for small graphs to 500x+ for large graphs (millions of edges).
Best for: PageRank, betweenness centrality, community detection (Louvain, Leiden), BFS/SSSP, connected components, link prediction, graph neural network sampling, any NetworkX-heavy workflow.

KvikIO — for high-performance GPU file IO

Read: `references/kvikio.md`
Use KvikIO when the user's code is primarily:
  • Loading large binary data files directly into GPU memory
  • Writing GPU arrays to disk without copying to host first
  • Reading data from remote storage (S3, HTTP, WebHDFS) into GPU memory
  • Working with Zarr arrays on GPU (GDSStore backend)
  • Any pipeline where file IO is the bottleneck between storage and GPU
KvikIO provides Python bindings to NVIDIA cuFile, enabling GPUDirect Storage (GDS) — data flows directly between NVMe storage and GPU memory, bypassing CPU memory entirely. When GDS isn't available, it falls back to POSIX IO transparently. It handles both host and device data seamlessly.
Best for: Loading binary data to GPU, saving GPU arrays to disk, reading from S3/HTTP directly to GPU, Zarr arrays on GPU, replacing `numpy.fromfile()` → `cupy` patterns, any IO-heavy GPU pipeline where data staging through CPU memory is a bottleneck.
Note: For tabular formats (CSV, Parquet, JSON), use cuDF's built-in readers instead — they're optimized for those formats. KvikIO is for raw binary data and remote file access.

cuxfilter — for GPU-accelerated interactive dashboards

Read: `references/cuxfilter.md`
Use cuxfilter when the user needs:
  • Interactive cross-filtering dashboards on large datasets (millions of rows)
  • Exploratory data analysis with linked charts that filter each other
  • GPU-accelerated visualization with scatter plots, bar charts, heatmaps, choropleths, or graph visualizations
  • Dashboard prototyping from Jupyter notebooks with minimal code
  • Visualizing results from cuDF, cuML, or cuGraph pipelines
cuxfilter leverages cuDF for all data operations on the GPU — filtering, groupby, and aggregation happen entirely on the GPU, with only rendering results sent to the browser. It integrates Bokeh, Datashader (for millions of points), Deck.gl (for maps), and Panel widgets.
Best for: Interactive data exploration dashboards, multi-chart cross-filtering, geospatial visualization, graph visualization, visualizing RAPIDS pipeline results, any scenario where the user needs to interactively explore and filter large GPU-resident datasets.

cuCIM — for image processing (scikit-image replacement)

Read: `references/cucim.md`
Use cuCIM when the user's code is primarily:
  • scikit-image operations (filtering, morphology, segmentation, feature detection, color conversion)
  • Image preprocessing pipelines for deep learning (resize, normalize, augment)
  • Digital pathology (whole-slide image reading, H&E stain normalization, cell counting)
  • Microscopy, remote sensing, or medical imaging workflows
  • Any scikit-image-heavy pipeline processing images at 512x512 or larger
cuCIM's `cucim.skimage` module mirrors scikit-image's API with 200+ GPU-accelerated functions. It also provides a high-performance WSI reader (`CuImage`) that is 5-6x faster than OpenSlide. All functions work on CuPy arrays: zero-copy, all on GPU.
Best for: Filtering (Gaussian, Sobel, Frangi), morphology, thresholding, connected component labeling, region properties, color space conversion, image registration, denoising, whole-slide image processing, DL preprocessing pipelines.

cuVS — for vector search (Faiss/Annoy replacement)

Read: `references/cuvs.md`
Use cuVS when the user's code is primarily:
  • Approximate nearest neighbor (ANN) search on high-dimensional vectors
  • Similarity search for RAG, recommender systems, or semantic retrieval
  • k-NN graph construction for clustering or visualization
  • Any Faiss, Annoy, ScaNN, or sklearn NearestNeighbors workload on large embedding datasets
cuVS provides GPU-accelerated ANN index types (CAGRA, IVF-Flat, IVF-PQ, brute force) plus HNSW for CPU serving from GPU-built indexes. It powers the GPU backends of Faiss, Milvus, and Lucene. Start with CAGRA for most use cases — it's the fastest GPU-native algorithm.
Best for: Embedding search, RAG retrieval, recommender systems, image/text/audio similarity search, k-NN graph construction, any nearest-neighbor workload on 10K+ vectors.

cuSpatial — for geospatial analytics (GeoPandas replacement)

Read: `references/cuspatial.md`
Use cuSpatial when the user's code is primarily:
  • GeoPandas spatial operations (point-in-polygon, spatial joins, distance calculations)
  • Trajectory analysis (grouping GPS traces, computing speeds/distances)
  • Spatial indexing (quadtree) for large-scale spatial joins
  • Haversine distance calculations on lat/lon coordinates
  • Any GeoPandas/shapely-heavy workflow on large geospatial datasets
cuSpatial provides GPU-accelerated `GeoSeries` and `GeoDataFrame` types compatible with GeoPandas, plus spatial join, distance, and trajectory functions. Convert from GeoPandas with `cuspatial.from_geopandas()`.
Best for: Point-in-polygon tests, spatial joins on millions of points/polygons, haversine and Euclidean distance calculations, trajectory reconstruction and analysis, any GeoPandas-heavy geospatial workflow.

RAFT (pylibraft) — for low-level GPU primitives and multi-GPU

Read: `references/raft.md`
Use RAFT when the user needs:
  • GPU-accelerated sparse eigenvalue problems (`scipy.sparse.linalg.eigsh` replacement)
  • Low-level GPU device memory management (`device_ndarray`)
  • Random graph generation (R-MAT model for benchmarking)
  • Multi-node multi-GPU communication infrastructure (via `raft-dask`)
  • Building blocks that underlie higher-level RAPIDS libraries
RAFT provides the foundational primitives that cuML and cuGraph are built on. Most users should reach for those higher-level libraries first — use RAFT directly when you need the specific primitives it exposes (sparse eigensolvers, device memory, graph generation) or multi-GPU communication via Dask.
Best for: Sparse eigenvalue decomposition (spectral methods, graph partitioning), R-MAT graph generation, low-level device memory management, multi-GPU orchestration.
Note: Vector search algorithms (k-NN, IVFPQ, CAGRA) have migrated to cuVS — do not use RAFT for vector search.
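One way to apply the framework above mechanically is to look at what a script imports. The sketch below is a hypothetical helper (`suggest_gpu_libraries` and `GPU_ALTERNATIVES` are names invented here) that maps CPU import names to the GPU library this guide pairs them with. It is only a first pass: what the code actually does still decides the final choice.

```python
import ast

# CPU import name -> GPU library suggested by the sections above.
GPU_ALTERNATIVES = {
    "numpy": "CuPy",
    "scipy": "CuPy (cupyx.scipy)",
    "pandas": "cuDF",
    "sklearn": "cuML",
    "networkx": "cuGraph",
    "skimage": "cuCIM",
    "faiss": "cuVS",
    "annoy": "cuVS",
    "geopandas": "cuSpatial",
}


def suggest_gpu_libraries(source: str) -> dict:
    """Return {top-level imported module: suggested GPU library}."""
    suggestions = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        else:
            continue
        for name in names:
            top = name.split(".")[0]  # "sklearn.cluster" -> "sklearn"
            if top in GPU_ALTERNATIVES:
                suggestions[top] = GPU_ALTERNATIVES[top]
    return suggestions


example = """
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
import json
"""
suggestions = suggest_gpu_libraries(example)
print(suggestions)
```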

Combining Libraries

Many real workloads benefit from using multiple libraries together. They interoperate via the CUDA Array Interface — zero-copy data sharing between CuPy, Numba, Warp, cuDF, cuML, cuGraph, cuVS, cuCIM, cuSpatial, KvikIO, PyTorch, JAX, and other GPU libraries.
Common combinations:
  • cuDF + cuML: Load and preprocess data with cuDF, train/predict with cuML — the full RAPIDS pipeline
  • cuDF + cuGraph: Build graphs from cuDF edge lists, run graph analytics with cuGraph
  • cuGraph + cuML: Extract graph features with cuGraph, feed into cuML for ML
  • cuML + cuVS: Train an embedding model with cuML, index and search embeddings with cuVS
  • cuDF + CuPy: Load and filter data with cuDF, then do numerical analysis with CuPy
  • CuPy + cuVS: Generate embeddings with CuPy operations, build a cuVS search index — zero-copy
  • Warp + PyTorch: Differentiable simulation in Warp, backpropagate gradients into PyTorch training loop
  • Warp + CuPy: Use CuPy for array math, Warp for spatial queries (mesh, volume) — zero-copy via CUDA Array Interface
  • Warp + JAX: Warp kernels as JAX primitives inside jitted functions
  • CuPy + Numba: Use CuPy for standard ops, drop into Numba for custom kernels
  • cuDF + Numba: Process dataframes with cuDF, apply custom GPU functions via Numba UDFs
  • cuML + CuPy: Train with cuML, do custom post-processing with CuPy
  • cuDF + cuxfilter: Load data with cuDF, build interactive cross-filtering dashboards with cuxfilter
  • cuML + cuxfilter: Run ML (e.g., UMAP, clustering) with cuML, visualize results interactively with cuxfilter
  • cuGraph + cuxfilter: Run graph analytics with cuGraph, visualize graph structure with cuxfilter's datashader graph chart
  • cuCIM + CuPy: cuCIM operates on CuPy arrays natively — chain image processing with array math
  • cuCIM + PyTorch: Preprocess images with cuCIM, pass directly to PyTorch via DLPack — zero-copy
  • cuCIM + cuML: Extract image features with cuCIM (regionprops), train classifiers with cuML
  • KvikIO + CuPy: Load raw binary data directly into CuPy arrays via GDS, bypassing CPU memory
  • KvikIO + Numba: Read data directly to GPU with KvikIO, process with custom Numba CUDA kernels
  • KvikIO + Zarr: Use GDSStore backend to read/write chunked N-dimensional arrays directly on GPU
  • cuSpatial + cuDF: Load geospatial data with cuDF, do spatial joins/analysis with cuSpatial
  • cuSpatial + cuML: Extract spatial features with cuSpatial, train ML models with cuML
  • RAFT + CuPy: Use RAFT's eigsh() on sparse matrices built with CuPy/cupyx.scipy.sparse
  • RAFT + raft-dask: Scale GPU workloads across multiple GPUs/nodes via Dask

Installation

IMPORTANT: Always use `uv add` for package installation, never `pip install` or `conda install`. This applies to install instructions in code comments, docstrings, error messages, and any other output you generate. If the user's project uses a different package manager, follow their lead, but default to `uv add`.
```bash
# CuPy (choose the right CUDA version)
uv add cupy-cuda12x  # For CUDA 12.x (most common)

# Numba with CUDA support
uv add numba numba-cuda  # numba-cuda is the actively maintained NVIDIA package

# Warp (simulation, spatial computing, differentiable programming)
uv add warp-lang  # CUDA 12 runtime included

# cuDF (RAPIDS)
uv add --extra-index-url=https://pypi.nvidia.com cudf-cu12  # For CUDA 12.x
# For cudf.pandas accelerator mode, that's all you need
# Load it with: python -m cudf.pandas your_script.py

# cuML (RAPIDS machine learning)
uv add --extra-index-url=https://pypi.nvidia.com cuml-cu12  # For CUDA 12.x
# For cuml.accel accelerator mode (zero-change sklearn acceleration):
# Load it with: python -m cuml.accel your_script.py

# cuGraph (RAPIDS graph analytics)
uv add --extra-index-url=https://pypi.nvidia.com cugraph-cu12  # Core cuGraph
uv add --extra-index-url=https://pypi.nvidia.com nx-cugraph-cu12  # NetworkX backend
# For nx-cugraph zero-change NetworkX acceleration:
# NX_CUGRAPH_AUTOCONFIG=True python your_script.py

# KvikIO (high-performance GPU file IO)
uv add kvikio-cu12  # For CUDA 12.x
# Optional: uv add zarr  # For Zarr GPU backend support

# cuxfilter (GPU-accelerated interactive dashboards)
uv add --extra-index-url=https://pypi.nvidia.com cuxfilter-cu12  # For CUDA 12.x
# Depends on cuDF; installs it automatically

# cuCIM (RAPIDS image processing: scikit-image on GPU)
uv add --extra-index-url=https://pypi.nvidia.com cucim-cu12  # For CUDA 12.x

# cuVS (RAPIDS vector search)
uv add --extra-index-url=https://pypi.nvidia.com cuvs-cu12  # For CUDA 12.x

# cuSpatial (RAPIDS geospatial)
uv add --extra-index-url=https://pypi.nvidia.com cuspatial-cu12  # For CUDA 12.x

# RAFT (low-level GPU primitives)
uv add --extra-index-url=https://pypi.nvidia.com pylibraft-cu12  # Core primitives
uv add --extra-index-url=https://pypi.nvidia.com raft-dask-cu12  # Multi-GPU support (optional)
```

To check CUDA availability after installation:

```python
# CuPy
import cupy as cp
print(cp.cuda.runtime.getDeviceCount())  # Should be >= 1

# Numba
from numba import cuda
print(cuda.is_available())  # Should be True
print(cuda.detect())  # Shows GPU details

# cuDF
import cudf
print(cudf.Series([1, 2, 3]))  # Should print a GPU series

# cuML
import cuml
print(cuml.__version__)  # Should print version

# cuGraph
import cugraph
print(cugraph.__version__)  # Should print version

# Warp
import warp as wp
wp.init()  # Should print device info

# KvikIO
import kvikio
import kvikio.cufile_driver
print(kvikio.cufile_driver.get("is_gds_available"))  # True if GDS is set up

# cuxfilter
import cuxfilter
print(cuxfilter.__version__)  # Should print version

# cuVS
from cuvs.neighbors import cagra
import cupy as cp
dataset = cp.random.rand(1000, 128, dtype=cp.float32)
index = cagra.build(cagra.IndexParams(), dataset)
print("cuVS working")  # Should print confirmation

# cuSpatial
import cuspatial
from shapely.geometry import Point
gs = cuspatial.GeoSeries([Point(0, 0)])
print("cuSpatial working")  # Should print confirmation

# RAFT (pylibraft)
from pylibraft.common import DeviceResources
handle = DeviceResources()
handle.sync()
print("pylibraft is working")
```

Optimization Workflow

When helping a user optimize code, follow this process:

1. Profile First

Before optimizing, understand where time is actually spent. Time the suspect sections with Python's `time` module, or use cProfile, line_profiler, or py-spy for detailed profiling.

Don't guess — measure. The bottleneck might not be where the user thinks.
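A small sketch of the two levels of measurement mentioned above, using only the standard library. The function `slow_sum_of_squares` is a hypothetical stand-in for the user's CPU-bound code.

```python
import cProfile
import io
import pstats
import time


def slow_sum_of_squares(n):
    """Stand-in CPU-bound workload (hypothetical example function)."""
    total = 0
    for i in range(n):
        total += i * i
    return total


# 1. Coarse wall-clock timing of one suspect section.
start = time.perf_counter()
result = slow_sum_of_squares(200_000)
elapsed = time.perf_counter() - start
print(f"slow_sum_of_squares: {elapsed:.4f} s")

# 2. Function-level breakdown with cProfile.
profiler = cProfile.Profile()
profiler.enable()
slow_sum_of_squares(200_000)
profiler.disable()

report = io.StringIO()
stats = pstats.Stats(profiler, stream=report)
stats.sort_stats("cumulative").print_stats(5)  # top 5 entries
print(report.getvalue())
```

`time.perf_counter()` is enough to confirm which section dominates; the cProfile report then shows which functions inside that section account for the time.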

2. Assess GPU Suitability

Not all code benefits from GPU acceleration. GPU excels when:
  • Data parallelism is high: The same operation applies to thousands/millions of elements
  • Compute intensity is high: Many FLOPs per byte of memory accessed
  • Data is large enough: GPU overhead means small arrays (< ~10K elements) may be slower on GPU
  • Memory fits: Data must fit in GPU memory (typically 8-80 GB)
GPU is a poor fit when:
  • Data is tiny (< 10K elements)
  • Algorithm is inherently sequential with data dependencies between steps
  • Code is I/O bound (disk, network), not compute bound — though KvikIO with GPUDirect Storage can help when IO feeds GPU compute
  • Many small, heterogeneous operations (kernel launch overhead dominates)
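
The checklist above can be sketched as a small pre-flight function. This is an illustration, not a rule: the `Workload` class, the `gpu_concerns` helper, and the 8 GiB default are assumptions made for this example, and the 10K-element threshold is the rough figure quoted above.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Rough description of a candidate computation (all fields are estimates)."""
    n_elements: int          # number of array elements processed
    bytes_per_element: int   # e.g. 4 for float32, 8 for float64
    io_bound: bool           # True if disk/network dominates the runtime
    sequential: bool         # True if each step depends on the previous one


def gpu_concerns(w, gpu_memory_bytes=8 * 1024**3, min_elements=10_000):
    """Return reasons the workload may be a poor GPU fit (empty list = none found)."""
    concerns = []
    if w.n_elements < min_elements:
        concerns.append("data is tiny; launch and transfer overhead may dominate")
    if w.n_elements * w.bytes_per_element > gpu_memory_bytes:
        concerns.append("data does not fit in GPU memory; consider chunking")
    if w.io_bound:
        concerns.append("I/O bound; faster compute alone will not help")
    if w.sequential:
        concerns.append("inherently sequential; little parallelism to exploit")
    return concerns


small = Workload(n_elements=5_000, bytes_per_element=8, io_bound=False, sequential=False)
large = Workload(n_elements=50_000_000, bytes_per_element=4, io_bound=False, sequential=False)

print(gpu_concerns(small))  # one concern: data is tiny
print(gpu_concerns(large))  # no concerns: 200 MB of float32 fits easily
```

An empty list does not guarantee a speedup; it only means none of the listed red flags were hit, so profiling the GPU version is still needed.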

3. Start Simple, Then Optimize

  1. Try the drop-in replacement first. CuPy for NumPy, cudf.pandas for pandas, cuml.accel for sklearn, nx-cugraph for NetworkX. This alone often gives a 5-50x speedup.
  2. Minimize host-device transfers. Keep data on the GPU. Every transfer across PCIe is expensive (~12 GB/s) compared with GPU memory bandwidth (~900 GB/s+).
  3. Batch operations. Fewer large GPU operations beat many small ones.
  4. Only write custom kernels if needed. CuPy and cuDF use NVIDIA's hand-tuned libraries. Reserve custom Numba kernels for operations that have no library equivalent.
  5. Profile the GPU version. Use nsys (Nsight Systems) or CuPy's built-in benchmarking (cupyx.profiler.benchmark). nvprof also works on older GPUs, but it is deprecated and does not support recent architectures.
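
To make the transfer cost in item 2 concrete, here is a back-of-the-envelope calculation using the two bandwidth figures quoted above. The 1 GB array and the 100-iteration loop are made-up numbers for illustration; real bandwidth depends on the PCIe generation and the GPU.

```python
PCIE_GBPS = 12.0        # host <-> device bandwidth, figure quoted above
GPU_MEM_GBPS = 900.0    # on-device memory bandwidth, figure quoted above


def seconds_to_move(n_bytes, gb_per_s):
    """Time to stream n_bytes at the given bandwidth (1 GB = 1e9 bytes)."""
    return n_bytes / (gb_per_s * 1e9)


array_bytes = 250_000_000 * 4  # 250M float32 values = 1 GB

one_way = seconds_to_move(array_bytes, PCIE_GBPS)
on_device = seconds_to_move(array_bytes, GPU_MEM_GBPS)

# A loop that copies the array to the GPU and back on each of 100 iterations
# pays the PCIe cost 200 times; keeping the data resident pays it twice.
round_trips_every_step = 100 * 2 * one_way
resident_on_gpu = 2 * one_way

print(f"one PCIe copy:        {one_way * 1e3:.1f} ms")
print(f"one pass on device:   {on_device * 1e3:.2f} ms")
print(f"copy every iteration: {round_trips_every_step:.1f} s")
print(f"copy once each way:   {resident_on_gpu:.2f} s")
```

With these figures a single PCIe copy costs about as much as 75 passes over the same data in GPU memory, which is why a port that round-trips through the host on every step can end up slower than the CPU original.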

4. Memory Management Principles

These apply across all libraries:
  • Pre-allocate output arrays instead of creating new ones in loops
  • Reuse GPU memory — use memory pools (CuPy has this built-in)
  • Use pinned (page-locked) host memory for faster CPU-GPU transfers
  • Avoid unnecessary copies — use in-place operations where possible
  • Use CUDA streams to overlap compute with data transfer
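
A minimal sketch of the first and fourth principles (pre-allocate, operate in place). It assumes either CuPy or NumPy is installed and falls back to NumPy when no CUDA device is usable, so the same lines run on CPU; the array size and loop are arbitrary.

```python
try:
    import cupy as xp
    xp.cuda.runtime.getDeviceCount()  # raises if no usable CUDA device
except Exception:
    import numpy as xp  # CPU fallback so the sketch still runs

n_steps, n = 5, 1_000
signal = xp.linspace(0.0, 1.0, n, dtype=xp.float32)

# Allocate the scratch and output buffers once, outside the loop.
scratch = xp.empty_like(signal)
total = xp.zeros_like(signal)

for step in range(n_steps):
    # out= writes into the existing buffers instead of allocating new arrays.
    xp.multiply(signal, xp.float32(step), out=scratch)
    xp.add(total, scratch, out=total)  # in-place accumulate

# total now holds signal * (0 + 1 + 2 + 3 + 4)
```

Writing `total = total + signal * step` inside the loop would allocate two temporaries per iteration. CuPy's memory pool softens that cost, but reusing buffers avoids it altogether.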

5. Common Pitfalls to Watch For

  • Implicit CPU fallback: Some operations silently fall back to CPU. Watch for warnings.
  • Synchronization overhead: GPU operations are asynchronous. Calling .get() or cp.asnumpy() forces a sync, so time GPU code only after an explicit synchronization; otherwise the measurement covers the launch, not the execution.
  • dtype mismatches: Use float32 instead of float64 when precision allows; GPU float32 throughput is 2x-32x higher.
  • Small kernel launches: Each kernel launch has ~5-20 µs of overhead. Fuse operations when possible.
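
The launch-overhead pitfall is easy to underestimate, so here is the arithmetic spelled out. The 10 µs figure is an assumed midpoint of the range quoted above, and the row count is invented for illustration.

```python
LAUNCH_OVERHEAD_S = 10e-6   # assumed midpoint of the ~5-20 µs range above


def launch_overhead_seconds(n_launches, per_launch=LAUNCH_OVERHEAD_S):
    """Fixed cost paid for launching kernels, independent of the work they do."""
    return n_launches * per_launch


# Unfused: y = a * x + b written as two elementwise kernels (multiply, then add),
# called once per row for 100,000 rows.
unfused = launch_overhead_seconds(2 * 100_000)

# Fused and batched: one kernel computes a * x + b for the whole 2-D array.
fused = launch_overhead_seconds(1)

print(f"unfused, per row: {unfused:.2f} s of pure launch overhead")
print(f"fused, batched:   {fused * 1e6:.0f} µs of launch overhead")
```

Under these assumptions the per-row version spends about two seconds launching kernels before doing any useful work. In CuPy, operating on whole arrays and using `cupy.fuse` or an `ElementwiseKernel` are the usual ways to collapse such chains.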

Code Transformation Patterns

When converting existing CPU code, apply these patterns:

NumPy to CuPy

```python
# Before (CPU)
import numpy as np

a = np.random.rand(10_000_000)
b = np.fft.fft(a)
c = np.sort(b.real)

# After (GPU): often just change the import
import cupy as cp

a = cp.random.rand(10_000_000)
b = cp.fft.fft(a)
c = cp.sort(b.real)
```

pandas to cuDF

```python
# Before (CPU)
import pandas as pd

df = pd.read_parquet("large_data.parquet")
result = df.groupby("category")["value"].mean()

# After (GPU): change the import
import cudf

df = cudf.read_parquet("large_data.parquet")
result = df.groupby("category")["value"].mean()

# Or zero-code-change: python -m cudf.pandas your_script.py
```

Custom loop to Numba CUDA kernel

```python
# Before (CPU): slow Python loop
import math

def process(data, out):
    for i in range(len(data)):
        out[i] = math.sin(data[i]) * math.exp(-data[i])

# After (GPU): Numba kernel
from numba import cuda
import math

@cuda.jit
def process(data, out):
    i = cuda.grid(1)
    if i < data.size:  # guard: the grid may have more threads than elements
        out[i] = math.sin(data[i]) * math.exp(-data[i])

# data is assumed to be a host NumPy array; copy it to the device once
d_data = cuda.to_device(data)
d_out = cuda.device_array_like(d_data)

threads = 256
blocks = (len(data) + threads - 1) // threads  # ceiling division
process[blocks, threads](d_data, d_out)
out = d_out.copy_to_host()
```
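
The launch configuration in the kernel example relies on two details that are easy to get wrong: the ceiling division for the block count and the `i < data.size` guard. The following CPU-only sketch simulates the global thread indices to show why both are needed; `launch_config` and `simulate_kernel` are illustrative helpers, not Numba functions.

```python
def launch_config(n_elements, threads_per_block=256):
    """Return (blocks, threads) with enough threads to cover n_elements."""
    blocks = (n_elements + threads_per_block - 1) // threads_per_block  # ceiling division
    return blocks, threads_per_block


def simulate_kernel(n_elements, blocks, threads):
    """CPU stand-in for the launch: visit every global thread index, apply the guard."""
    written = [False] * n_elements
    skipped = 0
    for i in range(blocks * threads):  # i plays the role of cuda.grid(1)
        if i < n_elements:
            written[i] = True
        else:
            skipped += 1  # surplus thread: must not touch memory
    return written, skipped


blocks, threads = launch_config(1000)
written, skipped = simulate_kernel(1000, blocks, threads)
print(blocks, threads, skipped)  # 4 256 24
```

For 1000 elements the grid has 1024 threads, so 24 of them fall past the end of the array. Without the guard those threads would write out of bounds; with floor division instead of ceiling division the last 232 elements would never be processed.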

NetworkX to cuGraph

```python
# Before (CPU)
import networkx as nx

G = nx.read_edgelist("edges.csv", delimiter=",", nodetype=int)
pr = nx.pagerank(G)
bc = nx.betweenness_centrality(G)

# After (GPU): direct cuGraph API
import cugraph
import cudf

edges = cudf.read_csv("edges.csv", names=["src", "dst"], dtype=["int32", "int32"])
G = cugraph.Graph()
G.from_cudf_edgelist(edges, source="src", destination="dst")
pr = cugraph.pagerank(G)
bc = cugraph.betweenness_centrality(G)

# Or zero-code-change: NX_CUGRAPH_AUTOCONFIG=True python your_script.py
```

scikit-learn to cuML

```python
# Before (CPU)
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)

# After (GPU): change the imports
from cuml.ensemble import RandomForestClassifier
from cuml.preprocessing import StandardScaler
from cuml.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)

# Or zero-code-change: python -m cuml.accel your_script.py
```

Simulation loop to Warp kernel

```python
# Before (CPU): slow Python loop over particles
import numpy as np

def integrate(positions, velocities, forces, dt):
    for i in range(len(positions)):
        velocities[i] += forces[i] * dt
        positions[i] += velocities[i] * dt

# After (GPU): Warp kernel, JIT-compiled to CUDA
import warp as wp

@wp.kernel
def integrate(positions: wp.array(dtype=wp.vec3),
              velocities: wp.array(dtype=wp.vec3),
              forces: wp.array(dtype=wp.vec3),
              dt: float):
    tid = wp.tid()
    velocities[tid] = velocities[tid] + forces[tid] * dt
    positions[tid] = positions[tid] + velocities[tid] * dt

wp.launch(integrate, dim=num_particles,
          inputs=[positions, velocities, forces, 0.01], device="cuda")
```

File IO to GPU with KvikIO

```python
# Before: CPU staging (disk → CPU → GPU)
import numpy as np
import cupy as cp

data = np.fromfile("data.bin", dtype=np.float32)
gpu_data = cp.asarray(data)  # Extra copy through CPU memory

# After: direct to GPU (disk → GPU via GDS)
import cupy as cp
import kvikio

gpu_data = cp.empty(1_000_000, dtype=cp.float32)
with kvikio.CuFile("data.bin", "r") as f:
    f.read(gpu_data)  # Bypasses CPU memory with GPUDirect Storage

# Reading from S3 directly to GPU
with kvikio.RemoteFile.open_s3_url("s3://bucket/data.bin") as f:
    buf = cp.empty(f.nbytes() // 4, dtype=cp.float32)
    f.read(buf)
```
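
The local-file example sizes its buffer with a hard-coded element count, which only works if the file happens to hold exactly that many values. A safer pattern, sketched here with the standard library only, derives the count from the file size. The temporary file and its five values are made up so the sketch is self-contained; the CuPy and KvikIO lines are left as comments because they need a GPU.

```python
import array
import os
import tempfile

ITEMSIZE = 4  # bytes per float32 value

# Write a small float32 file so the sketch is self-contained.
values = array.array("f", [0.5, 1.5, 2.5, 3.5, 4.5])
with tempfile.NamedTemporaryFile(suffix=".bin", delete=False) as tmp:
    values.tofile(tmp)
    path = tmp.name

n_bytes = os.path.getsize(path)
if n_bytes % ITEMSIZE != 0:
    raise ValueError(f"{path}: size {n_bytes} is not a multiple of {ITEMSIZE}")
n_elements = n_bytes // ITEMSIZE

# With CuPy and KvikIO installed, the buffer would then be sized from the file:
#   gpu_data = cp.empty(n_elements, dtype=cp.float32)
#   with kvikio.CuFile(path, "r") as f:
#       f.read(gpu_data)
print(n_elements)  # 5
os.remove(path)
```

This mirrors what the S3 example already does with `f.nbytes() // 4`.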

GPU-accelerated dashboard with cuxfilter

```python
# Before: static matplotlib/seaborn plots, no interactivity
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_parquet("large_dataset.parquet")
fig, axes = plt.subplots(1, 2)
df.plot.scatter(x="feature1", y="feature2", ax=axes[0])
df["category"].value_counts().plot.bar(ax=axes[1])
plt.show()

# After (GPU): interactive cross-filtering dashboard
import cudf
import cuxfilter

df = cudf.read_parquet("large_dataset.parquet")
cux_df = cuxfilter.DataFrame.from_dataframe(df)

scatter = cuxfilter.charts.scatter(x="feature1", y="feature2", pixel_shade_type="linear")
bar = cuxfilter.charts.bar("category")
slider = cuxfilter.charts.range_slider("value_col")

d = cux_df.dashboard(
    [scatter, bar],
    sidebar=[slider],
    layout=cuxfilter.layouts.feature_and_base,
    theme=cuxfilter.themes.rapids_dark,
    title="Interactive Explorer",
)
d.app()  # or d.show() for standalone web app
```

scikit-image to cuCIM

```python
# Before (CPU)
from skimage.filters import gaussian, sobel, threshold_otsu
from skimage.morphology import binary_opening, disk
from skimage.measure import label, regionprops_table
import numpy as np

blurred = gaussian(image, sigma=3)
binary = blurred > threshold_otsu(blurred)
cleaned = binary_opening(binary, footprint=disk(3))
labels = label(cleaned)
props = regionprops_table(labels, image, properties=['area', 'centroid'])

# After (GPU): change imports, wrap input with cp.asarray
from cucim.skimage.filters import gaussian, sobel, threshold_otsu
from cucim.skimage.morphology import binary_opening, disk
from cucim.skimage.measure import label, regionprops_table
import cupy as cp

image_gpu = cp.asarray(image)  # Transfer once
blurred = gaussian(image_gpu, sigma=3)
binary = blurred > threshold_otsu(blurred)
cleaned = binary_opening(binary, footprint=disk(3))
labels = label(cleaned)
props = regionprops_table(labels, image_gpu, properties=['area', 'centroid'])
```

GeoPandas to cuSpatial

```python
# Before (CPU)
import geopandas as gpd
from shapely.geometry import Point

points = gpd.GeoDataFrame(geometry=[Point(x, y) for x, y in coords], crs="EPSG:4326")
polygons = gpd.read_file("regions.geojson")
joined = gpd.sjoin(points, polygons, predicate="within")

# After (GPU): convert and use cuSpatial
import cuspatial
import cudf

points_cu = cuspatial.from_geopandas(points)
polygons_cu = cuspatial.from_geopandas(polygons)
# Recent cuSpatial releases take two GeoSeries (points, polygons); older releases
# took coordinate and offset columns instead, so check the signature for your version.
# The result is a boolean table (one row per point, one column per polygon),
# not a joined GeoDataFrame like the sjoin output above.
joined = cuspatial.point_in_polygon(points_cu.geometry, polygons_cu.geometry)
```
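
When validating a GPU point-in-polygon result on a handful of points, a tiny CPU reference is often enough. The sketch below is a plain ray-casting test in pure Python, with a made-up square and three made-up points. It is an independent reference for spot checks, not cuSpatial's implementation, and it makes no promise about points that lie exactly on an edge.

```python
def point_in_polygon(x, y, ring):
    """Ray casting: count how many polygon edges a rightward ray from (x, y) crosses."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge meets that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


square = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
points = [(2.0, 2.0), (5.0, 2.0), (-1.0, 3.0)]
flags = [point_in_polygon(px, py, square) for px, py in points]
print(flags)  # [True, False, False]
```

An odd number of crossings means the point is inside, which is why each crossing toggles `inside`.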

Faiss/Annoy to cuVS

```python
# Before (CPU): Faiss (IndexFlatL2 is an exact brute-force search)
import faiss
import numpy as np

embeddings = np.random.rand(1_000_000, 128).astype(np.float32)
queries = np.random.rand(100, 128).astype(np.float32)  # illustrative query batch
index = faiss.IndexFlatL2(128)
index.add(embeddings)
distances, neighbors = index.search(queries, k=10)

# After (GPU): cuVS CAGRA (approximate search; can be orders of magnitude faster)
import cupy as cp
from cuvs.neighbors import cagra

embeddings = cp.random.rand(1_000_000, 128, dtype=cp.float32)
queries = cp.random.rand(100, 128, dtype=cp.float32)  # illustrative query batch
index = cagra.build(cagra.IndexParams(), embeddings)
distances, neighbors = cagra.search(cagra.SearchParams(), index, queries, k=10)
```
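
One caveat when reading this comparison: `IndexFlatL2` is an exact brute-force search, while CAGRA is a graph-based approximate index, so the speedup comes with a recall trade-off. A minimal way to quantify it is recall@k against the exact neighbors. The helper below is a pure-Python sketch; `recall_at_k` is a hypothetical name and the neighbor IDs are toy values, not real search output.

```python
def recall_at_k(exact_ids, approx_ids):
    """Mean fraction of each query's exact neighbors that the approximate search returned."""
    if len(exact_ids) != len(approx_ids):
        raise ValueError("both results must cover the same queries")
    per_query = [
        len(set(exact) & set(approx)) / len(exact)
        for exact, approx in zip(exact_ids, approx_ids)
    ]
    return sum(per_query) / len(per_query)


# Toy neighbor IDs for 2 queries with k=4 (made-up values).
exact = [[3, 7, 9, 12], [1, 4, 5, 8]]
approx = [[3, 7, 12, 20], [1, 4, 5, 8]]

print(recall_at_k(exact, approx))  # 0.875
```

In practice the two inputs would be the `neighbors` arrays from the exact and approximate searches, copied to the host and converted to lists.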

scipy.sparse.linalg to RAFT

```python
# Before (CPU)
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import eigsh

A = sparse_random(10000, 10000, density=0.01, format="csr", dtype=np.float32)
A = A + A.T  # Make symmetric
eigenvalues, eigenvectors = eigsh(A, k=10, which="LM")

# After (GPU): RAFT sparse eigensolver
import cupy as cp
import cupyx.scipy.sparse as sp_gpu
from pylibraft.sparse.linalg import eigsh as gpu_eigsh

A_gpu = sp_gpu.csr_matrix(A)  # Transfer to GPU
eigenvalues, eigenvectors = gpu_eigsh(A_gpu, k=10, which="LM")
```

Important Notes

  • Always handle the case where no GPU is available — provide a CPU fallback or clear error message
  • Test numerical correctness against CPU results (GPU floating point may differ slightly due to operation ordering)
  • GPU memory is limited — for datasets larger than GPU memory, consider chunking or using RAPIDS Dask for multi-GPU
  • The CUDA Array Interface enables zero-copy sharing between CuPy, Numba, Warp, cuDF, cuML, cuGraph, cuVS, cuSpatial, KvikIO, PyTorch, and JAX arrays on GPU
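
The first note (handle the no-GPU case) can be sketched as a small backend selector. `pick_array_module` is a hypothetical helper written for this example. It treats CuPy as usable only if a CUDA device is actually visible, because CuPy can be importable on a machine with no working GPU.

```python
import importlib


def pick_array_module():
    """Return (module, name): CuPy when a CUDA device is usable, else NumPy, else (None, "none")."""
    try:
        cp = importlib.import_module("cupy")
        if cp.cuda.runtime.getDeviceCount() > 0:
            return cp, "cupy"
    except Exception:
        # CuPy missing, driver missing, or no visible device: fall through to CPU.
        pass
    try:
        return importlib.import_module("numpy"), "numpy"
    except ImportError:
        return None, "none"


xp, backend = pick_array_module()
if xp is None:
    print("Neither CuPy nor NumPy is installed; install one to run array code.")
else:
    print(f"array backend: {backend}")
```

Code written against `xp` then runs on either backend. Comparing the two results with a tolerance such as `allclose`, rather than exact equality, covers the second note about floating-point differences.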

Reference Files

Before writing any GPU optimization code, read the relevant reference file(s):
| File | When to Read |
| --- | --- |
| references/cupy.md | User has NumPy/SciPy code, or needs array operations on GPU |
| references/numba.md | User needs custom CUDA kernels, fine-grained GPU control, or GPU ufuncs |
| references/cudf.md | User has pandas code, or needs dataframe operations on GPU |
| references/cuml.md | User has scikit-learn code, or needs ML training/inference/preprocessing on GPU |
| references/cugraph.md | User has NetworkX code, or needs graph analytics on GPU |
| references/warp.md | User needs GPU simulation, spatial computing, mesh/volume queries, differentiable programming, or robotics |
| references/kvikio.md | User needs high-performance file IO to/from GPU, GPUDirect Storage, reading S3/HTTP to GPU, or Zarr on GPU |
| references/cuxfilter.md | User wants GPU-accelerated interactive dashboards, cross-filtering, or EDA visualization |
| references/cucim.md | User has scikit-image code, or needs image processing, digital pathology, or WSI reading on GPU |
| references/cuvs.md | User needs vector search, nearest neighbors, similarity search, or RAG retrieval on GPU |
| references/cuspatial.md | User has GeoPandas/shapely code, or needs spatial joins, distance calculations, or trajectory analysis on GPU |
| references/raft.md | User needs sparse eigensolvers, device memory management, or multi-GPU primitives |

Read the specific reference before writing code; each one contains detailed API patterns, optimization techniques, and pitfalls specific to its library.
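
As an illustration of how the table might be applied, the sketch below scans a piece of user code for imports and returns the matching reference files. The mapping is a partial, hand-written subset of the table (the `faiss` entry follows the vector-search row), and `references_for` is a hypothetical helper, not part of this skill.

```python
import ast

# Import name -> reference file, a partial mapping based on the table above.
REFERENCE_FOR_IMPORT = {
    "numpy": "references/cupy.md",
    "scipy": "references/cupy.md",
    "pandas": "references/cudf.md",
    "sklearn": "references/cuml.md",
    "networkx": "references/cugraph.md",
    "skimage": "references/cucim.md",
    "geopandas": "references/cuspatial.md",
    "shapely": "references/cuspatial.md",
    "faiss": "references/cuvs.md",
}


def references_for(source):
    """Return the sorted reference files matching the imports found in source."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        else:
            continue
        for name in names:
            root = name.split(".")[0]  # "sklearn.ensemble" -> "sklearn"
            if root in REFERENCE_FOR_IMPORT:
                found.add(REFERENCE_FOR_IMPORT[root])
    return sorted(found)


user_code = """
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
"""
print(references_for(user_code))
```

Imports are only a first hint. Needs such as custom kernels or file IO do not show up as a CPU library import and still have to be matched against the table by hand.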