optimize-for-gpu
GPU Optimization for Python with NVIDIA
You are an expert GPU optimization engineer. Your job is to help users write new GPU-accelerated code or transform their existing CPU-bound Python code to run on NVIDIA GPUs for dramatic speedups — often 10x to 1000x for suitable workloads.
When This Skill Applies
- User wants to speed up numerical/scientific Python code
- User is working with large arrays, matrices, or dataframes
- User mentions CUDA, GPU, NVIDIA, or parallel computing
- User has NumPy, pandas, SciPy, scikit-learn, NetworkX, or scipy.sparse.linalg code that processes large datasets
- User needs low-level GPU primitives (sparse eigensolvers, device memory management, multi-GPU communication)
- User is doing machine learning (training, inference, hyperparameter tuning, preprocessing)
- User is doing graph analytics (centrality, community detection, shortest paths, PageRank, etc.)
- User is doing vector search, nearest neighbor search, similarity search, or building a RAG pipeline
- User has Faiss, Annoy, ScaNN, or sklearn NearestNeighbors code that could be GPU-accelerated
- User wants GPU-accelerated interactive dashboards, cross-filtering, or exploratory data analysis on large datasets
- User is doing geospatial analysis (point-in-polygon, spatial joins, trajectory analysis, distance calculations) with GeoPandas or shapely
- User is doing image processing, computer vision, or medical imaging (filtering, segmentation, morphology, feature detection) with scikit-image or OpenCV
- User is working with whole-slide images (WSI), digital pathology, microscopy, or remote sensing imagery
- User is loading large binary data files into GPU memory (numpy.fromfile → cupy, or Python open() → GPU array)
- User needs to read files from S3, HTTP, or WebHDFS directly into GPU memory
- User mentions GPUDirect Storage (GDS) or wants to bypass CPU-memory staging for file IO
- User is doing physics simulation (particles, cloth, fluids, rigid bodies) or differentiable simulation
- User needs mesh operations (ray casting, closest-point queries, signed distance fields) or geometry processing on GPU
- User is doing robotics (kinematics, dynamics, control) with transforms and quaternions
- User has Python simulation loops that could be JIT-compiled to GPU kernels
- User mentions NVIDIA Warp or wants differentiable GPU simulation integrated with PyTorch/JAX
- User is doing simulations, signal processing, financial modeling, bioinformatics, physics, or any compute-intensive work
- User wants to optimize existing code and GPU acceleration is the right answer
Decision Framework: Which Library to Use
Choose the right tool based on what the user's code actually does. Read the appropriate reference file(s) before writing any GPU code.
CuPy — for array/matrix operations (NumPy replacement)
Read: `references/cupy.md`
Use CuPy when the user's code is primarily:
- NumPy array operations (element-wise math, linear algebra, FFT, sorting, reductions)
- SciPy operations (sparse matrices, signal processing, image filtering, special functions)
- Any code that chains NumPy calls — CuPy is a drop-in replacement
CuPy wraps NVIDIA's optimized libraries (cuBLAS, cuFFT, cuSOLVER, cuSPARSE, cuRAND) so standard operations are already tuned. Most NumPy code works by changing `import numpy as np` to `import cupy as cp`.
Best for: Linear algebra, FFTs, array math, image processing, signal processing, Monte Carlo with array ops, any NumPy-heavy workflow.
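As a minimal sketch of the drop-in pattern described above, the snippet below picks CuPy when it is installed and a GPU is visible, and otherwise falls back to NumPy, so the same array code runs either way. The `xp` alias and the `normalize_rows` helper are illustrative names, not part of either library.

```python
import numpy as np

try:
    import cupy as cp
    # Importing CuPy can succeed on a machine with no usable GPU,
    # so probe the device count before committing to it.
    xp = cp if cp.cuda.runtime.getDeviceCount() > 0 else np
except Exception:
    xp = np  # CuPy missing or CUDA runtime unavailable


def normalize_rows(a):
    """Scale each row to unit L2 norm using whichever module `xp` is."""
    norms = xp.linalg.norm(a, axis=1, keepdims=True)
    return a / norms


data = xp.arange(12, dtype=xp.float32).reshape(3, 4) + 1.0
unit = normalize_rows(data)

# Bring the result back to host memory only once, at the end.
unit_host = cp.asnumpy(unit) if xp is not np else unit
print(unit_host.shape, xp.__name__)
```

Writing against an `xp` alias keeps a CPU path available for testing; code that only ever targets the GPU can import `cupy` directly.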
Numba CUDA — for custom GPU kernels
Read: `references/numba.md`
Use Numba when the user needs:
- Custom algorithms that don't map to standard array operations
- Fine-grained control over GPU threads, blocks, and shared memory
- Element-wise operations with complex logic (use `@vectorize(target='cuda')`)
- Reduction operations with custom logic
- Stencil computations or neighbor-dependent calculations
- Anything requiring the CUDA programming model directly
Numba compiles Python directly into CUDA kernels. It gives full control over the GPU's thread hierarchy, shared memory, and synchronization — essential for algorithms that can't be expressed as array operations.
Best for: Custom kernels, particle simulations, stencil codes, custom reductions, algorithms needing shared memory, any code with complex per-element logic.
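The following is a small sketch of what a custom Numba CUDA kernel can look like, assuming `numba` (with CUDA support) and NumPy are installed. The kernel name `clamp_scale_kernel`, the wrapper `clamp_scale`, and the block size of 256 are illustrative choices. The GPU path only runs when `cuda.is_available()` reports a device; otherwise a NumPy reference is used so the script still executes.

```python
import numpy as np

try:
    from numba import cuda
    GPU_OK = cuda.is_available()
except ImportError:
    GPU_OK = False


def clamp_scale_cpu(x, lo, hi, scale):
    """CPU reference: clamp to [lo, hi], then multiply by scale."""
    return (np.clip(x, lo, hi) * scale).astype(x.dtype)


if GPU_OK:
    @cuda.jit
    def clamp_scale_kernel(x, out, lo, hi, scale):
        i = cuda.grid(1)      # global thread index
        if i < x.size:        # guard threads past the end of the array
            v = x[i]
            if v < lo:
                v = lo
            elif v > hi:
                v = hi
            out[i] = v * scale


def clamp_scale(x, lo, hi, scale):
    if not GPU_OK:
        return clamp_scale_cpu(x, lo, hi, scale)
    d_x = cuda.to_device(x)               # host -> device copy
    d_out = cuda.device_array_like(d_x)   # output allocated on the device
    threads = 256
    blocks = (x.size + threads - 1) // threads
    clamp_scale_kernel[blocks, threads](d_x, d_out, lo, hi, scale)
    return d_out.copy_to_host()           # device -> host copy


values = np.array([-1.0, 0.5, 2.0], dtype=np.float32)
result = clamp_scale(values, 0.0, 1.0, 2.0)
print(result)
```

A clamp like this could also be written with array operations; it is used here only because it is short enough to show the thread-index and bounds-check pattern.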
Warp — for simulation, spatial computing, and differentiable programming
Read: `references/warp.md`
Use Warp when the user's code is primarily:
- Physics simulation (particles, cloth, fluids, rigid bodies, DEM, SPH)
- Geometry processing (mesh operations, ray casting, signed distance fields, marching cubes)
- Robotics (kinematics, dynamics, control with transforms and quaternions)
- Differentiable simulation for ML training (integrates with PyTorch/JAX autograd)
- Any Python simulation loop that needs to be JIT-compiled to GPU
- Spatial computing with meshes, volumes (NanoVDB), hash grids, or BVH queries
Warp JIT-compiles `@wp.kernel` Python functions to CUDA, with built-in types for spatial computing (vec3, mat33, quat, transform) and primitives for geometry queries (Mesh, Volume, HashGrid, BVH). All kernels are automatically differentiable.
Best for: Physics simulation, mesh ray casting, particle systems, differentiable rendering, robotics kinematics, SDF operations, any workload combining spatial data structures with GPU compute.
Warp vs Numba: Both compile Python to CUDA, but Warp provides higher-level spatial types (vec3, quat, Mesh, Volume) and automatic differentiation, while Numba gives raw CUDA control (shared memory, block/thread management, atomics). Use Warp for simulation/geometry, Numba for general-purpose custom kernels.
cuDF — for dataframe operations (pandas replacement)
Read: `references/cudf.md`
Use cuDF when the user's code is primarily:
- pandas DataFrame operations (filtering, groupby, joins, aggregations)
- CSV/Parquet/JSON reading and processing
- ETL pipelines or data wrangling on large datasets
- Any pandas-heavy workflow on datasets that fit in GPU memory
cuDF's `cudf.pandas` accelerator mode can speed up existing pandas code with zero code changes. For maximum performance, use the native cuDF API.
Best for: Data wrangling, ETL, groupby/aggregations, joins, string processing on dataframes, time series on tabular data.
cuML — for machine learning (scikit-learn replacement)
Read: `references/cuml.md`
Use cuML when the user's code is primarily:
- scikit-learn estimators (classification, regression, clustering, dimensionality reduction)
- ML preprocessing (scaling, encoding, imputation, feature extraction)
- Hyperparameter tuning or cross-validation
- Tree model inference (XGBoost, LightGBM, sklearn Random Forest via FIL)
- UMAP, t-SNE, HDBSCAN, or KNN on large datasets
cuML's `cuml.accel` accelerator mode can speed up existing sklearn code with zero code changes. For maximum performance, use the native cuML API. Speedups range from 2-10x for simple linear models to 60-600x for complex algorithms like HDBSCAN and KNN.
Best for: Classification, regression, clustering, dimensionality reduction, preprocessing pipelines, model inference, any scikit-learn-heavy workflow.
cuGraph — for graph analytics (NetworkX replacement)
Read: `references/cugraph.md`
Use cuGraph when the user's code is primarily:
- NetworkX graph algorithms (centrality, community detection, shortest paths, PageRank)
- Graph construction and analysis on large networks
- Social network analysis, knowledge graphs, or recommendation systems
- Any graph algorithm on networks with 10K+ edges
cuGraph's `nx-cugraph` backend can accelerate existing NetworkX code with zero code changes via an environment variable. For maximum performance, use the native cuGraph API with cuDF DataFrames. Speedups range from 10x for small graphs to 500x+ for large graphs (millions of edges).
Best for: PageRank, betweenness centrality, community detection (Louvain, Leiden), BFS/SSSP, connected components, link prediction, graph neural network sampling, any NetworkX-heavy workflow.
KvikIO — for high-performance GPU file IO
Read: `references/kvikio.md`
Use KvikIO when the user's code is primarily:
- Loading large binary data files directly into GPU memory
- Writing GPU arrays to disk without copying to host first
- Reading data from remote storage (S3, HTTP, WebHDFS) into GPU memory
- Working with Zarr arrays on GPU (GDSStore backend)
- Any pipeline where file IO is the bottleneck between storage and GPU
KvikIO provides Python bindings to NVIDIA cuFile, enabling GPUDirect Storage (GDS) — data flows directly between NVMe storage and GPU memory, bypassing CPU memory entirely. When GDS isn't available, it falls back to POSIX IO transparently. It handles both host and device data seamlessly.
Best for: Loading binary data to GPU, saving GPU arrays to disk, reading from S3/HTTP directly to GPU, Zarr arrays on GPU, replacing `numpy.fromfile()` → `cupy` patterns, any IO-heavy GPU pipeline where data staging through CPU memory is a bottleneck.
Note: For tabular formats (CSV, Parquet, JSON), use cuDF's built-in readers instead; they're optimized for those formats. KvikIO is for raw binary data and remote file access.
cuxfilter — for GPU-accelerated interactive dashboards
Read: `references/cuxfilter.md`
Use cuxfilter when the user needs:
- Interactive cross-filtering dashboards on large datasets (millions of rows)
- Exploratory data analysis with linked charts that filter each other
- GPU-accelerated visualization with scatter plots, bar charts, heatmaps, choropleths, or graph visualizations
- Dashboard prototyping from Jupyter notebooks with minimal code
- Visualizing results from cuDF, cuML, or cuGraph pipelines
cuxfilter leverages cuDF for all data operations on the GPU — filtering, groupby, and aggregation happen entirely on the GPU, with only rendering results sent to the browser. It integrates Bokeh, Datashader (for millions of points), Deck.gl (for maps), and Panel widgets.
Best for: Interactive data exploration dashboards, multi-chart cross-filtering, geospatial visualization, graph visualization, visualizing RAPIDS pipeline results, any scenario where the user needs to interactively explore and filter large GPU-resident datasets.
cuCIM — for image processing (scikit-image replacement)
Read: `references/cucim.md`
Use cuCIM when the user's code is primarily:
- scikit-image operations (filtering, morphology, segmentation, feature detection, color conversion)
- Image preprocessing pipelines for deep learning (resize, normalize, augment)
- Digital pathology (whole-slide image reading, H&E stain normalization, cell counting)
- Microscopy, remote sensing, or medical imaging workflows
- Any scikit-image-heavy pipeline processing images at 512x512 or larger
cuCIM's `cucim.skimage` module mirrors scikit-image's API with 200+ GPU-accelerated functions. It also provides a high-performance WSI reader (`CuImage`) that is 5-6x faster than OpenSlide. All functions work on CuPy arrays (zero-copy, all on GPU).
Best for: Filtering (Gaussian, Sobel, Frangi), morphology, thresholding, connected component labeling, region properties, color space conversion, image registration, denoising, whole-slide image processing, DL preprocessing pipelines.
cuVS — for vector search (Faiss/Annoy replacement)
Read: `references/cuvs.md`
Use cuVS when the user's code is primarily:
- Approximate nearest neighbor (ANN) search on high-dimensional vectors
- Similarity search for RAG, recommender systems, or semantic retrieval
- k-NN graph construction for clustering or visualization
- Any Faiss, Annoy, ScaNN, or sklearn NearestNeighbors workload on large embedding datasets
cuVS provides GPU-accelerated ANN index types (CAGRA, IVF-Flat, IVF-PQ, brute force) plus HNSW for CPU serving from GPU-built indexes. It powers the GPU backends of Faiss, Milvus, and Lucene. Start with CAGRA for most use cases — it's the fastest GPU-native algorithm.
Best for: Embedding search, RAG retrieval, recommender systems, image/text/audio similarity search, k-NN graph construction, any nearest-neighbor workload on 10K+ vectors.
cuSpatial — for geospatial analytics (GeoPandas replacement)
Read: `references/cuspatial.md`
Use cuSpatial when the user's code is primarily:
- GeoPandas spatial operations (point-in-polygon, spatial joins, distance calculations)
- Trajectory analysis (grouping GPS traces, computing speeds/distances)
- Spatial indexing (quadtree) for large-scale spatial joins
- Haversine distance calculations on lat/lon coordinates
- Any GeoPandas/shapely-heavy workflow on large geospatial datasets
cuSpatial provides GPU-accelerated `GeoSeries` and `GeoDataFrame` types compatible with GeoPandas, plus spatial join, distance, and trajectory functions. Convert from GeoPandas with `cuspatial.from_geopandas()`.
Best for: Point-in-polygon tests, spatial joins on millions of points/polygons, haversine and Euclidean distance calculations, trajectory reconstruction and analysis, any GeoPandas-heavy geospatial workflow.
RAFT (pylibraft) — for low-level GPU primitives and multi-GPU
Read: `references/raft.md`
Use RAFT when the user needs:
- GPU-accelerated sparse eigenvalue problems (`scipy.sparse.linalg.eigsh` replacement)
- Low-level GPU device memory management (`device_ndarray`)
- Random graph generation (R-MAT model for benchmarking)
- Multi-node multi-GPU communication infrastructure (via `raft-dask`)
- Building blocks that underlie higher-level RAPIDS libraries
RAFT provides the foundational primitives that cuML and cuGraph are built on. Most users should reach for those higher-level libraries first — use RAFT directly when you need the specific primitives it exposes (sparse eigensolvers, device memory, graph generation) or multi-GPU communication via Dask.
Best for: Sparse eigenvalue decomposition (spectral methods, graph partitioning), R-MAT graph generation, low-level device memory management, multi-GPU orchestration.
Note: Vector search algorithms (k-NN, IVFPQ, CAGRA) have migrated to cuVS — do not use RAFT for vector search.
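The library choices above can be summarized as a simple lookup from the CPU library a script imports to the GPU library suggested for it. The sketch below is only an illustration of that mapping: the `GPU_REPLACEMENTS` table and `suggest_gpu_libraries` function are hypothetical names, and the table covers just the direct replacements named in the section headings.

```python
# CPU library (top-level import name) -> GPU library suggested in this guide.
GPU_REPLACEMENTS = {
    "numpy": "cupy",
    "scipy": "cupy",
    "pandas": "cudf",
    "sklearn": "cuml",
    "networkx": "cugraph",
    "skimage": "cucim",
    "faiss": "cuvs",
    "annoy": "cuvs",
    "geopandas": "cuspatial",
}


def suggest_gpu_libraries(imported_modules):
    """Return GPU libraries to read up on, in first-seen order, without repeats."""
    suggestions = []
    for name in imported_modules:
        top_level = name.split(".")[0]  # "sklearn.cluster" -> "sklearn"
        gpu_library = GPU_REPLACEMENTS.get(top_level)
        if gpu_library and gpu_library not in suggestions:
            suggestions.append(gpu_library)
    return suggestions


print(suggest_gpu_libraries(["numpy", "pandas", "sklearn.cluster", "json"]))
# → ['cupy', 'cudf', 'cuml']
```

A lookup like this is only a starting point; custom kernels (Numba), simulation (Warp), and file IO (KvikIO) depend on what the code does rather than on which library it imports.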
Combining Libraries
Many real workloads benefit from using multiple libraries together. They interoperate via the CUDA Array Interface — zero-copy data sharing between CuPy, Numba, Warp, cuDF, cuML, cuGraph, cuVS, cuCIM, cuSpatial, KvikIO, PyTorch, JAX, and other GPU libraries.
Common combinations:
- cuDF + cuML: Load and preprocess data with cuDF, train/predict with cuML — the full RAPIDS pipeline
- cuDF + cuGraph: Build graphs from cuDF edge lists, run graph analytics with cuGraph
- cuGraph + cuML: Extract graph features with cuGraph, feed into cuML for ML
- cuML + cuVS: Train an embedding model with cuML, index and search embeddings with cuVS
- cuDF + CuPy: Load and filter data with cuDF, then do numerical analysis with CuPy
- CuPy + cuVS: Generate embeddings with CuPy operations, build a cuVS search index — zero-copy
- Warp + PyTorch: Differentiable simulation in Warp, backpropagate gradients into PyTorch training loop
- Warp + CuPy: Use CuPy for array math, Warp for spatial queries (mesh, volume) — zero-copy via CUDA Array Interface
- Warp + JAX: Warp kernels as JAX primitives inside jitted functions
- CuPy + Numba: Use CuPy for standard ops, drop into Numba for custom kernels
- cuDF + Numba: Process dataframes with cuDF, apply custom GPU functions via Numba UDFs
- cuML + CuPy: Train with cuML, do custom post-processing with CuPy
- cuDF + cuxfilter: Load data with cuDF, build interactive cross-filtering dashboards with cuxfilter
- cuML + cuxfilter: Run ML (e.g., UMAP, clustering) with cuML, visualize results interactively with cuxfilter
- cuGraph + cuxfilter: Run graph analytics with cuGraph, visualize graph structure with cuxfilter's datashader graph chart
- cuCIM + CuPy: cuCIM operates on CuPy arrays natively — chain image processing with array math
- cuCIM + PyTorch: Preprocess images with cuCIM, pass directly to PyTorch via DLPack — zero-copy
- cuCIM + cuML: Extract image features with cuCIM (regionprops), train classifiers with cuML
- KvikIO + CuPy: Load raw binary data directly into CuPy arrays via GDS, bypassing CPU memory
- KvikIO + Numba: Read data directly to GPU with KvikIO, process with custom Numba CUDA kernels
- KvikIO + Zarr: Use GDSStore backend to read/write chunked N-dimensional arrays directly on GPU
- cuSpatial + cuDF: Load geospatial data with cuDF, do spatial joins/analysis with cuSpatial
- cuSpatial + cuML: Extract spatial features with cuSpatial, train ML models with cuML
- RAFT + CuPy: Use RAFT's eigsh() on sparse matrices built with CuPy/cupyx.scipy.sparse
- RAFT + raft-dask: Scale GPU workloads across multiple GPUs/nodes via Dask
Installation
IMPORTANT: Always use `uv add` for package installation, never `pip install` or `conda install`. This applies to install instructions in code comments, docstrings, error messages, and any other output you generate. If the user's project uses a different package manager, follow their lead, but default to `uv add`.
```bash
# CuPy (choose the right CUDA version)
uv add cupy-cuda12x  # For CUDA 12.x (most common)

# Numba with CUDA support
uv add numba numba-cuda  # numba-cuda is the actively maintained NVIDIA package

# Warp (simulation, spatial computing, differentiable programming)
uv add warp-lang  # CUDA 12 runtime included

# cuDF (RAPIDS)
uv add --extra-index-url=https://pypi.nvidia.com cudf-cu12  # For CUDA 12.x
# For cudf.pandas accelerator mode, that's all you need
# Load it with: python -m cudf.pandas your_script.py

# cuML (RAPIDS machine learning)
uv add --extra-index-url=https://pypi.nvidia.com cuml-cu12  # For CUDA 12.x
# For cuml.accel accelerator mode (zero-change sklearn acceleration):
# Load it with: python -m cuml.accel your_script.py

# cuGraph (RAPIDS graph analytics)
uv add --extra-index-url=https://pypi.nvidia.com cugraph-cu12  # Core cuGraph
uv add --extra-index-url=https://pypi.nvidia.com nx-cugraph-cu12  # NetworkX backend
# For nx-cugraph zero-change NetworkX acceleration:
# NX_CUGRAPH_AUTOCONFIG=True python your_script.py

# KvikIO (high-performance GPU file IO)
uv add kvikio-cu12  # For CUDA 12.x
# Optional: uv add zarr  # For Zarr GPU backend support

# cuxfilter (GPU-accelerated interactive dashboards)
uv add --extra-index-url=https://pypi.nvidia.com cuxfilter-cu12  # For CUDA 12.x
# Depends on cuDF, which is installed automatically

# cuCIM (RAPIDS image processing: scikit-image on GPU)
uv add --extra-index-url=https://pypi.nvidia.com cucim-cu12  # For CUDA 12.x

# cuVS (RAPIDS vector search)
uv add --extra-index-url=https://pypi.nvidia.com cuvs-cu12  # For CUDA 12.x

# cuSpatial (RAPIDS geospatial)
uv add --extra-index-url=https://pypi.nvidia.com cuspatial-cu12  # For CUDA 12.x

# RAFT (low-level GPU primitives)
uv add --extra-index-url=https://pypi.nvidia.com pylibraft-cu12  # Core primitives
uv add --extra-index-url=https://pypi.nvidia.com raft-dask-cu12  # Multi-GPU support (optional)
```
To check CUDA availability after installation:
```python
# CuPy
import cupy as cp
print(cp.cuda.runtime.getDeviceCount())  # Should be >= 1

# Numba
from numba import cuda
print(cuda.is_available())  # Should be True
print(cuda.detect())  # Shows GPU details

# cuDF
import cudf
print(cudf.Series([1, 2, 3]))  # Should print a GPU series

# cuML
import cuml
print(cuml.__version__)  # Should print version

# cuGraph
import cugraph
print(cugraph.__version__)  # Should print version

# Warp
import warp as wp
wp.init()  # Should print device info

# KvikIO
import kvikio
import kvikio.cufile_driver
print(kvikio.cufile_driver.get("is_gds_available"))  # True if GDS is set up

# cuxfilter
import cuxfilter
print(cuxfilter.__version__)  # Should print version

# cuVS
from cuvs.neighbors import cagra
import cupy as cp
dataset = cp.random.rand(1000, 128, dtype=cp.float32)
index = cagra.build(cagra.IndexParams(), dataset)
print("cuVS working")  # Should print confirmation

# cuSpatial
import cuspatial
from shapely.geometry import Point
gs = cuspatial.GeoSeries([Point(0, 0)])
print("cuSpatial working")  # Should print confirmation

# RAFT (pylibraft)
from pylibraft.common import DeviceResources
handle = DeviceResources()
handle.sync()
print("pylibraft is working")
```
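Before running the per-library checks, it can help to see which packages are installed at all. The sketch below uses only the standard library and does not touch the GPU; it reports whether each import name from the checks above can be found. The `PACKAGES` list and `installed_packages` helper are illustrative names.

```python
import importlib.util

# Import names used in the verification snippets above.
PACKAGES = [
    "cupy", "numba", "warp", "cudf", "cuml", "cugraph",
    "kvikio", "cuxfilter", "cucim", "cuvs", "cuspatial", "pylibraft",
]


def installed_packages(names):
    """Map each top-level module name to True if it can be located."""
    found = {}
    for name in names:
        try:
            found[name] = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            found[name] = False
    return found


status = installed_packages(PACKAGES)
for name, ok in status.items():
    print(f"{name:12s} {'installed' if ok else 'missing'}")
```

A package being importable does not confirm that a GPU or driver is present, so the device checks above are still needed.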
Optimization Workflow
When helping a user optimize code, follow this process:
1. Profile First
Before optimizing, understand where time is actually spent:
```python
import time
# or use cProfile, line_profiler, or py-spy for detailed profiling
```
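A minimal sketch of both approaches, using only the standard library: wall-clock timing with `time.perf_counter` and a function-level breakdown with `cProfile`. The workload `slow_sum_of_squares` and the helper `time_call` are made-up stand-ins for the user's real code.

```python
import cProfile
import pstats
import time


def slow_sum_of_squares(n):
    """Deliberately slow pure-Python loop standing in for real work."""
    total = 0.0
    for i in range(n):
        total += i * i
    return total


def time_call(fn, *args, repeats=3):
    """Return the best wall-clock time over a few repeats."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best


elapsed = time_call(slow_sum_of_squares, 200_000)
print(f"best of 3: {elapsed:.4f} s")

# Function-level breakdown: which calls account for the time?
profiler = cProfile.Profile()
profiler.enable()
result = slow_sum_of_squares(200_000)
profiler.disable()
pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)
```

Taking the best of several repeats reduces noise from other processes; the profiler output then shows which functions are worth moving to the GPU.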
Don't guess; measure. The bottleneck might not be where the user thinks.
2. Assess GPU Suitability
Not all code benefits from GPU acceleration. GPU excels when:
- Data parallelism is high: The same operation applies to thousands/millions of elements
- Compute intensity is high: Many FLOPs per byte of memory accessed
- Data is large enough: GPU overhead means small arrays (< ~10K elements) may be slower on GPU
- Memory fits: Data must fit in GPU memory (typically 8-80 GB)
GPU is a poor fit when:
- Data is tiny (< 10K elements)
- Algorithm is inherently sequential with data dependencies between steps
- Code is I/O bound (disk, network), not compute bound — though KvikIO with GPUDirect Storage can help when IO feeds GPU compute
- Many small, heterogeneous operations (kernel launch overhead dominates)
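The criteria above can be turned into a rough pre-check. The sketch below is a rule-of-thumb illustration only: the function name `gpu_suitability`, its parameters, and the hard 10K-element cutoff are assumptions drawn from the thresholds in the list, not a definitive test.

```python
def gpu_suitability(n_elements, bytes_per_element, gpu_memory_bytes,
                    io_bound=False, sequential=False):
    """Return (is_candidate, reasons_against) from the rules of thumb above."""
    reasons = []
    if n_elements < 10_000:
        reasons.append("data is tiny (< 10K elements)")
    if n_elements * bytes_per_element > gpu_memory_bytes:
        reasons.append("data does not fit in GPU memory")
    if sequential:
        reasons.append("algorithm is inherently sequential")
    if io_bound:
        reasons.append("code is I/O bound rather than compute bound")
    return (not reasons, reasons)


GIB = 2**30
print(gpu_suitability(5_000, 8, 8 * GIB))        # small float64 array
print(gpu_suitability(10_000_000, 4, 8 * GIB))   # 10M float32 values
```

Profiling results should override a check like this; it only flags the obvious poor fits before any GPU code is written.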
3. Start Simple, Then Optimize
- Try the drop-in replacement first. CuPy for NumPy, cudf.pandas for pandas, cuml.accel for sklearn, nx-cugraph for NetworkX. This alone often gives 5-50x speedup.
- Minimize host-device transfers. Keep data on GPU. Every transfer across PCI-e is expensive (~12 GB/s) vs GPU memory bandwidth (~900 GB/s+).
- Batch operations. Fewer large GPU operations beat many small ones.
- Only write custom kernels if needed. CuPy and cuDF use NVIDIA's hand-tuned libraries. Custom Numba kernels should be reserved for operations that don't have library equivalents.
- Profile the GPU version. Use nsys, nvprof (legacy; NVIDIA has replaced it with the Nsight tools on recent GPUs), or CuPy's built-in benchmarking.
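To illustrate the transfer and batching advice, here is a minimal sketch that converts the input once, chains several whole-array operations, and copies back only a final scalar. It runs on NumPy by default so it works without a GPU; passing `xp=cupy` on a machine with CuPy and a CUDA device is the assumed GPU usage. The function name `normalized_energy` is illustrative.

```python
import numpy as np


def normalized_energy(host_data, xp=np):
    """Standardize the data and return its sum of squares using whole-array ops."""
    d = xp.asarray(host_data)        # one host-to-device copy (a no-op for NumPy)
    d = (d - d.mean()) / d.std()     # intermediate results stay where d lives
    energy = xp.square(d).sum()      # one large reduction instead of a Python loop
    return float(energy)             # one device-to-host copy, of a single scalar


data = np.random.default_rng(0).random(1_000_000)
energy = normalized_energy(data)
print(energy)
```

The design point is that only the first and last lines of the function cross the host-device boundary; everything in between operates on the same resident array.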
4. Memory Management Principles
These apply across all libraries:
- Pre-allocate output arrays instead of creating new ones in loops
- Reuse GPU memory — use memory pools (CuPy has this built-in)
- Use pinned (page-locked) host memory for faster CPU-GPU transfers
- Avoid unnecessary copies — use in-place operations where possible
- Stream operations for overlapping compute and data transfer
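A minimal sketch of pre-allocation and in-place updates follows. It uses NumPy so it runs anywhere; CuPy mirrors the same `out=` arguments and in-place operators, so passing `xp=cupy` on a GPU machine is the assumed usage. The function name `accumulate_scaled` is hypothetical.

```python
import numpy as np


def accumulate_scaled(chunks, scale, xp=np):
    """Sum scale * chunk over all chunks without allocating inside the loop."""
    first = xp.asarray(chunks[0])
    total = xp.zeros_like(first)      # output buffer, allocated once
    scratch = xp.empty_like(first)    # reusable temporary, allocated once
    for chunk in chunks:
        xp.multiply(xp.asarray(chunk), scale, out=scratch)  # write into scratch
        total += scratch                                     # in-place add
    return total


chunks = [np.full(4, i, dtype=np.float32) for i in range(3)]
result = accumulate_scaled(chunks, scale=2.0)
print(result)  # each element is 2 * (0 + 1 + 2) = 6
```

With CuPy, the default memory pool already recycles freed device blocks, so explicit buffers like these mainly reduce the number of allocation requests made inside hot loops.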
5. Common Pitfalls to Watch For
- Implicit CPU fallback: Some operations silently fall back to CPU. Watch for warnings.
- Synchronization overhead: GPU operations are asynchronous. Calling .get() or cp.asnumpy() forces a sync.
- dtype mismatches: Use float32 instead of float64 when precision allows. GPU float32 throughput is 2x-32x higher.
- Small kernel launches: Each kernel launch has ~5-20us overhead. Fuse operations when possible.
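Because kernel launches return before the work finishes, a timer that stops without synchronizing measures launch time only. The sketch below shows one way to time an array operation correctly and to compare float32 with float64. It assumes CuPy's `cupy.cuda.Stream.null.synchronize()` when a GPU is present and falls back to NumPy otherwise; `time_op` is an illustrative helper, not a library function.

```python
import time

import numpy as np

try:
    import cupy as cp
    HAVE_GPU = cp.cuda.runtime.getDeviceCount() > 0
except Exception:
    cp = None
    HAVE_GPU = False

xp = cp if HAVE_GPU else np


def time_op(fn, repeats=5):
    """Return the best wall-clock time of fn() over several runs, syncing the GPU."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        if HAVE_GPU:
            cp.cuda.Stream.null.synchronize()  # wait for queued kernels to finish
        best = min(best, time.perf_counter() - start)
    return best


a32 = xp.ones(1_000_000, dtype=xp.float32)
a64 = a32.astype(xp.float64)
t32 = time_op(lambda: xp.sqrt(a32) * 2)
t64 = time_op(lambda: xp.sqrt(a64) * 2)
print(f"float32: {t32:.6f} s, float64: {t64:.6f} s, "
      f"bytes: {a32.nbytes} vs {a64.nbytes}")
```

The measured ratio depends heavily on the device, so treat the printed timings as a local measurement rather than a general figure.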
Code Transformation Patterns
When converting existing CPU code, apply these patterns:
NumPy to CuPy
```python
# Before (CPU)
import numpy as np
a = np.random.rand(10_000_000)
b = np.fft.fft(a)
c = np.sort(b.real)

# After (GPU): often just change the import
import cupy as cp
a = cp.random.rand(10_000_000)
b = cp.fft.fft(a)
c = cp.sort(b.real)
```
pandas to cuDF
```python
# Before (CPU)
import pandas as pd
df = pd.read_parquet("large_data.parquet")
result = df.groupby("category")["value"].mean()

# After (GPU): change the import
import cudf
df = cudf.read_parquet("large_data.parquet")
result = df.groupby("category")["value"].mean()

# Or zero-code-change: python -m cudf.pandas your_script.py
```
Custom loop to Numba CUDA kernel
```python
# Before (CPU): slow Python loop
import math

def process(data, out):
    for i in range(len(data)):
        out[i] = math.sin(data[i]) * math.exp(-data[i])

# After (GPU): Numba kernel
from numba import cuda
import math

@cuda.jit
def process(data, out):
    i = cuda.grid(1)          # global thread index
    if i < data.size:         # guard threads past the end of the array
        out[i] = math.sin(data[i]) * math.exp(-data[i])

# data is a host NumPy array: copy it to the device and allocate the output there
d_data = cuda.to_device(data)
d_out = cuda.device_array_like(d_data)

threads = 256
blocks = (len(data) + threads - 1) // threads
process[blocks, threads](d_data, d_out)
out = d_out.copy_to_host()    # copy the result back only when the CPU needs it
```
NetworkX to cuGraph
```python
# Before (CPU)
import networkx as nx
G = nx.read_edgelist("edges.csv", delimiter=",", nodetype=int)
pr = nx.pagerank(G)
bc = nx.betweenness_centrality(G)

# After (GPU): direct cuGraph API
import cugraph
import cudf
edges = cudf.read_csv("edges.csv", names=["src", "dst"], dtype=["int32", "int32"])
G = cugraph.Graph()
G.from_cudf_edgelist(edges, source="src", destination="dst")
pr = cugraph.pagerank(G)
bc = cugraph.betweenness_centrality(G)

# Or zero-code-change: NX_CUGRAPH_AUTOCONFIG=True python your_script.py
```
scikit-learn to cuML
```python
# Before (CPU)
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)

# After (GPU): change the imports
from cuml.ensemble import RandomForestClassifier
from cuml.preprocessing import StandardScaler
from cuml.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)

# Or zero-code-change: python -m cuml.accel your_script.py
```
Simulation loop to Warp kernel
```python
# Before (CPU): slow Python loop over particles
import numpy as np

def integrate(positions, velocities, forces, dt):
    for i in range(len(positions)):
        velocities[i] += forces[i] * dt
        positions[i] += velocities[i] * dt

# After (GPU): Warp kernel, JIT-compiled to CUDA
import warp as wp

@wp.kernel
def integrate(positions: wp.array(dtype=wp.vec3),
              velocities: wp.array(dtype=wp.vec3),
              forces: wp.array(dtype=wp.vec3),
              dt: float):
    tid = wp.tid()
    velocities[tid] = velocities[tid] + forces[tid] * dt
    positions[tid] = positions[tid] + velocities[tid] * dt

wp.launch(integrate, dim=num_particles,
          inputs=[positions, velocities, forces, 0.01], device="cuda")
```
File IO to GPU with KvikIO
```python
# Before: CPU staging (disk → CPU → GPU)
import numpy as np
import cupy as cp
data = np.fromfile("data.bin", dtype=np.float32)
gpu_data = cp.asarray(data)  # Extra copy through CPU memory

# After: direct to GPU (disk → GPU via GDS)
import cupy as cp
import kvikio
gpu_data = cp.empty(1_000_000, dtype=cp.float32)
with kvikio.CuFile("data.bin", "r") as f:
    f.read(gpu_data)  # Bypasses CPU memory with GPUDirect Storage

# Reading from S3 directly to GPU
with kvikio.RemoteFile.open_s3_url("s3://bucket/data.bin") as f:
    buf = cp.empty(f.nbytes() // 4, dtype=cp.float32)
    f.read(buf)
```
GPU-accelerated dashboard with cuxfilter
```python
# Before: static matplotlib/seaborn plots, no interactivity
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_parquet("large_dataset.parquet")
fig, axes = plt.subplots(1, 2)
df.plot.scatter(x="feature1", y="feature2", ax=axes[0])
df["category"].value_counts().plot.bar(ax=axes[1])
plt.show()

# After (GPU): interactive cross-filtering dashboard
import cudf
import cuxfilter
df = cudf.read_parquet("large_dataset.parquet")
cux_df = cuxfilter.DataFrame.from_dataframe(df)
scatter = cuxfilter.charts.scatter(x="feature1", y="feature2", pixel_shade_type="linear")
bar = cuxfilter.charts.bar("category")
slider = cuxfilter.charts.range_slider("value_col")
d = cux_df.dashboard(
    [scatter, bar],
    sidebar=[slider],
    layout=cuxfilter.layouts.feature_and_base,
    theme=cuxfilter.themes.rapids_dark,
    title="Interactive Explorer",
)
d.app()  # or d.show() for standalone web app
```
scikit-image to cuCIM
```python
# Before (CPU)
from skimage.filters import gaussian, sobel, threshold_otsu
from skimage.morphology import binary_opening, disk
from skimage.measure import label, regionprops_table
import numpy as np
blurred = gaussian(image, sigma=3)
binary = blurred > threshold_otsu(blurred)
cleaned = binary_opening(binary, footprint=disk(3))
labels = label(cleaned)
props = regionprops_table(labels, image, properties=['area', 'centroid'])

# After (GPU): change imports, wrap input with cp.asarray
from cucim.skimage.filters import gaussian, sobel, threshold_otsu
from cucim.skimage.morphology import binary_opening, disk
from cucim.skimage.measure import label, regionprops_table
import cupy as cp
image_gpu = cp.asarray(image)  # Transfer once
blurred = gaussian(image_gpu, sigma=3)
binary = blurred > threshold_otsu(blurred)
cleaned = binary_opening(binary, footprint=disk(3))
labels = label(cleaned)
props = regionprops_table(labels, image_gpu, properties=['area', 'centroid'])
```
GeoPandas to cuSpatial
```python
# Before (CPU)
import geopandas as gpd
from shapely.geometry import Point
points = gpd.GeoDataFrame(geometry=[Point(x, y) for x, y in coords], crs="EPSG:4326")
polygons = gpd.read_file("regions.geojson")
joined = gpd.sjoin(points, polygons, predicate="within")

# After (GPU): convert and use cuSpatial
import cuspatial
import cudf
points_cu = cuspatial.from_geopandas(points)
polygons_cu = cuspatial.from_geopandas(polygons)
joined = cuspatial.point_in_polygon(
    points_cu.geometry.x, points_cu.geometry.y,
    polygons_cu.geometry
)
```
Faiss/Annoy to cuVS
```python
# Before (CPU): Faiss exact (brute-force) search
import faiss
import numpy as np
embeddings = np.random.rand(1_000_000, 128).astype(np.float32)
index = faiss.IndexFlatL2(128)
index.add(embeddings)
distances, neighbors = index.search(queries, k=10)

# After (GPU): cuVS CAGRA, an approximate graph-based index (often orders of magnitude faster)
import cupy as cp
from cuvs.neighbors import cagra
embeddings = cp.random.rand(1_000_000, 128, dtype=cp.float32)
index = cagra.build(cagra.IndexParams(), embeddings)
distances, neighbors = cagra.search(cagra.SearchParams(), index, queries, k=10)
```
scipy.sparse.linalg to RAFT
```python
# Before (CPU)
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import eigsh
A = sparse_random(10000, 10000, density=0.01, format="csr", dtype=np.float32)
A = A + A.T  # Make symmetric
eigenvalues, eigenvectors = eigsh(A, k=10, which="LM")

# After (GPU): RAFT sparse eigensolver
import cupy as cp
import cupyx.scipy.sparse as sp_gpu
from pylibraft.sparse.linalg import eigsh as gpu_eigsh
A_gpu = sp_gpu.csr_matrix(A)  # Transfer to GPU
eigenvalues, eigenvectors = gpu_eigsh(A_gpu, k=10, which="LM")
```
Important Notes
- Always handle the case where no GPU is available — provide a CPU fallback or clear error message
- Test numerical correctness against CPU results (GPU floating point may differ slightly due to operation ordering)
- GPU memory is limited — for datasets larger than GPU memory, consider chunking or using RAPIDS Dask for multi-GPU
- The CUDA Array Interface enables zero-copy sharing between CuPy, Numba, Warp, cuDF, cuML, cuGraph, cuVS, cuSpatial, KvikIO, PyTorch, and JAX arrays on GPU
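The first two notes can be combined into one pattern: pick the GPU backend only when it is usable, then compare its output against a NumPy reference within a tolerance. This is a sketch under assumptions: `select_backend` and `softmax` are illustrative names, and the tolerances are placeholders to tune per workload.

```python
import numpy as np


def select_backend():
    """Return (module, name): CuPy when a CUDA device is usable, else NumPy."""
    try:
        import cupy as cp
        if cp.cuda.runtime.getDeviceCount() > 0:
            return cp, "gpu"
    except Exception:
        pass  # CuPy missing, or no driver/device: fall back to the CPU
    return np, "cpu"


def softmax(x, xp):
    """Numerically stable softmax along the last axis."""
    shifted = x - x.max(axis=-1, keepdims=True)
    e = xp.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)


xp, backend = select_backend()
host_x = np.random.default_rng(0).standard_normal((256, 64)).astype(np.float32)

reference = softmax(host_x, np)                # CPU ground truth
candidate = softmax(xp.asarray(host_x), xp)    # runs on the GPU if one is available
candidate_host = candidate.get() if backend == "gpu" else candidate

# Tolerances are illustrative; reductions may be reordered on the GPU.
matches = np.allclose(candidate_host, reference, rtol=1e-5, atol=1e-6)
max_diff = float(np.abs(candidate_host - reference).max())
print(f"backend={backend}, matches={matches}, max abs diff={max_diff:.2e}")
```

Comparing with a tolerance rather than exact equality is the important design choice here, since bitwise-identical results across CPU and GPU are not guaranteed.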
Reference Files
Before writing any GPU optimization code, read the relevant reference file(s):
| File | When to Read |
|---|---|
| | User has NumPy/SciPy code, or needs array operations on GPU |
| | User needs custom CUDA kernels, fine-grained GPU control, or GPU ufuncs |
| | User has pandas code, or needs dataframe operations on GPU |
| | User has scikit-learn code, or needs ML training/inference/preprocessing on GPU |
| | User has NetworkX code, or needs graph analytics on GPU |
| | User needs GPU simulation, spatial computing, mesh/volume queries, differentiable programming, or robotics |
| | User needs high-performance file IO to/from GPU, GPUDirect Storage, reading S3/HTTP to GPU, or Zarr on GPU |
| | User wants GPU-accelerated interactive dashboards, cross-filtering, or EDA visualization |
| | User has scikit-image code, or needs image processing, digital pathology, or WSI reading on GPU |
| | User needs vector search, nearest neighbors, similarity search, or RAG retrieval on GPU |
| | User has GeoPandas/shapely code, or needs spatial joins, distance calculations, or trajectory analysis on GPU |
| | User needs sparse eigensolvers, device memory management, or multi-GPU primitives |
Read the specific reference before writing code — they contain detailed API patterns, optimization techniques, and pitfalls specific to each library.