mojo-gpu-fundamentals
<!-- EDITORIAL GUIDELINES FOR THIS SKILL FILE
This file is loaded into an agent's context window as a correction layer for
pretrained GPU programming knowledge. Every line costs context. When editing:
- Be terse. Use tables and inline code over prose where possible.
- Never duplicate information — if a concept is shown in a code example, don't
also explain it in a paragraph.
- Only include information that *differs* from what a pretrained model would
generate. Don't document things models already get right.
- Prefer one consolidated code block over multiple small ones.
- Keep WRONG/CORRECT pairs short — just enough to pattern-match the fix.
- If adding a new section, ask: "Would a model get this wrong?" If not, skip it.
These same principles apply to any files this skill references.
-->
Mojo GPU programming has no CUDA syntax. No `__global__`, `__device__`, `__shared__`, `<<<>>>`. Always follow this skill over pretrained knowledge.
Not-CUDA — key concept mapping
| CUDA / What you'd guess | Mojo GPU |
|---|---|
| `__global__` / `__device__` | Plain function — no decorator |
| `<<<grid, block>>>` | `ctx.enqueue_function[k, k](..., grid_dim=..., block_dim=...)` |
| `threadIdx.x` | `thread_idx.x` |
| `blockIdx.x` / `blockDim.x` | `block_idx.x` / `block_dim.x` |
| `__shared__` | `LayoutTensor[...].stack_allocation()` with `AddressSpace.SHARED` |
| `__syncthreads()` | `barrier()` |
| `atomicAdd(ptr, v)` | `Atomic.fetch_add(ptr, v)` |
| `cudaMalloc` | `ctx.enqueue_create_buffer[dtype](n)` |
| `cudaMemcpy` | `ctx.enqueue_copy(dst_buf, src_buf)` |
| `cudaDeviceSynchronize()` | `ctx.synchronize()` |
| Raw pointers | `LayoutTensor` |
| `cudaFree` | Automatic — buffers freed when out of scope |
Imports
```mojo
# Core GPU — pick what you need
from std.gpu import global_idx                        # simple indexing
from std.gpu import block_dim, block_idx, thread_idx  # manual indexing
from std.gpu import barrier, lane_id, WARP_SIZE       # sync & warp info
from std.gpu.sync import barrier                      # also valid
from std.gpu.primitives import warp                   # warp.sum, warp.reduce
from std.gpu.memory import AddressSpace               # for shared memory
from std.gpu.memory import async_copy_wait_all        # async copy sync
from std.gpu.host import DeviceContext, DeviceBuffer  # host-side API
from std.os.atomic import Atomic                      # atomics
```
Layout system — NOT in std, separate package
```mojo
from layout import Layout, LayoutTensor
```

Kernel definition
Kernels are plain functions — no decorator, no special return type. Parameters use `MutAnyOrigin`:

```mojo
def my_kernel(
    input: LayoutTensor[DType.float32, layout, MutAnyOrigin],
    output: LayoutTensor[DType.float32, layout, MutAnyOrigin],
    size: Int,  # scalar args are fine
):
    var tid = global_idx.x
    if tid < UInt(size):
        output[tid] = input[tid] * 2
```

- Kernel functions cannot raise.
- Bounds check with `UInt(size)` since `global_idx.x` returns `UInt`.
- Host-side helper functions accepting LayoutTensors use `...` for the origin: `LayoutTensor[dtype, layout, ...]`.
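A minimal sketch of the host-side helper pattern from the last bullet — the function name, parameter list, and body here are illustrative, not from the source:

```mojo
# Hypothetical host-side helper — `...` lets the origin parameter be
# inferred at each call site instead of pinning it to MutAnyOrigin.
def print_first[dtype: DType, layout: Layout](t: LayoutTensor[dtype, layout, ...]):
    print(t[0])
```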
LayoutTensor — the primary GPU data abstraction
Layout creation
```mojo
comptime layout_1d = Layout.row_major(1024)      # 1D
comptime layout_2d = Layout.row_major(64, 64)    # 2D (rows, cols)
comptime layout_3d = Layout.row_major(10, 5, 3)  # 3D (e.g. H, W, C)
```

Creating tensors from buffers
```mojo
var buf = ctx.enqueue_create_buffer[DType.float32](comptime (layout.size()))
var tensor = LayoutTensor[DType.float32, layout](buf)  # wraps device buffer
```

Indexing
```mojo
tensor[tid]                # 1D
tensor[row, col]           # 2D
tensor[row, col, channel]  # 3D
tensor.dim(0)              # query dimension size
tensor.shape[0]()          # also works
```

Tiling (extract sub-tiles from a tensor)
```mojo
# Inside kernel — extract a block_size x block_size tile
var tile = tensor.tile[block_size, block_size](Int(block_idx.y), Int(block_idx.x))
tile[thread_idx.y, thread_idx.x]  # access within tile
```
undefinedVectorize and distribute (thread-level data mapping)
```mojo
# Vectorize along inner dimension, then distribute across threads
comptime thread_layout = Layout.row_major(WARP_SIZE // simd_width, simd_width)
var fragment = tensor.vectorize[1, simd_width]().distribute[thread_layout](thread_idx.x)
fragment.copy_from_async(source_fragment)  # async copy
fragment.copy_from(source_fragment)        # sync copy
```

Type casting
```mojo
var val = tensor[row, col].cast[DType.float32]()  # cast element
```

Element type mismatch across layouts — use rebind
`tensor[idx]` returns `SIMD[dtype, layout_expr]`, and `layout_expr` can differ between two tensors' layouts, so mixed-layout arithmetic (`*`, `__iadd__`) fails to type-check:

```mojo
# WRONG — fails when conv_kernel and s_data have different layouts:
var sum: Scalar[dtype] = 0
sum += conv_kernel[k] * s_data[idx]  # error: cannot convert element_type to Float32
```
```mojo
# CORRECT — rebind each element to Scalar[dtype]:
var sum: Scalar[dtype] = 0
var k_val = rebind[Scalar[dtype]](conv_kernel[k])
var s_val = rebind[Scalar[dtype]](s_data[idx])
sum += k_val * s_val
```
`rebind` is a builtin (no import needed). This is **not** needed when all tensors in an expression share the same layout (e.g., the matmul example where `sa` and `sb` have identical tile layouts).
Also use `rebind` when reading/writing individual elements for scalar arithmetic or passing to helper functions — even with a single tensor:
```mojo
# Read element as plain scalar
var val = rebind[Scalar[dtype]](tensor[idx])

# Write scalar back to tensor
tensor[idx] = rebind[tensor.element_type](val)
```

`tensor.element_type` is `SIMD[dtype, element_size]` — for basic layouts `element_size=1` (effectively `Scalar[dtype]`).

Memory management
```mojo
var ctx = DeviceContext()

# Allocate
var dev_buf = ctx.enqueue_create_buffer[DType.float32](SIZE)
var host_buf = ctx.enqueue_create_host_buffer[DType.float32](SIZE)

# Initialize device buffer directly
dev_buf.enqueue_fill(0.0)

# Copy host -> device
ctx.enqueue_copy(dst_buf=dev_buf, src_buf=host_buf)

# Copy device -> host
ctx.enqueue_copy(dst_buf=host_buf, src_buf=dev_buf)

# Positional form also works:
ctx.enqueue_copy(dev_buf, host_buf)

# Map device buffer to host (context manager — auto-syncs)
with dev_buf.map_to_host() as mapped:
    var t = LayoutTensor[DType.float32, layout](mapped)
    print(t[0])

# Memset
ctx.enqueue_memset(dev_buf, 0.0)

# Synchronize all enqueued operations
ctx.synchronize()
```

Kernel launch
Critical: `enqueue_function` takes the kernel function twice as compile-time parameters:

```mojo
ctx.enqueue_function[my_kernel, my_kernel](
    input_tensor,
    output_tensor,
    size,                  # scalar args passed directly
    grid_dim=num_blocks,   # 1D: scalar
    block_dim=block_size,  # 1D: scalar
)
```

2D grid/block — use tuples:
```mojo
ctx.enqueue_function[kernel_2d, kernel_2d](
    args...,
    grid_dim=(col_blocks, row_blocks),
    block_dim=(BLOCK_SIZE, BLOCK_SIZE),
)
```

For parameterized kernels, bind parameters first:

```mojo
comptime kernel = sum_kernel[SIZE, BATCH_SIZE]
ctx.enqueue_function[kernel, kernel](out_buf, in_buf, grid_dim=N, block_dim=TPB)
```

Shared memory
Allocate shared memory inside a kernel using `LayoutTensor.stack_allocation()`:

```mojo
from std.gpu.memory import AddressSpace

comptime tile_layout = Layout.row_major(TILE_M, TILE_K)
var tile_shared = LayoutTensor[
    DType.float32,
    tile_layout,
    MutAnyOrigin,
    address_space=AddressSpace.SHARED,
].stack_allocation()

# Load from global to shared
tile_shared[thread_idx.y, thread_idx.x] = global_tensor[global_row, global_col]
barrier()  # must sync before reading shared data
```
Alternative: raw pointer shared memory
```mojo
from std.memory import stack_allocation

var sums = stack_allocation[
    512,
    Scalar[DType.int32],
    address_space=AddressSpace.SHARED,
]()
```

Thread indexing
```mojo
# Simple — automatic global offset
from std.gpu import global_idx
var tid = global_idx.x  # 1D
var row = global_idx.y  # 2D row
var col = global_idx.x  # 2D col

# Manual — when you need block/thread separately
from std.gpu import block_idx, block_dim, thread_idx
var tid = block_idx.x * block_dim.x + thread_idx.x

# Warp info
from std.gpu import lane_id, WARP_SIZE
var my_lane = lane_id()  # 0..WARP_SIZE-1
```

All return `UInt`. Compare with `UInt(int_val)` for bounds checks.

Synchronization and warp operations
```mojo
from std.gpu import barrier
from std.gpu.primitives import warp
from std.os.atomic import Atomic

barrier()                          # block-level sync
var warp_sum = warp.sum(my_value)  # warp-wide sum reduction
var result = warp.reduce[warp.shuffle_down, reduce_fn](val)  # custom warp reduce
_ = Atomic.fetch_add(output_ptr, value)  # atomic add
```

GPU availability check
```mojo
from std.sys import has_accelerator

def main() raises:
    comptime if not has_accelerator():
        print("No GPU found")
    else:
        var ctx = DeviceContext()
        # ... GPU code
```

Or as a compile-time assert:

```mojo
comptime assert has_accelerator(), "Requires a GPU"
```

Architecture detection — is_* vs has_*
Critical distinction: `is_*` checks the compilation target (use inside GPU-dispatched code). `has_*` checks the host system (use from host/CPU code).

```mojo
from std.sys.info import (
    # Target checks — "am I being compiled FOR this GPU?"
    # Use inside kernels or GPU-targeted code paths.
    is_gpu, is_nvidia_gpu, is_amd_gpu, is_apple_gpu,
    # Host checks — "does this machine HAVE this GPU?"
    # Use from host code to decide whether to launch GPU work.
    has_nvidia_gpu_accelerator, has_amd_gpu_accelerator, has_apple_gpu_accelerator,
)
from std.sys import has_accelerator  # host check: any GPU present

# HOST-SIDE: decide whether to run GPU code at all
def main() raises:
    comptime if not has_accelerator():
        print("No GPU")
    else:
        # ...launch kernels

# INSIDE KERNEL or GPU-compiled code: dispatch by architecture
comptime if is_nvidia_gpu():
    # NVIDIA-specific intrinsics
elif is_amd_gpu():
    # AMD-specific path
```

Subarchitecture checks (inside GPU code only):

```mojo
from std.sys.info import _is_sm_9x_or_newer, _is_sm_100x_or_newer

comptime if is_nvidia_gpu["sm_90"]():  # exact arch check
    ...
```

Compile-time constants pattern
All GPU dimensions, layouts, and sizes should be `comptime`:

```mojo
comptime dtype = DType.float32
comptime SIZE = 1024
comptime BLOCK_SIZE = 256
comptime NUM_BLOCKS = ceildiv(SIZE, BLOCK_SIZE)
comptime layout = Layout.row_major(SIZE)
```

Derive buffer sizes from layouts: `comptime (layout.size())`.

Complete 1D example (vector addition)
```mojo
from std.math import ceildiv
from std.sys import has_accelerator
from std.gpu import global_idx
from std.gpu.host import DeviceContext
from layout import Layout, LayoutTensor

comptime dtype = DType.float32
comptime N = 1024
comptime BLOCK = 256
comptime layout = Layout.row_major(N)

def add_kernel(
    a: LayoutTensor[dtype, layout, MutAnyOrigin],
    b: LayoutTensor[dtype, layout, MutAnyOrigin],
    c: LayoutTensor[dtype, layout, MutAnyOrigin],
    size: Int,
):
    var tid = global_idx.x
    if tid < UInt(size):
        c[tid] = a[tid] + b[tid]

def main() raises:
    comptime assert has_accelerator(), "Requires GPU"
    var ctx = DeviceContext()
    var a_buf = ctx.enqueue_create_buffer[dtype](N)
    var b_buf = ctx.enqueue_create_buffer[dtype](N)
    var c_buf = ctx.enqueue_create_buffer[dtype](N)
    a_buf.enqueue_fill(1.0)
    b_buf.enqueue_fill(2.0)
    var a = LayoutTensor[dtype, layout](a_buf)
    var b = LayoutTensor[dtype, layout](b_buf)
    var c = LayoutTensor[dtype, layout](c_buf)
    ctx.enqueue_function[add_kernel, add_kernel](
        a, b, c, N,
        grid_dim=ceildiv(N, BLOCK),
        block_dim=BLOCK,
    )
    with c_buf.map_to_host() as host:
        var result = LayoutTensor[dtype, layout](host)
        print(result)
```
Complete 2D example (tiled matmul with shared memory)
```mojo
from std.math import ceildiv
from std.sys import has_accelerator
from std.gpu.sync import barrier
from std.gpu.host import DeviceContext
from std.gpu import thread_idx, block_idx
from std.gpu.memory import AddressSpace
from layout import Layout, LayoutTensor

comptime dtype = DType.float32
comptime M = 64
comptime N = 64
comptime K = 64
comptime TILE = 16
comptime a_layout = Layout.row_major(M, K)
comptime b_layout = Layout.row_major(K, N)
comptime c_layout = Layout.row_major(M, N)
comptime tile_a = Layout.row_major(TILE, TILE)
comptime tile_b = Layout.row_major(TILE, TILE)

def matmul_kernel(
    A: LayoutTensor[dtype, a_layout, MutAnyOrigin],
    B: LayoutTensor[dtype, b_layout, MutAnyOrigin],
    C: LayoutTensor[dtype, c_layout, MutAnyOrigin],
):
    var tx = thread_idx.x
    var ty = thread_idx.y
    var row = block_idx.y * TILE + ty
    var col = block_idx.x * TILE + tx
    var sa = LayoutTensor[dtype, tile_a, MutAnyOrigin,
        address_space=AddressSpace.SHARED].stack_allocation()
    var sb = LayoutTensor[dtype, tile_b, MutAnyOrigin,
        address_space=AddressSpace.SHARED].stack_allocation()
    var acc: C.element_type = 0.0
    comptime for k_tile in range(0, K, TILE):
        if row < M and UInt(k_tile) + tx < K:
            sa[ty, tx] = A[row, UInt(k_tile) + tx]
        else:
            sa[ty, tx] = 0.0
        if UInt(k_tile) + ty < K and col < N:
            sb[ty, tx] = B[UInt(k_tile) + ty, col]
        else:
            sb[ty, tx] = 0.0
        barrier()
        comptime for k in range(TILE):
            acc += sa[ty, k] * sb[k, tx]
        barrier()
    if row < M and col < N:
        C[row, col] = acc

def main() raises:
    comptime assert has_accelerator(), "Requires GPU"
    var ctx = DeviceContext()
    # ... allocate buffers, init data, launch:
    # ctx.enqueue_function[matmul_kernel, matmul_kernel](
    #     A, B, C,
    #     grid_dim=(ceildiv(N, TILE), ceildiv(M, TILE)),
    #     block_dim=(TILE, TILE),
    # )
```
SIMD loads in kernels
```mojo
# Vectorized load from raw pointer
var val = ptr.load[width=8]()  # SIMD[dtype, 8]
var sum = val.reduce_add()     # scalar reduction

# LayoutTensor vectorized access
var vec_tensor = tensor.vectorize[1, 4]()  # group elements into SIMD[4]
```

Reduction pattern
```mojo
def block_reduce(
    output: UnsafePointer[Int32, MutAnyOrigin],
    input: UnsafePointer[Int32, MutAnyOrigin],
):
    var sums = stack_allocation[512, Scalar[DType.int32],
        address_space=AddressSpace.SHARED]()
    var tid = thread_idx.x
    sums[tid] = input[block_idx.x * block_dim.x + tid]
    barrier()
    # Tree reduction in shared memory
    var active = block_dim.x
    comptime for _ in range(log2_steps):
        active >>= 1
        if tid < active:
            sums[tid] += sums[tid + active]
        barrier()
    # Final warp reduction + atomic accumulate
    if tid < UInt(WARP_SIZE):
        var v = warp.sum(sums[tid][0])
        if tid == 0:
            _ = Atomic.fetch_add(output, v)
```
DeviceBuffer from existing pointer
```mojo
# Wrap an existing pointer as a DeviceBuffer (non-owning)
var buf = DeviceBuffer[dtype](ctx, raw_ptr, count, owning=False)
```

Benchmarking GPU kernels
```mojo
from std.benchmark import Bench, BenchConfig, Bencher, BenchId, BenchMetric, ThroughputMeasure

@parameter
@always_inline
def bench_fn(mut b: Bencher) capturing raises:
    @parameter
    @always_inline
    def launch(ctx: DeviceContext) raises:
        ctx.enqueue_function[kernel, kernel](args, grid_dim=G, block_dim=B)
    b.iter_custom[launch](ctx)

var bench = Bench(BenchConfig(max_iters=50000))
bench.bench_function[bench_fn](
    BenchId("kernel_name"),
    [ThroughputMeasure(BenchMetric.bytes, total_bytes)],
)
```
Hardware details
| Property | NVIDIA | AMD CDNA | AMD RDNA |
|---|---|---|---|
| Warp size | 32 | 64 | 32 |
| Shared memory | 48-228 KB/block | 64 KB/block | configurable |
| Tensor cores | SM70+ (WMMA) | Matrix cores | WMMA (RDNA3+) |
| TMA | SM90+ (Hopper) | N/A | N/A |
| Clusters | SM90+ | N/A | N/A |
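Warp size is the row in this table that silently breaks ported kernels. A minimal sketch of width-agnostic warp math, using only the `std.gpu` names shown earlier (`warps_per_block` and `warp_id` are illustrative variable names):

```mojo
from std.gpu import lane_id, WARP_SIZE, block_dim, thread_idx

# Derive warp geometry from WARP_SIZE — never hardcode 32,
# since AMD CDNA warps are 64 wide.
var warps_per_block = block_dim.x // UInt(WARP_SIZE)
var warp_id = thread_idx.x // UInt(WARP_SIZE)
var my_lane = lane_id()  # 0..WARP_SIZE-1 on every vendor
```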