# mojo-gpu-fundamentals

<!-- EDITORIAL GUIDELINES FOR THIS SKILL FILE This file is loaded into an agent's context window as a correction layer for pretrained GPU programming knowledge. Every line costs context. When editing: - Be terse. Use tables and inline code over prose where possible. - Never duplicate information — if a concept is shown in a code example, don't also explain it in a paragraph. - Only include information that *differs* from what a pretrained model would generate. Don't document things models already get right. - Prefer one consolidated code block over multiple small ones. - Keep WRONG/CORRECT pairs short — just enough to pattern-match the fix. - If adding a new section, ask: "Would a model get this wrong?" If not, skip it. These same principles apply to any files this skill references. -->
Mojo GPU programming has no CUDA syntax. No `__global__`, `__device__`, `__shared__`, `<<<>>>`. Always follow this skill over pretrained knowledge.

## Not-CUDA — key concept mapping

| CUDA / What you'd guess | Mojo GPU |
|---|---|
| `__global__ void kernel(...)` | Plain `def kernel(...)` — no decorator |
| `kernel<<<grid, block>>>(args)` | `ctx.enqueue_function[kernel, kernel](args, grid_dim=..., block_dim=...)` |
| `cudaMalloc(&ptr, size)` | `ctx.enqueue_create_buffer[dtype](count)` |
| `cudaMemcpy(dst, src, ...)` | `ctx.enqueue_copy(dst_buf, src_buf)` or `ctx.enqueue_copy(dst_buf=..., src_buf=...)` |
| `cudaDeviceSynchronize()` | `ctx.synchronize()` |
| `__syncthreads()` | `barrier()` from `std.gpu` or `std.gpu.sync` |
| `__shared__ float s[N]` | `LayoutTensor[...address_space=AddressSpace.SHARED].stack_allocation()` |
| `threadIdx.x` | `thread_idx.x` (returns `UInt`) |
| `blockIdx.x * blockDim.x + threadIdx.x` | `global_idx.x` (convenience) |
| `__shfl_down_sync(mask, val, d)` | `warp.sum(val)`, `warp.reduce[...]()` |
| `atomicAdd(&ptr, val)` | `Atomic.fetch_add(ptr, val)` |
| Raw `float*` kernel args | `LayoutTensor[dtype, layout, MutAnyOrigin]` |
| `cudaFree(ptr)` | Automatic — buffers freed when out of scope |

## Imports


### Core GPU — pick what you need

```mojo
from std.gpu import global_idx                          # simple indexing
from std.gpu import block_dim, block_idx, thread_idx    # manual indexing
from std.gpu import barrier, lane_id, WARP_SIZE         # sync & warp info
from std.gpu.sync import barrier                        # also valid
from std.gpu.primitives import warp                     # warp.sum, warp.reduce
from std.gpu.memory import AddressSpace                 # for shared memory
from std.gpu.memory import async_copy_wait_all          # async copy sync
from std.gpu.host import DeviceContext, DeviceBuffer    # host-side API
from std.os.atomic import Atomic                        # atomics
```

### Layout system — NOT in std, separate package

```mojo
from layout import Layout, LayoutTensor
```

## Kernel definition

Kernels are plain functions — no decorator, no special return type. Parameters use `MutAnyOrigin`:

```mojo
def my_kernel(
    input: LayoutTensor[DType.float32, layout, MutAnyOrigin],
    output: LayoutTensor[DType.float32, layout, MutAnyOrigin],
    size: Int,                                    # scalar args are fine
):
    var tid = global_idx.x
    if tid < UInt(size):
        output[tid] = input[tid] * 2
```

- Kernel functions cannot raise.
- Bounds check with `UInt(size)` since `global_idx.x` returns `UInt`.
- Host-side helper functions accepting LayoutTensors use `...` for origin: `LayoutTensor[dtype, layout, ...]`.

## LayoutTensor — the primary GPU data abstraction

### Layout creation

```mojo
comptime layout_1d = Layout.row_major(1024)                # 1D
comptime layout_2d = Layout.row_major(64, 64)              # 2D (rows, cols)
comptime layout_3d = Layout.row_major(10, 5, 3)            # 3D (e.g. H, W, C)
```

### Creating tensors from buffers

```mojo
var buf = ctx.enqueue_create_buffer[DType.float32](comptime (layout.size()))
var tensor = LayoutTensor[DType.float32, layout](buf)     # wraps device buffer
```

### Indexing

```mojo
tensor[tid]                     # 1D
tensor[row, col]                # 2D
tensor[row, col, channel]       # 3D
tensor.dim(0)                   # query dimension size
tensor.shape[0]()               # also works
```

### Tiling (extract sub-tiles from a tensor)

```mojo
# Inside kernel — extract a block_size x block_size tile
var tile = tensor.tile[block_size, block_size](Int(block_idx.y), Int(block_idx.x))
tile[thread_idx.y, thread_idx.x]    # access within tile
```

### Vectorize and distribute (thread-level data mapping)

```mojo
# Vectorize along inner dimension, then distribute across threads
comptime thread_layout = Layout.row_major(WARP_SIZE // simd_width, simd_width)
var fragment = tensor.vectorize[1, simd_width]().distribute[thread_layout](thread_idx.x)
fragment.copy_from_async(source_fragment)   # async copy
fragment.copy_from(source_fragment)         # sync copy
```

### Type casting

```mojo
var val = tensor[row, col].cast[DType.float32]()    # cast element
```

### Element type mismatch across layouts — use `rebind`

`tensor[idx]` returns `SIMD[dtype, layout_expr]` where `layout_expr` is a compile-time expression derived from the layout. Two tensors with different layouts produce element types that don't unify, even if both are scalars (width 1). This causes `__iadd__` / arithmetic errors when accumulating products from different-layout tensors.

WRONG — fails when `conv_kernel` and `s_data` have different layouts:

```mojo
var sum: Scalar[dtype] = 0
sum += conv_kernel[k] * s_data[idx]   # error: cannot convert element_type to Float32
```

CORRECT — rebind each element to `Scalar[dtype]`:

```mojo
var sum: Scalar[dtype] = 0
var k_val = rebind[Scalar[dtype]](conv_kernel[k])
var s_val = rebind[Scalar[dtype]](s_data[idx])
sum += k_val * s_val
```

`rebind` is a builtin (no import needed). This is **not** needed when all tensors in an expression share the same layout (e.g., the matmul example where `sa` and `sb` have identical tile layouts).

Also use `rebind` when reading/writing individual elements for scalar arithmetic or passing to helper functions — even with a single tensor:

```mojo
# Read element as plain scalar
var val = rebind[Scalar[dtype]](tensor[idx])

# Write scalar back to tensor
tensor[idx] = rebind[tensor.element_type](val)
```

`tensor.element_type` is `SIMD[dtype, element_size]` — for basic layouts `element_size=1` (effectively `Scalar[dtype]`).
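The helper-function case follows the same pattern; a minimal sketch (the `scale` helper is hypothetical, introduced only for illustration):

```mojo
fn scale(x: Scalar[DType.float32], factor: Scalar[DType.float32]) -> Scalar[DType.float32]:
    return x * factor

# tensor[i] does not unify with Scalar[DType.float32] directly — rebind first
var y = scale(rebind[Scalar[DType.float32]](tensor[i]), 2.0)
```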

## Memory management

```mojo
var ctx = DeviceContext()

# Allocate
var dev_buf = ctx.enqueue_create_buffer[DType.float32](SIZE)
var host_buf = ctx.enqueue_create_host_buffer[DType.float32](SIZE)

# Initialize device buffer directly
dev_buf.enqueue_fill(0.0)

# Copy host -> device
ctx.enqueue_copy(dst_buf=dev_buf, src_buf=host_buf)

# Copy device -> host
ctx.enqueue_copy(dst_buf=host_buf, src_buf=dev_buf)

# Positional form also works
ctx.enqueue_copy(dev_buf, host_buf)

# Map device buffer to host (context manager — auto-syncs)
with dev_buf.map_to_host() as mapped:
    var t = LayoutTensor[DType.float32, layout](mapped)
    print(t[0])

# Memset
ctx.enqueue_memset(dev_buf, 0.0)

# Synchronize all enqueued operations
ctx.synchronize()
```

## Kernel launch

Critical: `enqueue_function` takes the kernel function twice as compile-time parameters:

```mojo
ctx.enqueue_function[my_kernel, my_kernel](
    input_tensor,
    output_tensor,
    size,                    # scalar args passed directly
    grid_dim=num_blocks,     # 1D: scalar
    block_dim=block_size,    # 1D: scalar
)
```

2D grid/block — use tuples:

```mojo
ctx.enqueue_function[kernel_2d, kernel_2d](
    args...,
    grid_dim=(col_blocks, row_blocks),
    block_dim=(BLOCK_SIZE, BLOCK_SIZE),
)
```

For parameterized kernels, bind parameters first:

```mojo
comptime kernel = sum_kernel[SIZE, BATCH_SIZE]
ctx.enqueue_function[kernel, kernel](out_buf, in_buf, grid_dim=N, block_dim=TPB)
```

## Shared memory

Allocate shared memory inside a kernel using `LayoutTensor.stack_allocation()`:

```mojo
from std.gpu.memory import AddressSpace

comptime tile_layout = Layout.row_major(TILE_M, TILE_K)
var tile_shared = LayoutTensor[
    DType.float32,
    tile_layout,
    MutAnyOrigin,
    address_space=AddressSpace.SHARED,
].stack_allocation()

# Load from global to shared
tile_shared[thread_idx.y, thread_idx.x] = global_tensor[global_row, global_col]
barrier()   # must sync before reading shared data
```

Alternative: raw pointer shared memory:

```mojo
from std.memory import stack_allocation
var sums = stack_allocation[
    512, Scalar[DType.int32], address_space=AddressSpace.SHARED
]()
```

## Thread indexing

```mojo
# Simple — automatic global offset
from std.gpu import global_idx
var tid = global_idx.x    # 1D
var row = global_idx.y    # 2D row
var col = global_idx.x    # 2D col

# Manual — when you need block/thread separately
from std.gpu import block_idx, block_dim, thread_idx
var tid = block_idx.x * block_dim.x + thread_idx.x

# Warp info
from std.gpu import lane_id, WARP_SIZE
var my_lane = lane_id()   # 0..WARP_SIZE-1
```

All return `UInt`. Compare with `UInt(int_val)` for bounds checks.

## Synchronization and warp operations

```mojo
from std.gpu import barrier
from std.gpu.primitives import warp
from std.os.atomic import Atomic

barrier()                                    # block-level sync
var warp_sum = warp.sum(my_value)            # warp-wide sum reduction
var result = warp.reduce[warp.shuffle_down, reduce_fn](val)  # custom warp reduce
_ = Atomic.fetch_add(output_ptr, value)      # atomic add
```
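The `reduce_fn` above is a binary combiner. A sketch of a max reduction (the exact parameter signature `warp.reduce` expects is an assumption; verify against the installed `std.gpu.primitives`):

```mojo
fn max_fn[dtype: DType, width: Int](a: SIMD[dtype, width], b: SIMD[dtype, width]) -> SIMD[dtype, width]:
    return max(a, b)   # elementwise max combiner

var warp_max = warp.reduce[warp.shuffle_down, max_fn](my_value)
```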

## GPU availability check

```mojo
from std.sys import has_accelerator

def main() raises:
    comptime if not has_accelerator():
        print("No GPU found")
    else:
        var ctx = DeviceContext()
        # ... GPU code
```

Or as a compile-time assert:

```mojo
comptime assert has_accelerator(), "Requires a GPU"
```

## Architecture detection — `is_*` vs `has_*`

Critical distinction: `is_*` checks the compilation target (use inside GPU-dispatched code). `has_*` checks the host system (use from host/CPU code).

```mojo
from std.sys.info import (
    # Target checks — "am I being compiled FOR this GPU?"
    # Use inside kernels or GPU-targeted code paths.
    is_gpu, is_nvidia_gpu, is_amd_gpu, is_apple_gpu,

    # Host checks — "does this machine HAVE this GPU?"
    # Use from host code to decide whether to launch GPU work.
    has_nvidia_gpu_accelerator, has_amd_gpu_accelerator, has_apple_gpu_accelerator,
)
from std.sys import has_accelerator   # host check: any GPU present
```

```mojo
# HOST-SIDE: decide whether to run GPU code at all
def main() raises:
    comptime if not has_accelerator():
        print("No GPU")
    else:
        # ...launch kernels

# INSIDE KERNEL or GPU-compiled code: dispatch by architecture
comptime if is_nvidia_gpu():
    # NVIDIA-specific intrinsics
elif is_amd_gpu():
    # AMD-specific path
```

Subarchitecture checks (inside GPU code only):

```mojo
from std.sys.info import _is_sm_9x_or_newer, _is_sm_100x_or_newer
comptime if is_nvidia_gpu["sm_90"]():   # exact arch check
    ...
```

## Compile-time constants pattern

All GPU dimensions, layouts, and sizes should be `comptime`:

```mojo
comptime dtype = DType.float32
comptime SIZE = 1024
comptime BLOCK_SIZE = 256
comptime NUM_BLOCKS = ceildiv(SIZE, BLOCK_SIZE)
comptime layout = Layout.row_major(SIZE)
```

Derive buffer sizes from layouts: `comptime (layout.size())`.

## Complete 1D example (vector addition)

```mojo
from std.math import ceildiv
from std.sys import has_accelerator
from std.gpu import global_idx
from std.gpu.host import DeviceContext
from layout import Layout, LayoutTensor

comptime dtype = DType.float32
comptime N = 1024
comptime BLOCK = 256
comptime layout = Layout.row_major(N)

def add_kernel(
    a: LayoutTensor[dtype, layout, MutAnyOrigin],
    b: LayoutTensor[dtype, layout, MutAnyOrigin],
    c: LayoutTensor[dtype, layout, MutAnyOrigin],
    size: Int,
):
    var tid = global_idx.x
    if tid < UInt(size):
        c[tid] = a[tid] + b[tid]

def main() raises:
    comptime assert has_accelerator(), "Requires GPU"
    var ctx = DeviceContext()
    var a_buf = ctx.enqueue_create_buffer[dtype](N)
    var b_buf = ctx.enqueue_create_buffer[dtype](N)
    var c_buf = ctx.enqueue_create_buffer[dtype](N)
    a_buf.enqueue_fill(1.0)
    b_buf.enqueue_fill(2.0)

    var a = LayoutTensor[dtype, layout](a_buf)
    var b = LayoutTensor[dtype, layout](b_buf)
    var c = LayoutTensor[dtype, layout](c_buf)

    ctx.enqueue_function[add_kernel, add_kernel](
        a, b, c, N,
        grid_dim=ceildiv(N, BLOCK),
        block_dim=BLOCK,
    )

    with c_buf.map_to_host() as host:
        var result = LayoutTensor[dtype, layout](host)
        print(result)
```

## Complete 2D example (tiled matmul with shared memory)

```mojo
from std.math import ceildiv
from std.sys import has_accelerator
from std.gpu.sync import barrier
from std.gpu.host import DeviceContext
from std.gpu import thread_idx, block_idx
from std.gpu.memory import AddressSpace
from layout import Layout, LayoutTensor

comptime dtype = DType.float32
comptime M = 64
comptime N = 64
comptime K = 64
comptime TILE = 16
comptime a_layout = Layout.row_major(M, K)
comptime b_layout = Layout.row_major(K, N)
comptime c_layout = Layout.row_major(M, N)
comptime tile_a = Layout.row_major(TILE, TILE)
comptime tile_b = Layout.row_major(TILE, TILE)

def matmul_kernel(
    A: LayoutTensor[dtype, a_layout, MutAnyOrigin],
    B: LayoutTensor[dtype, b_layout, MutAnyOrigin],
    C: LayoutTensor[dtype, c_layout, MutAnyOrigin],
):
    var tx = thread_idx.x
    var ty = thread_idx.y
    var row = block_idx.y * TILE + ty
    var col = block_idx.x * TILE + tx

    var sa = LayoutTensor[dtype, tile_a, MutAnyOrigin,
        address_space=AddressSpace.SHARED].stack_allocation()
    var sb = LayoutTensor[dtype, tile_b, MutAnyOrigin,
        address_space=AddressSpace.SHARED].stack_allocation()

    var acc: C.element_type = 0.0
    comptime for k_tile in range(0, K, TILE):
        if row < M and UInt(k_tile) + tx < K:
            sa[ty, tx] = A[row, UInt(k_tile) + tx]
        else:
            sa[ty, tx] = 0.0
        if UInt(k_tile) + ty < K and col < N:
            sb[ty, tx] = B[UInt(k_tile) + ty, col]
        else:
            sb[ty, tx] = 0.0
        barrier()
        comptime for k in range(TILE):
            acc += sa[ty, k] * sb[k, tx]
        barrier()

    if row < M and col < N:
        C[row, col] = acc

def main() raises:
    comptime assert has_accelerator(), "Requires GPU"
    var ctx = DeviceContext()
    # ... allocate buffers, init data, launch:
    # ctx.enqueue_function[matmul_kernel, matmul_kernel](
    #     A, B, C,
    #     grid_dim=(ceildiv(N, TILE), ceildiv(M, TILE)),
    #     block_dim=(TILE, TILE),
    # )
```

## SIMD loads in kernels

```mojo
# Vectorized load from raw pointer
var val = ptr.load[width=8]()       # SIMD[dtype, 8]
var sum = val.reduce_add()          # scalar reduction

# LayoutTensor vectorized access
var vec_tensor = tensor.vectorize[1, 4]()   # group elements into SIMD[4]
```

## Reduction pattern

```mojo
def block_reduce(
    output: UnsafePointer[Int32, MutAnyOrigin],
    input: UnsafePointer[Int32, MutAnyOrigin],
):
    var sums = stack_allocation[512, Scalar[DType.int32],
        address_space=AddressSpace.SHARED]()
    var tid = thread_idx.x
    sums[tid] = input[block_idx.x * block_dim.x + tid]
    barrier()

    # Tree reduction in shared memory
    var active = block_dim.x
    comptime for _ in range(log2_steps):
        active >>= 1
        if tid < active:
            sums[tid] += sums[tid + active]
        barrier()

    # Final warp reduction + atomic accumulate
    if tid < UInt(WARP_SIZE):
        var v = warp.sum(sums[tid][0])
        if tid == 0:
            _ = Atomic.fetch_add(output, v)
```

## DeviceBuffer from existing pointer

```mojo
# Wrap an existing pointer as a DeviceBuffer (non-owning)
var buf = DeviceBuffer[dtype](ctx, raw_ptr, count, owning=False)
```

## Benchmarking GPU kernels

```mojo
from std.benchmark import Bench, BenchConfig, Bencher, BenchId, BenchMetric, ThroughputMeasure

@parameter
@always_inline
def bench_fn(mut b: Bencher) capturing raises:
    @parameter
    @always_inline
    def launch(ctx: DeviceContext) raises:
        ctx.enqueue_function[kernel, kernel](args, grid_dim=G, block_dim=B)
    b.iter_custom[launch](ctx)

var bench = Bench(BenchConfig(max_iters=50000))
bench.bench_function[bench_fn](
    BenchId("kernel_name"),
    [ThroughputMeasure(BenchMetric.bytes, total_bytes)],
)
```

## Hardware details

| Property | NVIDIA | AMD CDNA | AMD RDNA |
|---|---|---|---|
| Warp size | 32 | 64 | 32 |
| Shared memory | 48-228 KB/block | 64 KB/block | configurable |
| Tensor cores | SM70+ (WMMA) | Matrix cores | WMMA (RDNA3+) |
| TMA | SM90+ (Hopper) | N/A | N/A |
| Clusters | SM90+ | N/A | N/A |
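Because warp size differs across vendors, never hardcode 32. A sketch using the `WARP_SIZE` constant from the imports above (assumes `BLOCK_SIZE` is a multiple of `WARP_SIZE`):

```mojo
from std.gpu import thread_idx, lane_id, WARP_SIZE

comptime WARPS_PER_BLOCK = BLOCK_SIZE // WARP_SIZE   # portable on NVIDIA (32) and AMD CDNA (64)
var warp_id = thread_idx.x // UInt(WARP_SIZE)
var lane = lane_id()
```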