TileKernels GPU Kernel Library

Skill by ara.so — Daily 2026 Skills collection.
TileKernels is a high-performance GPU kernel library for LLM operations (MoE routing, FP8/FP4 quantization, transpose, engram gating, Manifold HyperConnection) written in TileLang — a Python DSL for expressing GPU kernels with automatic optimization. Kernels target NVIDIA SM90/SM100 (Hopper/Blackwell) architectures and approach hardware performance limits.

Requirements

  • Python 3.10+
  • PyTorch 2.10+
  • TileLang 0.1.9+
  • NVIDIA SM90 or SM100 GPU (H100/H200/B100/B200)
  • CUDA Toolkit 13.1+

Installation

Development install (recommended for extending/modifying kernels)

pip install -e ".[dev]"

Release install

pip install tile-kernels

Project Structure

tile_kernels/
├── moe/        # MoE routing: top-k selection, token-to-expert mapping, weight normalization
├── quant/      # FP8/FP4/E5M6 quantization with fused SwiGLU ops
├── transpose/  # Batched matrix transpose
├── engram/     # Engram gating: fused RMSNorm, forward/backward, weight gradient reduction
├── mhc/        # Manifold HyperConnection: Sinkhorn normalization, mix split/apply
├── modeling/   # High-level torch.autograd.Function wrappers
├── torch/      # PyTorch reference implementations for validation
└── testing/    # Test and benchmark utilities

Key Modules and Usage

MoE Routing Kernels

python
import torch
from tile_kernels.moe import (
    topk_gating,           # Top-k expert selection and scoring
    token_to_expert_map,   # Token-to-expert mapping
    fused_expand_reduce,   # Fused expansion/reduction
    weight_normalize,      # Weight normalization
)

# Top-k gating: select top-k experts per token
# logits: [num_tokens, num_experts]
logits = torch.randn(1024, 256, device="cuda", dtype=torch.float32)
topk_weights, topk_indices = topk_gating(logits, top_k=8)
# topk_weights: [num_tokens, top_k], topk_indices: [num_tokens, top_k]
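
The remaining imports follow the same pattern. A minimal sketch of the routing helpers: the token_to_expert_map signature matches its use in the Common Patterns section below, while the weight_normalize call is an assumption about its interface, not confirmed API.

python
# Build the token-to-expert routing map (signature as used in Common Patterns below)
routing_map = token_to_expert_map(topk_indices, num_experts=256)

# Renormalize the selected expert weights per token
# (assumed interface; check the module docstrings)
topk_weights = weight_normalize(topk_weights)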

Quantization Kernels

python
import torch
from tile_kernels.quant import (
    per_token_cast_fp8,      # Per-token FP8 quantization
    per_block_cast_fp8,      # Per-block FP8 quantization
    per_channel_cast_fp8,    # Per-channel FP8 quantization
    fused_swiglu_quant_fp8,  # Fused SwiGLU + FP8 quantization
)

# Per-token FP8 quantization
x = torch.randn(1024, 4096, device="cuda", dtype=torch.bfloat16)
x_fp8, scale = per_token_cast_fp8(x)
# x_fp8:  [1024, 4096] in torch.float8_e4m3fn
# scale:  [1024, 1] per-token scales

# Per-block FP8 quantization (common for weight quantization)
w = torch.randn(8192, 4096, device="cuda", dtype=torch.bfloat16)
w_fp8, scale = per_block_cast_fp8(w, block_size=128)

# Fused SwiGLU + FP8 cast (saves memory bandwidth)
gate = torch.randn(1024, 8192, device="cuda", dtype=torch.bfloat16)
up = torch.randn(1024, 8192, device="cuda", dtype=torch.bfloat16)
out_fp8, scale = fused_swiglu_quant_fp8(gate, up)
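
per_channel_cast_fp8 is imported above but not shown. A minimal sketch, assuming it mirrors the per-token variant with one scale per channel (the exact signature and scale shape are assumptions):

python
# Per-channel FP8 quantization (assumed to mirror per_token_cast_fp8)
w = torch.randn(8192, 4096, device="cuda", dtype=torch.bfloat16)
w_fp8, scale = per_channel_cast_fp8(w)
# expected: w_fp8 has w's shape in torch.float8_e4m3fn, scale holds per-channel factors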

Transpose Kernels

python
import torch
from tile_kernels.transpose import batched_transpose

# Batched transpose for MoE weight manipulation
# x: [batch, M, N]
x = torch.randn(32, 1024, 4096, device="cuda", dtype=torch.bfloat16)
x_T = batched_transpose(x)
# x_T: [batch, N, M] = [32, 4096, 1024]
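
Since a transpose is exact, the kernel output can be checked bit-for-bit against eager PyTorch (torch.Tensor.transpose is standard; only batched_transpose comes from the library):

python
# Sanity check against eager PyTorch
assert torch.equal(x_T, x.transpose(1, 2).contiguous())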

Engram Gating Kernels

python
import torch
from tile_kernels.engram import (
    engram_gate_forward,    # Forward pass with fused RMSNorm
    engram_gate_backward,   # Backward pass
    engram_weight_grad,     # Weight gradient reduction
)

# Forward pass
hidden = torch.randn(1024, 2048, device="cuda", dtype=torch.bfloat16)
weight = torch.randn(256, 2048, device="cuda", dtype=torch.bfloat16)
output, norm_hidden = engram_gate_forward(hidden, weight)

Manifold HyperConnection (mHC) Kernels

python
import torch
from tile_kernels.mhc import (
    sinkhorn_normalize,     # Sinkhorn normalization
    mhc_mix_split,          # Mix splitting
    mhc_mix_apply,          # Mix application
)

# Sinkhorn normalization for connection weights
conn_weights = torch.randn(8, 64, device="cuda", dtype=torch.float32)
normalized = sinkhorn_normalize(conn_weights, num_iters=20)
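
For intuition, a Sinkhorn iteration alternately rescales rows and columns toward uniform marginals. The torch sketch below is a generic Sinkhorn-Knopp reference, not the library's exact algorithm (for example, it exponentiates first to obtain a positive matrix):

python
import torch

# Generic Sinkhorn-Knopp iteration (illustrative reference only)
def sinkhorn_ref(logits: torch.Tensor, num_iters: int = 20) -> torch.Tensor:
    m = torch.exp(logits)                    # ensure positive entries
    for _ in range(num_iters):
        m = m / m.sum(dim=-1, keepdim=True)  # normalize rows
        m = m / m.sum(dim=-2, keepdim=True)  # normalize columns
    return m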

High-Level Modeling Layers

python
import torch
from tile_kernels.modeling import EngramGateLayer, MHCPipeline

# EngramGate as a trainable nn.Module-compatible layer
# (uses torch.autograd.Function internally)
gate_layer = EngramGateLayer(
    hidden_size=2048,
    num_experts=256,
).cuda()
hidden_states = torch.randn(1024, 2048, device="cuda", dtype=torch.bfloat16)
gate_output = gate_layer(hidden_states)
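
Because the layer is nn.Module-compatible, gradients flow through the fused kernels like any PyTorch layer. A minimal sketch (the summed loss is a placeholder for illustration, not a recommended objective):

python
# Backward pass works as with any nn.Module
loss = gate_output.float().sum()   # placeholder loss
loss.backward()
assert all(p.grad is not None for p in gate_layer.parameters())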

# Manifold HyperConnection pipeline
mhc = MHCPipeline(
    num_connections=8,
    hidden_size=2048,
).cuda()

Testing

bash
# Test a single module (correctness only, 4 parallel workers)
pytest tests/transpose/test_transpose.py -n 4

# Test with benchmarking
pytest tests/transpose/test_transpose.py --run-benchmark

# Test MoE kernels
pytest tests/moe/ -n 4

# Test quantization kernels
pytest tests/quant/ -n 4

# Test engram kernels
pytest tests/engram/ -n 4

# Full pressure test (all tests, 2 repetitions, 4 workers)
TK_FULL_TEST=1 pytest -n 4 --count 2

# Test a specific quantization variant
pytest tests/quant/test_fp8_cast.py -n 4 --run-benchmark

Common Patterns

Pattern: Fused MoE Forward Pass

python
import torch
from tile_kernels.moe import topk_gating, token_to_expert_map
from tile_kernels.quant import per_token_cast_fp8

def moe_dispatch(hidden_states, gate_weight, top_k=8):
    """Full MoE dispatch using TileKernels."""
    # 1. Compute gating logits
    logits = torch.mm(hidden_states, gate_weight.T)  # [T, E]
    
    # 2. Top-k expert selection
    topk_weights, topk_indices = topk_gating(logits, top_k=top_k)
    
    # 3. Build token-to-expert routing map
    routing_map = token_to_expert_map(topk_indices, num_experts=gate_weight.shape[0])
    
    # 4. Quantize activations before expert computation
    hidden_fp8, scale = per_token_cast_fp8(hidden_states)
    
    return hidden_fp8, scale, topk_weights, routing_map
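
Example invocation with random inputs; shapes follow the earlier examples (1024 tokens, hidden size 2048, 256 experts), and dtypes match the per-token quantization example:

python
hidden_states = torch.randn(1024, 2048, device="cuda", dtype=torch.bfloat16)
gate_weight = torch.randn(256, 2048, device="cuda", dtype=torch.bfloat16)  # [E, H]
hidden_fp8, scale, topk_weights, routing_map = moe_dispatch(hidden_states, gate_weight)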

Pattern: Using PyTorch Reference Implementations for Validation

python
import torch
from tile_kernels.quant import per_token_cast_fp8
from tile_kernels.torch import per_token_cast_fp8 as per_token_cast_fp8_ref

# Compare kernel output vs PyTorch reference
x = torch.randn(512, 4096, device="cuda", dtype=torch.bfloat16)
out_kernel, scale_kernel = per_token_cast_fp8(x)
out_ref, scale_ref = per_token_cast_fp8_ref(x)

# Validate
torch.testing.assert_close(
    out_kernel.float(), out_ref.float(), atol=1e-2, rtol=1e-2
)
print("Kernel matches reference ✓")

Pattern: Benchmarking a Kernel

python
import torch
from tile_kernels.testing import benchmark_kernel
from tile_kernels.transpose import batched_transpose

x = torch.randn(64, 4096, 4096, device="cuda", dtype=torch.bfloat16)

# Using the testing utility
result = benchmark_kernel(
    fn=batched_transpose,
    args=(x,),
    warmup=25,
    rep=100,
)
print(f"Latency: {result.mean:.3f} ms, Bandwidth: {result.gbps:.1f} GB/s")
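
For a quick measurement without the testing utility, standard CUDA-event timing also works (plain PyTorch, nothing library-specific):

python
# Manual timing with CUDA events (generic PyTorch pattern)
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
for _ in range(25):                  # warmup
    batched_transpose(x)
torch.cuda.synchronize()
start.record()
for _ in range(100):
    batched_transpose(x)
end.record()
torch.cuda.synchronize()
print(f"Latency: {start.elapsed_time(end) / 100:.3f} ms")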

Pattern: Custom TileLang Kernel (extending the library)

python
# tile_kernels follow TileLang DSL patterns
import tilelang
import tilelang.language as T

def make_elementwise_scale_kernel(M, N, dtype="float16"):
    @T.prim_func
    def scale_kernel(
        A: T.Buffer((M, N), dtype),
        scale: T.Buffer((M,), "float32"),
        B: T.Buffer((M, N), dtype),
    ):
        # TileLang kernel body
        for i, j in T.grid(M, N):
            B[i, j] = T.cast(
                T.cast(A[i, j], "float32") * scale[i],
                dtype,
            )
    return scale_kernel

# Compile and use
kernel = tilelang.compile(make_elementwise_scale_kernel(1024, 4096))
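
Recent TileLang releases let you call the compiled object directly on CUDA torch tensors, though the calling convention can vary across versions; treat this as a sketch:

python
import torch

# Invoke on torch tensors (calling convention may differ by TileLang version)
A = torch.randn(1024, 4096, device="cuda", dtype=torch.float16)
scale = torch.rand(1024, device="cuda", dtype=torch.float32)
B = torch.empty_like(A)
kernel(A, scale, B)  # writes the scaled values into B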

Architecture-Specific Notes

  • SM90 (Hopper: H100/H200): Full support, primary target
  • SM100 (Blackwell: B100/B200): Full support
  • Kernels use hardware-specific features (tensor memory accelerator, async copy, warp-specialized pipelines) and will NOT run on older GPUs (Ampere/Ada); see the guard sketch below
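
A minimal runtime guard before dispatching to TileKernels, using only standard torch calls (SM90 and SM100 report compute capability majors 9 and 10):

python
import torch

# Fail fast on unsupported GPUs (Ampere/Ada report major < 9)
major, _ = torch.cuda.get_device_capability()
if major < 9:
    raise RuntimeError("TileKernels requires an SM90/SM100 GPU (H100/H200/B100/B200)")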

Troubleshooting

CUDA Architecture Mismatch

RuntimeError: CUDA error: no kernel image is available for execution on the device
→ You need SM90 or SM100. Check with:
python -c "import torch; print(torch.cuda.get_device_capability())"

TileLang Version Mismatch

ImportError: cannot import name 'xyz' from 'tilelang'
→ Ensure TileLang >= 0.1.9:
pip install "tilelang>=0.1.9"

CUDA Toolkit Version

error: identifier "__nv_fp8_e4m3" is undefined
→ Requires CUDA 13.1+. Check:
nvcc --version

Out of Shared Memory

→ Kernels are tuned for specific tile sizes. If you hit shared memory limits, reduce batch size or sequence length, or file an issue.

Running Tests Without Benchmark Flag

bash
# Benchmarks are opt-in to avoid slow CI
pytest tests/ -n 4                  # Fast correctness only
pytest tests/ -n 4 --run-benchmark  # Include performance numbers

Citation

bibtex
@misc{tilekernels,
      title={TileKernels},
      author={Xiangwen Wang and Chenhao Xu and Huanqi Cao and Rui Tian and Weilin Zhao and Kuai Yu and Chenggang Zhao},
      year={2026},
      publisher={GitHub},
      howpublished={\url{https://github.com/deepseek-ai/TileKernels}},
}