tilekernels-gpu-kernels
TileKernels GPU Kernel Library

Skill by ara.so — Daily 2026 Skills collection.
TileKernels is a high-performance GPU kernel library for LLM operations (MoE routing, FP8/FP4 quantization, transpose, engram gating, Manifold HyperConnection) written in TileLang — a Python DSL for expressing GPU kernels with automatic optimization. Kernels target NVIDIA SM90/SM100 (Hopper/Blackwell) architectures and approach hardware performance limits.
Requirements
- Python 3.10+
- PyTorch 2.10+
- TileLang 0.1.9+
- NVIDIA SM90 or SM100 GPU (H100/H200/B100/B200)
- CUDA Toolkit 13.1+
Installation
```bash
# Development install (recommended for extending/modifying kernels)
pip install -e ".[dev]"

# Release install
pip install tile-kernels
```

Project Structure
```
tile_kernels/
├── moe/        # MoE routing: top-k selection, token-to-expert mapping, weight normalization
├── quant/      # FP8/FP4/E5M6 quantization with fused SwiGLU ops
├── transpose/  # Batched matrix transpose
├── engram/     # Engram gating: fused RMSNorm, forward/backward, weight gradient reduction
├── mhc/        # Manifold HyperConnection: Sinkhorn normalization, mix split/apply
├── modeling/   # High-level torch.autograd.Function wrappers
├── torch/      # PyTorch reference implementations for validation
└── testing/    # Test and benchmark utilities
```

Key Modules and Usage
MoE Routing Kernels
```python
import torch
from tile_kernels.moe import (
    topk_gating,          # Top-k expert selection and scoring
    token_to_expert_map,  # Token-to-expert mapping
    fused_expand_reduce,  # Fused expansion/reduction
    weight_normalize,     # Weight normalization
)

# Top-k gating: select top-k experts per token
# logits: [num_tokens, num_experts]
logits = torch.randn(1024, 256, device="cuda", dtype=torch.float32)
topk_weights, topk_indices = topk_gating(logits, top_k=8)
# topk_weights: [num_tokens, top_k], topk_indices: [num_tokens, top_k]
```
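Continuing from the block above, the routing map and normalized weights feed the dispatch step. The `token_to_expert_map` call mirrors the "Fused MoE Forward Pass" pattern below; the `weight_normalize` signature is an assumption shown for illustration only:

```python
# Build the token-to-expert routing map (signature as used in
# the "Fused MoE Forward Pass" pattern below)
routing_map = token_to_expert_map(topk_indices, num_experts=256)

# Renormalize the selected top-k weights per token
# (assumed signature; check tile_kernels.moe for the actual API)
topk_weights = weight_normalize(topk_weights)
```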
Quantization Kernels
```python
import torch
from tile_kernels.quant import (
    per_token_cast_fp8,      # Per-token FP8 quantization
    per_block_cast_fp8,      # Per-block FP8 quantization
    per_channel_cast_fp8,    # Per-channel FP8 quantization
    fused_swiglu_quant_fp8,  # Fused SwiGLU + FP8 quantization
)

# Per-token FP8 quantization
x = torch.randn(1024, 4096, device="cuda", dtype=torch.bfloat16)
x_fp8, scale = per_token_cast_fp8(x)
# x_fp8: [1024, 4096] in torch.float8_e4m3fn
# scale: [1024, 1] per-token scales

# Per-block FP8 quantization (common for weight quantization)
w = torch.randn(8192, 4096, device="cuda", dtype=torch.bfloat16)
w_fp8, scale = per_block_cast_fp8(w, block_size=128)

# Fused SwiGLU + FP8 cast (saves memory bandwidth)
gate = torch.randn(1024, 8192, device="cuda", dtype=torch.bfloat16)
up = torch.randn(1024, 8192, device="cuda", dtype=torch.bfloat16)
out_fp8, scale = fused_swiglu_quant_fp8(gate, up)
```
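For intuition, here is a minimal PyTorch sketch of what per-token casting computes: one scale per row, chosen so the row maximum maps to the FP8 E4M3 maximum (448). The library ships its own references in tile_kernels.torch; the helper name and clamp epsilon here are assumptions:

```python
import torch

def per_token_cast_fp8_torch(x: torch.Tensor):
    """Reference semantics of per-token FP8 casting (sketch only)."""
    # One scale per row so the row's absolute max maps to the E4M3 max (448)
    amax = x.abs().amax(dim=-1, keepdim=True).float().clamp(min=1e-4)
    scale = amax / 448.0
    x_fp8 = (x.float() / scale).to(torch.float8_e4m3fn)
    return x_fp8, scale  # x_fp8: [T, D], scale: [T, 1]
```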
Transpose Kernels
```python
import torch
from tile_kernels.transpose import batched_transpose

# Batched transpose for MoE weight manipulation
# x: [batch, M, N]
x = torch.randn(32, 1024, 4096, device="cuda", dtype=torch.bfloat16)
x_T = batched_transpose(x)
# x_T: [batch, N, M] = [32, 4096, 1024]
```
Engram Gating Kernels
```python
import torch
from tile_kernels.engram import (
    engram_gate_forward,   # Forward pass with fused RMSNorm
    engram_gate_backward,  # Backward pass
    engram_weight_grad,    # Weight gradient reduction
)

# Forward pass
hidden = torch.randn(1024, 2048, device="cuda", dtype=torch.bfloat16)
weight = torch.randn(256, 2048, device="cuda", dtype=torch.bfloat16)
output, norm_hidden = engram_gate_forward(hidden, weight)
```
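These three kernels are meant to be composed inside an autograd wrapper, which tile_kernels.modeling already provides (see High-Level Modeling Layers below). A hedged sketch of the wiring, continuing from the imports above; all backward argument lists are assumptions:

```python
import torch

class EngramGateFn(torch.autograd.Function):
    """Illustrative wiring only; prefer tile_kernels.modeling.EngramGateLayer."""

    @staticmethod
    def forward(ctx, hidden, weight):
        output, norm_hidden = engram_gate_forward(hidden, weight)
        ctx.save_for_backward(norm_hidden, weight)
        return output

    @staticmethod
    def backward(ctx, grad_output):
        norm_hidden, weight = ctx.saved_tensors
        # Assumed signatures; consult tile_kernels.engram for the real ones
        grad_hidden = engram_gate_backward(grad_output, norm_hidden, weight)
        grad_weight = engram_weight_grad(grad_output, norm_hidden)
        return grad_hidden, grad_weight
```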
Manifold HyperConnection (mHC) Kernels
```python
import torch
from tile_kernels.mhc import (
    sinkhorn_normalize,  # Sinkhorn normalization
    mhc_mix_split,       # Mix splitting
    mhc_mix_apply,       # Mix application
)

# Sinkhorn normalization for connection weights
conn_weights = torch.randn(8, 64, device="cuda", dtype=torch.float32)
normalized = sinkhorn_normalize(conn_weights, num_iters=20)
```
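For intuition: Sinkhorn normalization alternates row and column renormalization, and for a square positive matrix it converges toward a doubly stochastic matrix. A minimal PyTorch sketch of the classic iteration (whether the kernel exponentiates its input first is an assumption; the PyTorch references in tile_kernels.torch are the ground truth):

```python
import torch

def sinkhorn_normalize_torch(logits: torch.Tensor, num_iters: int = 20) -> torch.Tensor:
    # Work on a strictly positive matrix, then alternate row/column normalization
    p = logits.exp()
    for _ in range(num_iters):
        p = p / p.sum(dim=-1, keepdim=True)  # normalize rows
        p = p / p.sum(dim=-2, keepdim=True)  # normalize columns
    return p
```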
High-Level Modeling Layers
```python
import torch
from tile_kernels.modeling import EngramGateLayer, MHCPipeline

# EngramGate as a trainable nn.Module-compatible layer
# Uses torch.autograd.Function internally
gate_layer = EngramGateLayer(
    hidden_size=2048,
    num_experts=256,
).cuda()
hidden_states = torch.randn(1024, 2048, device="cuda", dtype=torch.bfloat16)
gate_output = gate_layer(hidden_states)

# Manifold HyperConnection pipeline
mhc = MHCPipeline(
    num_connections=8,
    hidden_size=2048,
).cuda()
```
Testing
```bash
# Test a single module (correctness only, 4 parallel workers)
pytest tests/transpose/test_transpose.py -n 4

# Test with benchmarking
pytest tests/transpose/test_transpose.py --run-benchmark

# Test MoE kernels
pytest tests/moe/ -n 4

# Test quantization kernels
pytest tests/quant/ -n 4

# Test engram kernels
pytest tests/engram/ -n 4

# Full pressure test (all tests, 2 repetitions, 4 workers)
TK_FULL_TEST=1 pytest -n 4 --count 2

# Test specific quantization variant
pytest tests/quant/test_fp8_cast.py -n 4 --run-benchmark
```
Common Patterns
Pattern: Fused MoE Forward Pass
```python
import torch
from tile_kernels.moe import topk_gating, token_to_expert_map
from tile_kernels.quant import per_token_cast_fp8

def moe_dispatch(hidden_states, gate_weight, top_k=8):
    """Full MoE dispatch using TileKernels."""
    # 1. Compute gating logits
    logits = torch.mm(hidden_states, gate_weight.T)  # [T, E]
    # 2. Top-k expert selection
    topk_weights, topk_indices = topk_gating(logits, top_k=top_k)
    # 3. Build token-to-expert routing map
    routing_map = token_to_expert_map(topk_indices, num_experts=gate_weight.shape[0])
    # 4. Quantize activations before expert computation
    hidden_fp8, scale = per_token_cast_fp8(hidden_states)
    return hidden_fp8, scale, topk_weights, routing_map
```
Pattern: Using PyTorch Reference Implementations for Validation
```python
import torch
from tile_kernels.quant import per_token_cast_fp8
from tile_kernels.torch import per_token_cast_fp8 as per_token_cast_fp8_ref

# Compare kernel output vs PyTorch reference
x = torch.randn(512, 4096, device="cuda", dtype=torch.bfloat16)
out_kernel, scale_kernel = per_token_cast_fp8(x)
out_ref, scale_ref = per_token_cast_fp8_ref(x)

# Validate
torch.testing.assert_close(
    out_kernel.float(), out_ref.float(), atol=1e-2, rtol=1e-2
)
print("Kernel matches reference ✓")
```
Pattern: Benchmarking a Kernel
```python
import torch
from tile_kernels.testing import benchmark_kernel
from tile_kernels.transpose import batched_transpose

x = torch.randn(64, 4096, 4096, device="cuda", dtype=torch.bfloat16)

# Using the testing utility
result = benchmark_kernel(
    fn=batched_transpose,
    args=(x,),
    warmup=25,
    rep=100,
)
print(f"Latency: {result.mean:.3f} ms, Bandwidth: {result.gbps:.1f} GB/s")
```
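If you prefer not to use the utility, standard CUDA-event timing gives comparable numbers. A sketch; the 2x factor in the bandwidth estimate assumes one read plus one write of the tensor:

```python
import torch
from tile_kernels.transpose import batched_transpose

x = torch.randn(64, 4096, 4096, device="cuda", dtype=torch.bfloat16)

# Warmup so compilation and caches don't skew the measurement
for _ in range(25):
    batched_transpose(x)

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for _ in range(100):
    batched_transpose(x)
end.record()
torch.cuda.synchronize()

ms = start.elapsed_time(end) / 100
gbps = 2 * x.numel() * x.element_size() / (ms * 1e-3) / 1e9  # read + write
print(f"Latency: {ms:.3f} ms, Bandwidth: {gbps:.1f} GB/s")
```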
Pattern: Custom TileLang Kernel (extending the library)
```python
# tile_kernels kernels follow TileLang DSL patterns
import tilelang
import tilelang.language as T

def make_elementwise_scale_kernel(M, N, dtype="float16"):
    @T.prim_func
    def scale_kernel(
        A: T.Buffer((M, N), dtype),
        scale: T.Buffer((M,), "float32"),
        B: T.Buffer((M, N), dtype),
    ):
        # TileLang kernel body
        for i, j in T.grid(M, N):
            B[i, j] = T.cast(
                T.cast(A[i, j], "float32") * scale[i],
                dtype,
            )
    return scale_kernel

# Compile and use
kernel = tilelang.compile(make_elementwise_scale_kernel(1024, 4096))
```
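A hedged sketch of invoking the compiled kernel from PyTorch; the calling convention (all buffers passed in signature order, output written in place) is an assumption that may vary by TileLang version:

```python
import torch

A = torch.randn(1024, 4096, device="cuda", dtype=torch.float16)
scale = torch.rand(1024, device="cuda", dtype=torch.float32)
B = torch.empty_like(A)

kernel(A, scale, B)  # scale_kernel writes the row-scaled result into B
```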
Architecture-Specific Notes
- SM90 (Hopper: H100/H200): Full support, primary target
- SM100 (Blackwell: B100/B200): Full support
- Kernels rely on hardware-specific features (tensor memory accelerator, async copy, warp-specialized pipelines) and will NOT run on older GPUs (Ampere/Ada); see the guard sketch below
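For code that might land on mixed fleets, a minimal runtime guard (using the same capability check as the Troubleshooting section below) fails fast instead of hitting an opaque kernel-launch error. The helper name is hypothetical:

```python
import torch

def tilekernels_supported() -> bool:
    # SM90 (Hopper) reports compute capability 9.x; SM100 (Blackwell) reports 10.x
    major, _ = torch.cuda.get_device_capability()
    return major in (9, 10)

if not tilekernels_supported():
    raise RuntimeError(
        "TileKernels requires an SM90/SM100 GPU (H100/H200/B100/B200)"
    )
```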
Troubleshooting
CUDA Architecture Mismatch
`RuntimeError: CUDA error: no kernel image is available for execution on the device` → You need SM90 or SM100. Check with:

```bash
python -c "import torch; print(torch.cuda.get_device_capability())"
```

TileLang Version Mismatch
`ImportError: cannot import name 'xyz' from 'tilelang'` → Ensure TileLang >= 0.1.9:

```bash
pip install "tilelang>=0.1.9"
```

CUDA Toolkit Version
`error: identifier "__nv_fp8_e4m3" is undefined` → Requires CUDA 13.1+. Check:

```bash
nvcc --version
```

Out of Shared Memory
→ Kernels are tuned for specific tile sizes. If you hit shared memory limits, reduce batch size or sequence length, or file an issue.
Running Tests Without Benchmark Flag

```bash
# Benchmarks are opt-in to avoid slow CI
pytest tests/ -n 4                  # Fast correctness only
pytest tests/ -n 4 --run-benchmark  # Include performance numbers
```
Citation
```bibtex
@misc{tilekernels,
  title        = {TileKernels},
  author       = {Xiangwen Wang and Chenhao Xu and Huanqi Cao and Rui Tian and Weilin Zhao and Kuai Yu and Chenggang Zhao},
  year         = {2026},
  publisher    = {GitHub},
  howpublished = {\url{https://github.com/deepseek-ai/TileKernels}},
}
```