TileKernels GPU Kernel Library

Skill by ara.so — Daily 2026 Skills collection.
TileKernels is a high-performance GPU kernel library for LLM operations (MoE routing, FP8/FP4 quantization, transpose, engram gating, Manifold HyperConnection) written in TileLang — a Python DSL for expressing GPU kernels with automatic optimization. Kernels target NVIDIA SM90/SM100 (Hopper/Blackwell) architectures and approach hardware performance limits.

Requirements

  • Python 3.10+
  • PyTorch 2.10+
  • TileLang 0.1.9+
  • NVIDIA SM90 or SM100 GPU (H100/H200/B100/B200)
  • CUDA Toolkit 13.1+

Installation

Development install (recommended for extending/modifying kernels)

pip install -e ".[dev]"

Release install

pip install tile-kernels

Project Structure

tile_kernels/
├── moe/        # MoE routing: top-k selection, token-to-expert mapping, weight normalization
├── quant/      # FP8/FP4/E5M6 quantization with fused SwiGLU ops
├── transpose/  # Batched matrix transpose
├── engram/     # Engram gating: fused RMSNorm, forward/backward, weight gradient reduction
├── mhc/        # Manifold HyperConnection: Sinkhorn normalization, mix split/apply
├── modeling/   # High-level torch.autograd.Function wrappers
├── torch/      # PyTorch reference implementations for validation
└── testing/    # Test and benchmark utilities

Key Modules and Usage

MoE Routing Kernels

python
import torch
from tile_kernels.moe import (
    topk_gating,           # Top-k expert selection and scoring
    token_to_expert_map,   # Token-to-expert mapping
    fused_expand_reduce,   # Fused expansion/reduction
    weight_normalize,      # Weight normalization
)

# Top-k gating: select top-k experts per token
# logits: [num_tokens, num_experts]
logits = torch.randn(1024, 256, device="cuda", dtype=torch.float32)
topk_weights, topk_indices = topk_gating(logits, top_k=8)
# topk_weights: [num_tokens, top_k], topk_indices: [num_tokens, top_k]
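
The remaining imports follow the same pattern. A minimal sketch of the routing helpers: the token_to_expert_map signature matches its use in the Common Patterns section below, while the weight_normalize call is an assumption about its interface, not confirmed API.

python
# Build the token-to-expert routing map (signature as used in Common Patterns below)
routing_map = token_to_expert_map(topk_indices, num_experts=256)

# Renormalize the selected expert weights per token
# (assumed interface; check the module docstrings)
topk_weights = weight_normalize(topk_weights)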

Quantization Kernels

python
import torch
from tile_kernels.quant import (
    per_token_cast_fp8,      # Per-token FP8 quantization
    per_block_cast_fp8,      # Per-block FP8 quantization
    per_channel_cast_fp8,    # Per-channel FP8 quantization
    fused_swiglu_quant_fp8,  # Fused SwiGLU + FP8 quantization
)

# Per-token FP8 quantization
x = torch.randn(1024, 4096, device="cuda", dtype=torch.bfloat16)
x_fp8, scale = per_token_cast_fp8(x)
# x_fp8:  [1024, 4096] in torch.float8_e4m3fn
# scale:  [1024, 1] per-token scales

# Per-block FP8 quantization (common for weight quantization)
w = torch.randn(8192, 4096, device="cuda", dtype=torch.bfloat16)
w_fp8, scale = per_block_cast_fp8(w, block_size=128)

# Fused SwiGLU + FP8 cast (saves memory bandwidth)
gate = torch.randn(1024, 8192, device="cuda", dtype=torch.bfloat16)
up = torch.randn(1024, 8192, device="cuda", dtype=torch.bfloat16)
out_fp8, scale = fused_swiglu_quant_fp8(gate, up)
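
per_channel_cast_fp8 is imported above but not shown. A minimal sketch, assuming it mirrors the per-token variant with one scale per channel (the exact signature and scale shape are assumptions):

python
# Per-channel FP8 quantization (assumed to mirror per_token_cast_fp8)
w = torch.randn(8192, 4096, device="cuda", dtype=torch.bfloat16)
w_fp8, scale = per_channel_cast_fp8(w)
# expected: w_fp8 has w's shape in torch.float8_e4m3fn, scale holds per-channel factors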

Transpose Kernels

python
import torch
from tile_kernels.transpose import batched_transpose

# Batched transpose for MoE weight manipulation
# x: [batch, M, N]
x = torch.randn(32, 1024, 4096, device="cuda", dtype=torch.bfloat16)
x_T = batched_transpose(x)
# x_T: [batch, N, M] = [32, 4096, 1024]
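
Since a transpose is exact, the kernel output can be checked bit-for-bit against eager PyTorch (torch.Tensor.transpose is standard; only batched_transpose comes from the library):

python
# Sanity check against eager PyTorch
assert torch.equal(x_T, x.transpose(1, 2).contiguous())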

Engram Gating Kernels

python
import torch
from tile_kernels.engram import (
    engram_gate_forward,    # Forward pass with fused RMSNorm
    engram_gate_backward,   # Backward pass
    engram_weight_grad,     # Weight gradient reduction
)

# Forward pass
hidden = torch.randn(1024, 2048, device="cuda", dtype=torch.bfloat16)
weight = torch.randn(256, 2048, device="cuda", dtype=torch.bfloat16)
output, norm_hidden = engram_gate_forward(hidden, weight)

Manifold HyperConnection (mHC) Kernels

python
import torch
from tile_kernels.mhc import (
    sinkhorn_normalize,     # Sinkhorn normalization
    mhc_mix_split,          # Mix splitting
    mhc_mix_apply,          # Mix application
)

# Sinkhorn normalization for connection weights
conn_weights = torch.randn(8, 64, device="cuda", dtype=torch.float32)
normalized = sinkhorn_normalize(conn_weights, num_iters=20)
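
For intuition, a Sinkhorn iteration alternately rescales rows and columns toward uniform marginals. The torch sketch below is a generic Sinkhorn-Knopp reference, not the library's exact algorithm (for example, it exponentiates first to obtain a positive matrix):

python
import torch

# Generic Sinkhorn-Knopp iteration (illustrative reference only)
def sinkhorn_ref(logits: torch.Tensor, num_iters: int = 20) -> torch.Tensor:
    m = torch.exp(logits)                    # ensure positive entries
    for _ in range(num_iters):
        m = m / m.sum(dim=-1, keepdim=True)  # normalize rows
        m = m / m.sum(dim=-2, keepdim=True)  # normalize columns
    return m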

High-Level Modeling Layers

python
import torch
from tile_kernels.modeling import EngramGateLayer, MHCPipeline

# EngramGate as a trainable nn.Module-compatible layer
# (uses torch.autograd.Function internally)
gate_layer = EngramGateLayer(
    hidden_size=2048,
    num_experts=256,
).cuda()
hidden_states = torch.randn(1024, 2048, device="cuda", dtype=torch.bfloat16)
gate_output = gate_layer(hidden_states)
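
Because the layer is nn.Module-compatible, gradients flow through the fused kernels like any PyTorch layer. A minimal sketch (the summed loss is a placeholder for illustration, not a recommended objective):

python
# Backward pass works as with any nn.Module
loss = gate_output.float().sum()   # placeholder loss
loss.backward()
assert all(p.grad is not None for p in gate_layer.parameters())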

# Manifold HyperConnection pipeline
mhc = MHCPipeline(
    num_connections=8,
    hidden_size=2048,
).cuda()

Testing

bash
# Test a single module (correctness only, 4 parallel workers)
pytest tests/transpose/test_transpose.py -n 4

# Test with benchmarking
pytest tests/transpose/test_transpose.py --run-benchmark

# Test MoE kernels
pytest tests/moe/ -n 4

# Test quantization kernels
pytest tests/quant/ -n 4

# Test engram kernels
pytest tests/engram/ -n 4

# Full pressure test (all tests, 2 repetitions, 4 workers)
TK_FULL_TEST=1 pytest -n 4 --count 2

# Test a specific quantization variant
pytest tests/quant/test_fp8_cast.py -n 4 --run-benchmark

Common Patterns

Pattern: Fused MoE Forward Pass

python
import torch
from tile_kernels.moe import topk_gating, token_to_expert_map
from tile_kernels.quant import per_token_cast_fp8

def moe_dispatch(hidden_states, gate_weight, top_k=8):
    """Full MoE dispatch using TileKernels."""
    # 1. Compute gating logits
    logits = torch.mm(hidden_states, gate_weight.T)  # [T, E]
    
    # 2. Top-k expert selection
    topk_weights, topk_indices = topk_gating(logits, top_k=top_k)
    
    # 3. Build token-to-expert routing map
    routing_map = token_to_expert_map(topk_indices, num_experts=gate_weight.shape[0])
    
    # 4. Quantize activations before expert computation
    hidden_fp8, scale = per_token_cast_fp8(hidden_states)
    
    return hidden_fp8, scale, topk_weights, routing_map
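
Example invocation with random inputs; shapes follow the earlier examples (1024 tokens, hidden size 2048, 256 experts), and dtypes match the per-token quantization example:

python
hidden_states = torch.randn(1024, 2048, device="cuda", dtype=torch.bfloat16)
gate_weight = torch.randn(256, 2048, device="cuda", dtype=torch.bfloat16)  # [E, H]
hidden_fp8, scale, topk_weights, routing_map = moe_dispatch(hidden_states, gate_weight)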

Pattern: Using PyTorch Reference Implementations for Validation

python
import torch
from tile_kernels.quant import per_token_cast_fp8
from tile_kernels.torch import per_token_cast_fp8 as per_token_cast_fp8_ref

# Compare kernel output vs PyTorch reference
x = torch.randn(512, 4096, device="cuda", dtype=torch.bfloat16)
out_kernel, scale_kernel = per_token_cast_fp8(x)
out_ref, scale_ref = per_token_cast_fp8_ref(x)

# Validate
torch.testing.assert_close(
    out_kernel.float(), out_ref.float(), atol=1e-2, rtol=1e-2
)
print("Kernel matches reference ✓")

Pattern: Benchmarking a Kernel

python
import torch
from tile_kernels.testing import benchmark_kernel
from tile_kernels.transpose import batched_transpose

x = torch.randn(64, 4096, 4096, device="cuda", dtype=torch.bfloat16)

# Using the testing utility
result = benchmark_kernel(
    fn=batched_transpose,
    args=(x,),
    warmup=25,
    rep=100,
)
print(f"Latency: {result.mean:.3f} ms, Bandwidth: {result.gbps:.1f} GB/s")
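
For a quick measurement without the testing utility, standard CUDA-event timing also works (plain PyTorch, nothing library-specific):

python
# Manual timing with CUDA events (generic PyTorch pattern)
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
for _ in range(25):                  # warmup
    batched_transpose(x)
torch.cuda.synchronize()
start.record()
for _ in range(100):
    batched_transpose(x)
end.record()
torch.cuda.synchronize()
print(f"Latency: {start.elapsed_time(end) / 100:.3f} ms")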

Pattern: Custom TileLang Kernel (extending the library)

python
# tile_kernels follow TileLang DSL patterns
import tilelang
import tilelang.language as T

def make_elementwise_scale_kernel(M, N, dtype="float16"):
    @T.prim_func
    def scale_kernel(
        A: T.Buffer((M, N), dtype),
        scale: T.Buffer((M,), "float32"),
        B: T.Buffer((M, N), dtype),
    ):
        # TileLang kernel body
        for i, j in T.grid(M, N):
            B[i, j] = T.cast(
                T.cast(A[i, j], "float32") * scale[i],
                dtype,
            )
    return scale_kernel

# Compile and use
kernel = tilelang.compile(make_elementwise_scale_kernel(1024, 4096))
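
Recent TileLang releases let you call the compiled object directly on CUDA torch tensors, though the calling convention can vary across versions; treat this as a sketch:

python
import torch

# Invoke on torch tensors (calling convention may differ by TileLang version)
A = torch.randn(1024, 4096, device="cuda", dtype=torch.float16)
scale = torch.rand(1024, device="cuda", dtype=torch.float32)
B = torch.empty_like(A)
kernel(A, scale, B)  # writes the scaled values into B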

Architecture-Specific Notes

  • SM90 (Hopper: H100/H200): Full support, primary target
  • SM100 (Blackwell: B100/B200): Full support
  • Kernels use hardware-specific features (tensor memory accelerator, async copy, warp-specialized pipelines) and will NOT run on older GPUs (Ampere/Ada); see the guard sketch below
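
A minimal runtime guard before dispatching to TileKernels, using only standard torch calls (SM90 and SM100 report compute capability majors 9 and 10):

python
import torch

# Fail fast on unsupported GPUs (Ampere/Ada report major < 9)
major, _ = torch.cuda.get_device_capability()
if major < 9:
    raise RuntimeError("TileKernels requires an SM90/SM100 GPU (H100/H200/B100/B200)")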

Troubleshooting

CUDA Architecture Mismatch

RuntimeError: CUDA error: no kernel image is available for execution on the device
→ You need SM90 or SM100. Check with:
python -c "import torch; print(torch.cuda.get_device_capability())"

TileLang Version Mismatch

ImportError: cannot import name 'xyz' from 'tilelang'
→ Ensure TileLang >= 0.1.9:
pip install "tilelang>=0.1.9"

CUDA Toolkit Version

error: identifier "__nv_fp8_e4m3" is undefined
→ Requires CUDA 13.1+. Check:
nvcc --version

Out of Shared Memory

→ Kernels are tuned for specific tile sizes. If you hit shared memory limits, reduce batch size or sequence length, or file an issue.

Running Tests Without Benchmark Flag

bash
# Benchmarks are opt-in to avoid slow CI
pytest tests/ -n 4                  # Fast correctness only
pytest tests/ -n 4 --run-benchmark  # Include performance numbers

Citation

bibtex
@misc{tilekernels,
      title={TileKernels},
      author={Xiangwen Wang and Chenhao Xu and Huanqi Cao and Rui Tian and Weilin Zhao and Kuai Yu and Chenggang Zhao},
      year={2026},
      publisher={GitHub},
      howpublished={\url{https://github.com/deepseek-ai/TileKernels}},
}