CUDA Kernels for Diffusers & Transformers
This skill provides patterns and guidance for developing optimized CUDA kernels targeting NVIDIA GPUs (H100, A100, T4) for use with HuggingFace diffusers and transformers libraries.
Quick Start
Diffusers (Video/Image Generation)
For benchmarking kernel performance:
```bash
# Benchmark with optimized kernels (6% end-to-end speedup)
python generate_video.py --use-optimized-kernels

# Benchmark baseline with torch.compile (34% speedup)
python generate_video.py --no-optimized-kernels --compile

# Compare configurations (note: --compile and --use-optimized-kernels are mutually exclusive)
python generate_video.py --use-optimized-kernels && \
python generate_video.py --no-optimized-kernels --compile
```
For a minimal diffusers integration example (~150 lines):
```bash
python scripts/ltx_kernel_injection_example.py
```
Transformers (LLMs)
For a minimal transformers integration example (~120 lines):
```bash
python scripts/transformers_injection_example.py
```
HuggingFace Kernels Hub
Load pre-compiled kernels from HuggingFace Hub (no local compilation):
```python
import torch
from kernels import get_kernel

# Load optimized activation kernels
activation = get_kernel("kernels-community/activation", version=1)

# Use the kernel
x = torch.randn((4, 4), dtype=torch.float16, device="cuda")
y = torch.empty_like(x)
activation.gelu_fast(y, x)
```
For a complete HuggingFace Kernels example:
```bash
python scripts/huggingface_kernels_example.py
```
Isolated Kernel Micro-benchmarks
```bash
python benchmark_rmsnorm.py
```
Supported Libraries & Models
| Library | Supported Models | Key Kernels |
|---|---|---|
| diffusers | LTX-Video, Stable Diffusion, FLUX, DiT | RMSNorm, GEGLU, RoPE, AdaLN |
| transformers | LLaMA, Mistral, Qwen, Falcon | RMSNorm, Attention |

| GPU | Compute Capability | Guide |
|---|---|---|
| H100 | sm_90 | h100-optimization-guide.md |
| A100 | sm_80 | a100-optimization-guide.md |
| T4 | sm_75 | t4-optimization-guide.md |
When This Skill Applies
Use this skill when:
- Benchmarking kernel performance against baseline implementations
- Writing new CUDA kernels for diffusion models or LLMs
- Optimizing existing kernels for H100, A100, or T4 architecture
- Implementing custom attention, normalization, or activation layers
- Integrating kernels with diffusers pipelines (LTX-Video, Stable Diffusion, FLUX, DiT)
- Integrating kernels with transformers models (LLaMA, Mistral, Qwen)
- Debugging kernel performance issues on NVIDIA GPUs
Working Example
A complete working example is available at `examples/ltx_video/`. This demonstrates:
- Custom CUDA kernels (RMSNorm, RoPE 3D, GEGLU, AdaLN)
- Build system setup with setup.py, build.toml, and flake.nix
- PyTorch C++ bindings and Python API
- Benchmarking script for comparing optimized vs baseline performance
Benchmarking Kernels
Use the benchmark script to measure kernel performance:
```bash
# Full benchmark showing all options (note: --use-optimized-kernels and --compile
# are mutually exclusive in practice; pick one)
python scripts/benchmark_example.py \
    --use-optimized-kernels \
    --compile \
    --batch-size 1 \
    --num-frames 161 \
    --height 512 \
    --width 768 \
    --steps 50 \
    --warmup-iterations 2
```
Benchmark Script Options
| Option | Default | Description |
|---|---|---|
| `--use-optimized-kernels` | auto | Use custom H100 CUDA kernels |
| `--no-optimized-kernels` | - | Use baseline implementation |
| `--compile` | false | Enable torch.compile on transformer |
| `--batch-size` | 1 | Number of videos per prompt |
| `--num-frames` | 161 | Number of frames to generate |
| `--height` | 512 | Video height in pixels |
| `--width` | 768 | Video width in pixels |
| `--steps` | 50 | Denoising steps |
| `--warmup-iterations` | 2 | Warmup runs before benchmark |
Example Benchmark Results
End-to-End Video Generation (49 frames, 30 steps, H100 80GB):
| Configuration | Time (s) | it/s | Speedup | Notes |
|---|---|---|---|---|
| Baseline (no compile) | 2.87 | 12.58 | 1.00x | Reference |
| Optimized Kernels | 2.70 | 13.52 | 1.06x | 6% faster |
| Baseline + torch.compile | 2.14 | 19.05 | 1.34x | 34% faster |
Important: `--use-optimized-kernels` and `--compile` are currently mutually exclusive. Custom kernels require PyTorch custom op registration to work with torch.compile.
Key metrics to capture:
- Device: GPU model (e.g., NVIDIA H100 80GB HBM3)
- Precision: Data type used (e.g., bfloat16)
- Resolution: Width x Height (e.g., 768x512)
- Frames: Number of frames generated (e.g., 49, 161)
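To keep runs comparable, it helps to record these fields alongside each result. A minimal sketch (the `benchmark_metadata` helper is hypothetical, not part of the scripts above):
```python
import torch

def benchmark_metadata(dtype=torch.bfloat16, width=768, height=512, num_frames=49):
    """Collect the run metadata listed above for a benchmark report."""
    return {
        "device": torch.cuda.get_device_name(0),        # e.g. "NVIDIA H100 80GB HBM3"
        "precision": str(dtype).replace("torch.", ""),   # e.g. "bfloat16"
        "resolution": f"{width}x{height}",
        "frames": num_frames,
    }
```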
RMSNorm Micro-benchmarks
The vectorized RMSNorm kernel achieves 2.67x average speedup over PyTorch baseline:
| Shape | Custom (ms) | PyTorch (ms) | Speedup |
|---|---|---|---|
| [1×1024×2048] | 0.019 | 0.065 | 3.37x |
| [2×1024×2048] | 0.024 | 0.073 | 3.04x |
| [4×1024×2048] | 0.036 | 0.093 | 2.58x |
| [2×4096×3072] | 0.087 | 0.208 | 2.41x |
| [4×4096×3072] | 0.157 | 0.392 | 2.49x |
Bandwidth efficiency: 38% of H100's theoretical 3.35 TB/s
Why end-to-end speedup is smaller: RMSNorm accounts for ~5% of total compute in LTX-Video. The remaining time is spent in attention (Flash Attention/SDPA), linear projections, and VAE decode.
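A quick Amdahl's-law estimate makes this concrete (a sketch; the ~5% RMSNorm share and the 2.67x kernel speedup come from the figures above, and the remainder of the 6% end-to-end gain comes from the other patched kernels):
```python
def end_to_end_speedup(fraction, kernel_speedup):
    """Amdahl's law: only `fraction` of total runtime benefits from the faster kernel."""
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)

# RMSNorm is ~5% of LTX-Video compute and the custom kernel is ~2.67x faster,
# so RMSNorm alone yields only about 1.03x end to end.
print(end_to_end_speedup(0.05, 2.67))  # ~1.03
```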
Project Structure
```
.claude/skills/cuda-kernels/
├── scripts/
│   ├── benchmark_example.py                # End-to-end video generation benchmark
│   ├── benchmark_rmsnorm.py                # Isolated RMSNorm micro-benchmark
│   ├── ltx_kernel_injection_example.py     # Minimal diffusers integration (~150 lines)
│   ├── transformers_injection_example.py   # Minimal transformers integration (~120 lines)
│   └── huggingface_kernels_example.py      # HuggingFace Kernels Hub integration
├── references/
│   ├── diffusers-integration.md            # Complete diffusers integration guide
│   ├── transformers-integration.md         # Complete transformers integration guide
│   ├── huggingface-kernels-integration.md  # HuggingFace Kernels Hub (get_kernel) guide
│   ├── troubleshooting.md                  # Common issues and solutions
│   ├── kernel-templates.md                 # CUDA kernel templates (includes vectorized)
│   ├── h100-optimization-guide.md          # H100 (Hopper) optimization deep dive
│   ├── a100-optimization-guide.md          # A100 (Ampere) optimization deep dive
│   └── t4-optimization-guide.md            # T4 (Turing) optimization deep dive
└── SKILL.md                                # This file

examples/ltx_video/                         # Complete working example
├── kernel_src/
│   └── rmsnorm.cu                          # Vectorized RMSNorm kernel (2.67x faster)
├── torch-ext/                              # PyTorch bindings
├── generate_video.py                       # Full benchmark script
├── benchmark_rmsnorm.py                    # Isolated kernel benchmark
└── setup.py                                # pip install -e .
```
GPU Architecture Reference
H100 (Hopper) - Primary Target
| Spec | Value | Optimization Impact |
|---|---|---|
| SMs | 132 | Grid sizing: aim for multiples of 132 |
| Threads/SM | 2048 | Max 16 blocks of 128 threads per SM |
| Shared Memory | 192 KB/SM | Large tiles possible |
| L2 Cache | 50 MB | Reuse across blocks |
| Memory BW | 3.35 TB/s | Coalesced access critical |
| Warp Size | 32 | All reductions use warp shuffles |
Quick Comparison (H100 vs A100 vs T4)
| Spec | H100 | A100 | T4 |
|---|---|---|---|
| SMs | 132 | 108 | 40 |
| Memory BW | 3.35 TB/s | 2.0 TB/s | 320 GB/s |
| Shared Mem/SM | 192 KB | 164 KB | 64 KB |
| BF16 Support | Yes | Yes | No (FP16 only) |
| Compute Cap | sm_90 | sm_80 | sm_75 |
See detailed guides: H100 | A100 | T4
Core Kernel Patterns
Vectorized Memory Access (Critical for Performance)
BFloat16 vectorization using `__nv_bfloat162`:
```cuda
// Load 2 bfloat16 elements at once (32-bit load)
const __nv_bfloat162* vec_input = reinterpret_cast<const __nv_bfloat162*>(row_input);

#pragma unroll 4
for (int i = tid; i < vec_hidden; i += stride) {
    __nv_bfloat162 v = vec_input[i];
    float v0 = __bfloat162float(v.x);
    float v1 = __bfloat162float(v.y);
    sum_sq += v0 * v0 + v1 * v1;
}
```
FP16 vectorization using `__half2`:
```cuda
const __half2* vec_input = reinterpret_cast<const __half2*>(row_input);
__half2 v = vec_input[i];
float v0 = __half2float(v.x);
float v1 = __half2float(v.y);
```
FP32 vectorization using `float4`:
```cuda
const float4* vec_input = reinterpret_cast<const float4*>(row_input);
float4 v = vec_input[i];
sum_sq += v.x * v.x + v.y * v.y + v.z * v.z + v.w * v.w;
```
Warp Shuffle Reductions
```cuda
template <typename T>
__device__ __forceinline__ T warp_reduce_sum(T val) {
    #pragma unroll
    for (int offset = 16; offset > 0; offset >>= 1) {
        val += __shfl_xor_sync(0xffffffff, val, offset);
    }
    return val;
}
```
Block Sizes for Attention
Thread Configuration
For element-wise ops (RoPE, GEGLU):
```cuda
constexpr int BLOCK_SIZE = 256;
int num_blocks = (total_elements + BLOCK_SIZE - 1) / BLOCK_SIZE;
```
For reduction ops (LayerNorm, RMSNorm) with vectorization:
```cuda
// Divide by 2 for bf16/fp16 vectorized access
int threads = min(hidden_size / 2, MAX_THREADS);
threads = max(threads, WARP_SIZE);
threads = (threads + 32 - 1) / 32 * 32;  // Round to warp boundary
```
Supported Data Types
All kernels support three precision modes:
- `torch.float16` (FP16) - Default for inference
- `torch.bfloat16` (BF16) - Preferred for training
- `torch.float32` (FP32) - Reference/debugging
Building Kernels
With Nix (Recommended)
```bash
nix run .#build-and-copy --max-jobs 2 --cores 8 -L
```
With pip/uv
From `examples/ltx_video/`, install the extension in editable mode with `pip install -e .` (or `uv pip install -e .`).
build.toml Configuration
```toml
[general]
name = "ltx_kernels"
backends = ["cuda"]

[kernel.your_kernel]
backend = "cuda"
src = ["kernel_src/your_kernel.cu"]
cuda-capabilities = ["9.0"]
```
Library Integration
HuggingFace Kernels Hub (get_kernel)
See huggingface-kernels-integration.md for the complete guide.
Load pre-compiled, optimized kernels directly from HuggingFace Hub without local compilation:
```python
import torch
from kernels import get_kernel, has_kernel

# Check availability and load
if has_kernel("kernels-community/activation"):
    activation = get_kernel("kernels-community/activation", version=1)

    # Use the kernel
    x = torch.randn((4, 4), dtype=torch.float16, device="cuda")
    y = torch.empty_like(x)
    activation.gelu_fast(y, x)
```
Key functions:
- `get_kernel(repo_id, version=None)` - Download and load a kernel from the Hub
- `has_kernel(repo_id)` - Check whether a compatible build exists for the current environment
- Loading a kernel from a local directory (for development) is also supported; see huggingface-kernels-integration.md
Popular community kernels:
- `kernels-community/activation` - GELU, SiLU, etc.
- `kernels-community/flash-attn` - Flash Attention 2
- `kernels-community/triton-layer-norm` - LayerNorm, RMSNorm
Diffusers Integration (Video/Image Generation)
See diffusers-integration.md for the complete guide.
Transformers Integration (LLMs)
See transformers-integration.md for the complete guide.
Key differences from diffusers:
- Transformers RMSNorm modules always have weights (no `elementwise_affine=False`)
- Match by class name (`'RMSNorm' in type(module).__name__`) to catch LlamaRMSNorm, MistralRMSNorm, etc.
- Check for `variance_epsilon` (LLaMA) or `eps` (others) for the epsilon value
- There is no attention-processor injection pattern as in diffusers - use Flash Attention 2 instead
Minimal transformers pattern:
```python
import torch
from transformers import AutoModelForCausalLM
from ltx_kernels import rmsnorm

def patch_rmsnorm(model):
    for name, module in model.named_modules():
        if 'RMSNorm' in type(module).__name__:
            eps = getattr(module, 'variance_epsilon', None) or getattr(module, 'eps', 1e-6)
            def make_forward(mod, epsilon):
                def forward(x):
                    return rmsnorm(x, mod.weight, eps=epsilon)
                return forward
            module.forward = make_forward(module, eps)

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", torch_dtype=torch.bfloat16)
patch_rmsnorm(model)
```
Diffusers Critical Pitfalls
1. RMSNorm Weight May Be None
LTX-Video uses `elementwise_affine=False` for some RMSNorm modules:
```python
# Transformer blocks: NO WEIGHT
self.norm1 = RMSNorm(dim, elementwise_affine=False)

# Attention modules: HAS WEIGHT
self.norm_q = torch.nn.RMSNorm(..., elementwise_affine=True)
```
Solution: Handle both cases:
```python
has_weight = hasattr(module, 'weight') and module.weight is not None
if has_weight:
    output = rmsnorm(x, module.weight, eps=eps)
else:
    weight = torch.ones(x.shape[-1], device=x.device, dtype=x.dtype)
    output = rmsnorm(x, weight, eps=eps)
```
2. Diffusers RMSNorm != torch.nn.RMSNorm
```python
# WRONG - misses diffusers RMSNorm
if isinstance(module, torch.nn.RMSNorm):
    ...

# CORRECT - catches all RMSNorm variants
if type(module).__name__ == 'RMSNorm':
    ...
```
3. LTX-Video Uses GELU, Not GEGLU
LTX-Video uses `activation_fn="gelu-approximate"`. Don't patch GEGLU for LTX-Video.
4. Inject Kernels BEFORE CPU Offloading
```python
pipe = LTXPipeline.from_pretrained(...)
pipe.to("cuda")
inject_optimized_kernels(pipe)   # BEFORE offloading
pipe.enable_model_cpu_offload()  # Now safe
```
Minimal Integration Pattern
```python
import torch
from diffusers import LTXPipeline
from ltx_kernels import rmsnorm

def patch_rmsnorm_modules(model):
    """Patch all RMSNorm modules to use custom kernel."""
    for name, module in model.named_modules():
        if type(module).__name__ == 'RMSNorm':
            eps = getattr(module, 'eps', 1e-6)
            has_weight = hasattr(module, 'weight') and module.weight is not None
            if has_weight:
                def make_forward(mod, epsilon):
                    def forward(x):
                        return rmsnorm(x, mod.weight, eps=epsilon)
                    return forward
                module.forward = make_forward(module, eps)
            else:
                def make_forward(epsilon):
                    def forward(x):
                        w = torch.ones(x.shape[-1], device=x.device, dtype=x.dtype)
                        return rmsnorm(x, w, eps=epsilon)
                    return forward
                module.forward = make_forward(eps)

# Usage
pipe = LTXPipeline.from_pretrained("Lightricks/LTX-Video", torch_dtype=torch.bfloat16)
pipe.to("cuda")
patch_rmsnorm_modules(pipe.transformer)
pipe.enable_model_cpu_offload()
```
Kernel-Specific Guidelines
RMSNorm
- Input layout: rows of `hidden_size` elements; normalization is over the last dimension
- Epsilon default: 1e-6
- Weight may be None if `elementwise_affine=False`
- Vectorization: use `__nv_bfloat162` for BF16, `__half2` for FP16, `float4` for FP32
- Performance: 2.67x faster than the PyTorch baseline with the vectorized implementation
- Bandwidth: achieves ~38% of H100's 3.35 TB/s theoretical bandwidth
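For correctness checks against the custom kernel, a plain PyTorch reference that mirrors these conventions can be useful (a sketch, not the CUDA implementation; accumulation in fp32 mirrors the float math in the vectorized kernel above):
```python
import torch

def rmsnorm_reference(x, weight=None, eps=1e-6):
    """Reference RMSNorm: accumulate in fp32, normalize over the last dim, then rescale."""
    x32 = x.float()
    out = x32 * torch.rsqrt(x32.pow(2).mean(dim=-1, keepdim=True) + eps)
    out = out.to(x.dtype)
    if weight is not None:  # weight is None when elementwise_affine=False
        out = out * weight
    return out
```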
RoPE
- 1D: `[batch, seq, heads, head_dim]` for text
- 3D: `[batch, t*h*w, heads, head_dim]` for video
- LTX-Video computes its own RoPE internally
GEGLU vs GELU
- GEGLU: input of width `2*dim` is split into value and gate halves; output has width `dim`
- GELU: standard activation applied elementwise
- LTX-Video uses GELU, NOT GEGLU
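The difference is easiest to see in plain PyTorch (reference sketches, not the custom kernels; `gelu-approximate` corresponds to the tanh approximation):
```python
import torch.nn.functional as F

def geglu_reference(hidden):
    """GEGLU: split the projected input into value and gate halves, gate with GELU."""
    value, gate = hidden.chunk(2, dim=-1)  # width 2*dim in -> width dim out
    return value * F.gelu(gate)

def gelu_reference(hidden):
    """Plain approximate GELU - what LTX-Video's feed-forward actually uses."""
    return F.gelu(hidden, approximate="tanh")
```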
AdaLN
- Formula: `norm(x) * weight * (1 + scale) + shift`
- Used in DiT blocks for conditioning
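Read as PyTorch, the formula looks like the following (a sketch assuming `norm()` is an unweighted LayerNorm as in DiT-style blocks; `shift` and `scale` come from the conditioning embedding):
```python
import torch.nn.functional as F

def adaln_reference(x, weight, shift, scale, eps=1e-6):
    """AdaLN modulation: norm(x) * weight * (1 + scale) + shift."""
    normed = F.layer_norm(x, x.shape[-1:], eps=eps)  # no affine params inside the norm
    return normed * weight * (1.0 + scale) + shift
```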
Performance Profiling
```bash
# NVIDIA Nsight Systems
nsys profile -o profile python your_script.py

# NVIDIA Nsight Compute
ncu --set full -o metrics python your_script.py
```
Common Issues
See troubleshooting.md for all common issues and solutions.
Quick fixes:
- "NoneType has no attribute contiguous": RMSNorm weight is None, create ones
- isinstance() not matching: Use instead
- GEGLU not called: Model uses GELU, not GEGLU
- Patching doesn't persist: Inject before
enable_model_cpu_offload()
- torch.compile fails with custom kernels: See below
torch.compile Compatibility
Custom CUDA kernels and `torch.compile` are currently mutually exclusive unless you register the kernel as a PyTorch custom op.
Error message: `torch._dynamo.exc.Unsupported: Attempted to call function marked as skipped`
Workaround options:
- Use custom kernels without `torch.compile` (6% speedup)
- Use `torch.compile` without custom kernels (34% speedup)
- Register the kernel as a custom op (advanced, requires `torch.library.custom_op`)
To register as custom op (for torch.compile compatibility):
```python
import torch

@torch.library.custom_op("ltx_kernels::rmsnorm", mutates_args={"out"})
def rmsnorm(out: torch.Tensor, input: torch.Tensor, weight: torch.Tensor, eps: float) -> None:
    ops.rmsnorm_forward(out, input.contiguous(), weight.contiguous(), eps)

@rmsnorm.register_fake
def _(out, input, weight, eps):
    pass  # No shape changes
```
See Also
Scripts
- benchmark_example.py - Benchmarking script for comparing optimized vs baseline - START HERE
- ltx_kernel_injection_example.py - Minimal diffusers integration (~150 lines)
- transformers_injection_example.py - Minimal transformers/LLM integration (~120 lines)
- huggingface_kernels_example.py - HuggingFace Kernels Hub integration
Integration Guides
- huggingface-kernels-integration.md - HuggingFace Kernels Hub (get_kernel) - load pre-compiled kernels
- diffusers-integration.md - Complete diffusers pipeline integration
- transformers-integration.md - Complete transformers/LLM integration
GPU Optimization Guides
- h100-optimization-guide.md - H100 (Hopper, sm_90) deep dive
- a100-optimization-guide.md - A100 (Ampere, sm_80) deep dive
- t4-optimization-guide.md - T4 (Turing, sm_75) deep dive
Reference
- troubleshooting.md - Common issues and solutions
- kernel-templates.md - Complete kernel templates
- examples/ltx_video/ - Full LTX-Video example directory
External Resources