cuda-kernels


CUDA Kernels for Diffusers & Transformers


This skill provides patterns and guidance for developing optimized CUDA kernels targeting NVIDIA GPUs (H100, A100, T4) for use with HuggingFace diffusers and transformers libraries.

Quick Start


Diffusers (Video/Image Generation)


For benchmarking kernel performance:

```bash
# Benchmark with optimized kernels (6% end-to-end speedup)
python generate_video.py --use-optimized-kernels

# Benchmark baseline with torch.compile (34% speedup)
python generate_video.py --no-optimized-kernels --compile

# Compare configurations (note: --compile and --use-optimized-kernels are mutually exclusive)
python generate_video.py --use-optimized-kernels &&
python generate_video.py --no-optimized-kernels --compile
```

**For a minimal diffusers integration example (~150 lines):**

```bash
python scripts/ltx_kernel_injection_example.py
```

Transformers (LLMs)


**For a minimal transformers integration example (~120 lines):**

```bash
python scripts/transformers_injection_example.py
```

HuggingFace Kernels Hub


Load pre-compiled kernels from HuggingFace Hub (no local compilation):

```python
from kernels import get_kernel

# Load optimized activation kernels
activation = get_kernel("kernels-community/activation", version=1)

# Use the kernel
y = torch.empty_like(x)
activation.gelu_fast(y, x)
```

**For a complete HuggingFace Kernels example:**

```bash
python scripts/huggingface_kernels_example.py
```

Isolated Kernel Micro-benchmarks

```bash
python benchmark_rmsnorm.py
```

Supported Libraries & Models


| Library | Supported Models | Key Kernels |
|---------|------------------|-------------|
| diffusers | LTX-Video, Stable Diffusion, FLUX, DiT | RMSNorm, GEGLU, RoPE, AdaLN |
| transformers | LLaMA, Mistral, Qwen, Falcon | RMSNorm, Attention |

| GPU | Compute Capability | Guide |
|-----|--------------------|-------|
| H100 | sm_90 | h100-optimization-guide.md |
| A100 | sm_80 | a100-optimization-guide.md |
| T4 | sm_75 | t4-optimization-guide.md |

When This Skill Applies


Use this skill when:
  • Benchmarking kernel performance against baseline implementations
  • Writing new CUDA kernels for diffusion models or LLMs
  • Optimizing existing kernels for the H100, A100, or T4 architectures
  • Implementing custom attention, normalization, or activation layers
  • Integrating kernels with diffusers pipelines (LTX-Video, Stable Diffusion, FLUX, DiT)
  • Integrating kernels with transformers models (LLaMA, Mistral, Qwen)
  • Debugging kernel performance issues on NVIDIA GPUs

Working Example


A complete working example is available at `examples/ltx_video/`. This demonstrates:
  • Custom CUDA kernels (RMSNorm, RoPE 3D, GEGLU, AdaLN)
  • Build system setup with setup.py, build.toml, and flake.nix
  • PyTorch C++ bindings and Python API
  • Benchmarking script for comparing optimized vs baseline performance

Benchmarking Kernels


Use the benchmark script to measure kernel performance:

```bash
# Full benchmark with all options
python scripts/benchmark_example.py \
    --use-optimized-kernels \
    --compile \
    --batch-size 1 \
    --num-frames 161 \
    --height 512 \
    --width 768 \
    --steps 50 \
    --warmup-iterations 2
```

Benchmark Script Options


| Option | Default | Description |
|--------|---------|-------------|
| `--use-optimized-kernels` | auto | Use custom H100 CUDA kernels |
| `--no-optimized-kernels` | - | Use baseline implementation |
| `--compile` | false | Enable torch.compile on transformer |
| `--batch-size` | 1 | Number of videos per prompt |
| `--num-frames` | 161 | Number of frames to generate |
| `--height` | 512 | Video height in pixels |
| `--width` | 768 | Video width in pixels |
| `--steps` | 50 | Denoising steps |
| `--warmup-iterations` | 2 | Warmup runs before benchmark |

Example Benchmark Results


End-to-End Video Generation (49 frames, 30 steps, H100 80GB):

| Configuration | Time (s) | it/s | Speedup | Notes |
|---------------|----------|------|---------|-------|
| Baseline (no compile) | 2.87 | 12.58 | 1.00x | Reference |
| Optimized Kernels | 2.70 | 13.52 | 1.06x | 6% faster |
| Baseline + torch.compile | 2.14 | 19.05 | 1.34x | 34% faster |

**Important:** `--use-optimized-kernels` and `--compile` are currently mutually exclusive. Custom kernels require PyTorch custom op registration to work with torch.compile.

Key metrics to capture:
  • Device: GPU model (e.g., NVIDIA H100 80GB HBM3)
  • Precision: Data type used (e.g., bfloat16)
  • Resolution: Width x Height (e.g., 768x512)
  • Frames: Number of frames generated (e.g., 49, 161)

RMSNorm Micro-benchmarks


The vectorized RMSNorm kernel achieves a 2.67x average speedup over the PyTorch baseline:

| Shape | Custom (ms) | PyTorch (ms) | Speedup |
|-------|-------------|--------------|---------|
| [1×1024×2048] | 0.019 | 0.065 | 3.37x |
| [2×1024×2048] | 0.024 | 0.073 | 3.04x |
| [4×1024×2048] | 0.036 | 0.093 | 2.58x |
| [2×4096×3072] | 0.087 | 0.208 | 2.41x |
| [4×4096×3072] | 0.157 | 0.392 | 2.49x |

Bandwidth efficiency: 38% of H100's theoretical 3.35 TB/s.

Why end-to-end speedup is smaller: RMSNorm accounts for ~5% of total compute in LTX-Video. The remaining time is spent in attention (Flash Attention/SDPA), linear projections, and VAE decode.
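The 38% figure can be reproduced from the table with a back-of-envelope check. A pure-Python sketch for the [4×4096×3072] row, assuming bf16 (2 bytes/element) and one read plus one write per element:

```python
# Back-of-envelope bandwidth check for the [4x4096x3072] row of the table,
# assuming bf16 (2 bytes/element) and one read + one write per element.
elements = 4 * 4096 * 3072
bytes_moved = elements * 2 * 2          # (read + write) x 2 bytes
time_s = 0.157e-3                       # custom-kernel time from the table
achieved = bytes_moved / time_s         # achieved bytes/second
fraction = achieved / 3.35e12           # vs. H100's theoretical 3.35 TB/s
print(f"{achieved / 1e12:.2f} TB/s ({fraction:.0%} of peak)")  # 1.28 TB/s (38% of peak)
```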

Project Structure


.claude/skills/cuda-kernels/
├── scripts/
│   ├── benchmark_example.py              # End-to-end video generation benchmark
│   ├── benchmark_rmsnorm.py              # Isolated RMSNorm micro-benchmark
│   ├── ltx_kernel_injection_example.py   # Minimal diffusers integration (~150 lines)
│   ├── transformers_injection_example.py # Minimal transformers integration (~120 lines)
│   └── huggingface_kernels_example.py    # HuggingFace Kernels Hub integration
├── references/
│   ├── diffusers-integration.md          # Complete diffusers integration guide
│   ├── transformers-integration.md       # Complete transformers integration guide
│   ├── huggingface-kernels-integration.md # HuggingFace Kernels Hub (get_kernel) guide
│   ├── troubleshooting.md                # Common issues and solutions
│   ├── kernel-templates.md               # CUDA kernel templates (includes vectorized)
│   ├── h100-optimization-guide.md        # H100 (Hopper) optimization deep dive
│   ├── a100-optimization-guide.md        # A100 (Ampere) optimization deep dive
│   └── t4-optimization-guide.md          # T4 (Turing) optimization deep dive
└── SKILL.md                              # This file

examples/ltx_video/                  # Complete working example
├── kernel_src/
│   └── rmsnorm.cu                  # Vectorized RMSNorm kernel (2.67x faster)
├── torch-ext/                      # PyTorch bindings
├── generate_video.py               # Full benchmark script
├── benchmark_rmsnorm.py            # Isolated kernel benchmark
└── setup.py                        # pip install -e .

GPU Architecture Reference


H100 (Hopper) - Primary Target


| Spec | Value | Optimization Impact |
|------|-------|---------------------|
| SMs | 132 | Grid sizing: aim for multiples of 132 |
| Threads/SM | 2048 | Max 16 blocks of 128 threads per SM |
| Shared Memory | 192 KB/SM | Large tiles possible |
| L2 Cache | 50 MB | Reuse across blocks |
| Memory BW | 3.35 TB/s | Coalesced access critical |
| Warp Size | 32 | All reductions use warp shuffles |

Quick Comparison (H100 vs A100 vs T4)


| Spec | H100 | A100 | T4 |
|------|------|------|----|
| SMs | 132 | 108 | 40 |
| Memory BW | 3.35 TB/s | 2.0 TB/s | 320 GB/s |
| Shared Mem/SM | 192 KB | 164 KB | 64 KB |
| BF16 Support | Yes | Yes | No (FP16 only) |
| Compute Cap | sm_90 | sm_80 | sm_75 |

See detailed guides: H100 | A100 | T4

Core Kernel Patterns


Vectorized Memory Access (Critical for Performance)


BFloat16 vectorization using `__nv_bfloat162`:

```cuda
// Load 2 bfloat16 elements at once (32-bit load)
const __nv_bfloat162* vec_input = reinterpret_cast<const __nv_bfloat162*>(row_input);

#pragma unroll 4
for (int i = tid; i < vec_hidden; i += stride) {
    __nv_bfloat162 v = vec_input[i];
    float v0 = __bfloat162float(v.x);
    float v1 = __bfloat162float(v.y);
    sum_sq += v0 * v0 + v1 * v1;
}
```

FP16 vectorization using `__half2`:

```cuda
const __half2* vec_input = reinterpret_cast<const __half2*>(row_input);
__half2 v = vec_input[i];
float v0 = __half2float(v.x);
float v1 = __half2float(v.y);
```

FP32 vectorization using `float4`:

```cuda
const float4* vec_input = reinterpret_cast<const float4*>(row_input);
float4 v = vec_input[i];
sum_sq += v.x * v.x + v.y * v.y + v.z * v.z + v.w * v.w;
```

Warp Shuffle Reductions


```cuda
template <typename T>
__device__ __forceinline__ T warp_reduce_sum(T val) {
    #pragma unroll
    for (int offset = 16; offset > 0; offset >>= 1) {
        val += __shfl_xor_sync(0xffffffff, val, offset);
    }
    return val;
}
```
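To see why the loop yields a full warp sum, the xor butterfly can be simulated on the host. A pure-Python sketch of the five shuffle steps across 32 lanes:

```python
# Host-side simulation of warp_reduce_sum: at each of the five butterfly
# steps (offsets 16, 8, 4, 2, 1), lane i reads the value from lane i ^ offset.
def simulate_warp_reduce_sum(lanes):
    assert len(lanes) == 32              # one warp
    vals = list(lanes)
    for offset in (16, 8, 4, 2, 1):
        vals = [vals[i] + vals[i ^ offset] for i in range(32)]
    return vals                          # every lane now holds the full sum

result = simulate_warp_reduce_sum(list(range(32)))
print(result[0], result[31])  # 496 496
```

After five steps every lane holds the same total, which is why the kernel can read the result from any lane without a broadcast.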

Block Sizes for Attention


  • `BLOCK_SIZE_M = 128`, `BLOCK_SIZE_N = 64`, `BLOCK_SIZE_K = 64`
  • `NUM_WARPS = 8`

Thread Configuration


For element-wise ops (RoPE, GEGLU):

```cuda
constexpr int BLOCK_SIZE = 256;
int num_blocks = (total_elements + BLOCK_SIZE - 1) / BLOCK_SIZE;
```

For reduction ops (LayerNorm, RMSNorm) with vectorization:

```cuda
// Divide by 2 for bf16/fp16 vectorized access
int threads = min(hidden_size / 2, MAX_THREADS);
threads = max(threads, WARP_SIZE);
threads = (threads + 32 - 1) / 32 * 32;  // Round to warp boundary
```
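The reduction-op launch math can be sanity-checked in plain Python. A sketch mirroring the host code above (`MAX_THREADS = 1024` is an assumed per-block limit, typical for recent NVIDIA GPUs):

```python
# Mirror of the reduction-op thread-count logic above.
MAX_THREADS = 1024  # assumed max threads per block
WARP_SIZE = 32

def reduction_thread_count(hidden_size):
    threads = min(hidden_size // 2, MAX_THREADS)   # /2 for bf16/fp16 vector loads
    threads = max(threads, WARP_SIZE)              # at least one warp
    return (threads + WARP_SIZE - 1) // WARP_SIZE * WARP_SIZE  # warp-aligned

print(reduction_thread_count(2048))  # 1024
print(reduction_thread_count(100))   # 64
```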

Supported Data Types


All kernels support three precision modes:
  • `__half` (FP16) - Default for inference
  • `__nv_bfloat16` (BF16) - Preferred for training
  • `float` (FP32) - Reference/debugging

Building Kernels


With Nix (Recommended)


```bash
nix run .#build-and-copy --max-jobs 2 --cores 8 -L
```

With pip/uv


```bash
uv pip install -e .
```

build.toml Configuration


```toml
[general]
name = "ltx_kernels"
backends = ["cuda"]

[kernel.your_kernel]
backend = "cuda"
src = ["kernel_src/your_kernel.cu"]
cuda-capabilities = ["9.0"]
```

Library Integration


HuggingFace Kernels Hub (get_kernel)


See huggingface-kernels-integration.md for the complete guide.

Load pre-compiled, optimized kernels directly from HuggingFace Hub without local compilation:

```python
import torch
from kernels import get_kernel, has_kernel

# Check availability and load
if has_kernel("kernels-community/activation"):
    activation = get_kernel("kernels-community/activation", version=1)

# Use the kernel
x = torch.randn((4, 4), dtype=torch.float16, device="cuda")
y = torch.empty_like(x)
activation.gelu_fast(y, x)
```

**Key functions:**
- `get_kernel(repo_id, version=None)` - Download and load kernel from Hub
- `has_kernel(repo_id)` - Check if compatible build exists
- `get_local_kernel(path)` - Load from local directory (development)

**Popular community kernels:**
- `kernels-community/activation` - GELU, SiLU, etc.
- `kernels-community/flash-attn` - Flash Attention 2
- `kernels-community/triton-layer-norm` - LayerNorm, RMSNorm

Diffusers Integration (Video/Image Generation)


See diffusers-integration.md for the complete guide.

Transformers Integration (LLMs)


See transformers-integration.md for the complete guide.

Key differences from diffusers:
  • Transformers RMSNorm always has weights (no `elementwise_affine=False`)
  • Use `'RMSNorm' in class_name` to match LlamaRMSNorm, MistralRMSNorm, etc.
  • Check for `variance_epsilon` (LLaMA) or `eps` (others) for epsilon
  • No `set_processor()` pattern - use Flash Attention 2 instead

Minimal transformers pattern:

```python
import torch
from transformers import AutoModelForCausalLM
from ltx_kernels import rmsnorm

def patch_rmsnorm(model):
    for name, module in model.named_modules():
        if 'RMSNorm' in type(module).__name__:
            eps = getattr(module, 'variance_epsilon', None) or getattr(module, 'eps', 1e-6)
            def make_forward(mod, epsilon):
                def forward(x):
                    return rmsnorm(x, mod.weight, eps=epsilon)
                return forward
            module.forward = make_forward(module, eps)

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", torch_dtype=torch.bfloat16)
patch_rmsnorm(model)
```

Diffusers Critical Pitfalls


1. RMSNorm Weight May Be None


LTX-Video uses `elementwise_affine=False` for some RMSNorm modules:

```python
# Transformer blocks: NO WEIGHT
self.norm1 = RMSNorm(dim, elementwise_affine=False)

# Attention modules: HAS WEIGHT
self.norm_q = torch.nn.RMSNorm(..., elementwise_affine=True)
```

**Solution:** Handle both cases:

```python
has_weight = hasattr(module, 'weight') and module.weight is not None
if has_weight:
    output = rmsnorm(x, module.weight, eps=eps)
else:
    weight = torch.ones(x.shape[-1], device=x.device, dtype=x.dtype)
    output = rmsnorm(x, weight, eps=eps)
```

2. Diffusers RMSNorm != torch.nn.RMSNorm


```python
# WRONG - misses diffusers RMSNorm
if isinstance(module, torch.nn.RMSNorm):
    ...

# CORRECT - catches all RMSNorm variants
if type(module).__name__ == 'RMSNorm':
    ...
```

3. LTX-Video Uses GELU, Not GEGLU


LTX-Video uses `activation_fn="gelu-approximate"`. Don't patch GEGLU for LTX-Video.

4. Inject Kernels BEFORE CPU Offloading


```python
pipe = LTXPipeline.from_pretrained(...)
pipe.to("cuda")
inject_optimized_kernels(pipe)  # BEFORE offloading
pipe.enable_model_cpu_offload()  # Now safe
```

Minimal Integration Pattern


```python
import torch
from diffusers import LTXPipeline
from ltx_kernels import rmsnorm

def patch_rmsnorm_modules(model):
    """Patch all RMSNorm modules to use custom kernel."""
    for name, module in model.named_modules():
        if type(module).__name__ == 'RMSNorm':
            eps = getattr(module, 'eps', 1e-6)
            has_weight = hasattr(module, 'weight') and module.weight is not None

            if has_weight:
                def make_forward(mod, epsilon):
                    def forward(x):
                        return rmsnorm(x, mod.weight, eps=epsilon)
                    return forward
                module.forward = make_forward(module, eps)
            else:
                def make_forward(epsilon):
                    def forward(x):
                        w = torch.ones(x.shape[-1], device=x.device, dtype=x.dtype)
                        return rmsnorm(x, w, eps=epsilon)
                    return forward
                module.forward = make_forward(eps)
```

Usage


```python
pipe = LTXPipeline.from_pretrained("Lightricks/LTX-Video", torch_dtype=torch.bfloat16)
pipe.to("cuda")
patch_rmsnorm_modules(pipe.transformer)
pipe.enable_model_cpu_offload()
```

Kernel-Specific Guidelines


RMSNorm


  • Input layout: `[..., hidden_size]`
  • Epsilon default: 1e-6
  • Weight may be None if `elementwise_affine=False`
  • Vectorization: Use `__nv_bfloat162` for BF16, `__half2` for FP16, `float4` for FP32
  • Performance: 2.67x faster than PyTorch with vectorized implementation
  • Bandwidth: Achieves ~38% of H100's 3.35 TB/s theoretical bandwidth
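When validating the kernel on small shapes, a scalar reference of the same formula (y_i = x_i · w_i / sqrt(mean(x²) + eps)) is handy. A pure-Python sketch:

```python
import math

def rmsnorm_ref(x, weight, eps=1e-6):
    """Scalar RMSNorm reference over the hidden dimension."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]

out = rmsnorm_ref([1.0, 2.0, 3.0, 4.0], [1.0] * 4)
# mean(x^2) = 7.5, so each element is scaled by 1/sqrt(7.5 + eps)
```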

RoPE


  • 1D: `[batch, seq, heads, head_dim]` - for text
  • 3D: `[batch, t*h*w, heads, head_dim]` - for video
  • LTX-Video computes its own RoPE via `LTXVideoRotaryPosEmbed`
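For reference, RoPE rotates consecutive (even, odd) pairs along head_dim by position-dependent angles θ_i = pos · base^(−2i/d). A pure-Python sketch for a single head vector (base = 10000 is assumed here, the common default):

```python
import math

def rope_1d(x, pos, base=10000.0):
    """Rotate consecutive (even, odd) pairs of one head vector by
    position-dependent angles theta_i = pos * base**(-2i/d)."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)   # frequency decays with pair index
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = x[i], x[i + 1]
        out += [x0 * c - x1 * s, x0 * s + x1 * c]
    return out
```

Position 0 is the identity and rotations preserve the vector norm, which are two cheap invariants for checking a kernel's output.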

GEGLU vs GELU


  • GEGLU: Input `[batch, seq, 2*hidden]` -> Output `[batch, seq, hidden]`
  • GELU: Standard activation
  • LTX-Video uses GELU, NOT GEGLU
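The shape contract follows from GEGLU's definition: the projection output splits in half along the last dim, and one half gates the other through GELU. A pure-Python sketch using the tanh approximation (as in `activation_fn="gelu-approximate"`):

```python
import math

def gelu_approx(v):
    # tanh approximation of GELU, as in activation_fn="gelu-approximate"
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))

def geglu(x):
    """[..., 2*hidden] -> [..., hidden]: first half gated by gelu(second half)."""
    h = len(x) // 2
    hidden, gate = x[:h], x[h:]
    return [a * gelu_approx(g) for a, g in zip(hidden, gate)]

out = geglu([1.0, 2.0, 3.0, 4.0])   # hidden=[1, 2], gate=[3, 4]
print(len(out))  # 2
```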

AdaLN


  • Formula: `norm(x) * weight * (1 + scale) + shift`
  • Used in DiT blocks for conditioning
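A scalar sketch of the formula, taking norm to be a zero-mean/unit-variance LayerNorm over the last dimension (an assumption for illustration; the kernel's actual norm may differ):

```python
import math

def adaln(x, weight, scale, shift, eps=1e-6):
    """norm(x) * weight * (1 + scale) + shift, with norm as a plain LayerNorm."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std * w * (1.0 + scale) + shift
            for v, w in zip(x, weight)]

out = adaln([1.0, 2.0, 3.0, 4.0], [1.0] * 4, scale=0.0, shift=0.0)
# with scale=0, shift=0, and unit weight this reduces to plain LayerNorm
```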

Performance Profiling



NVIDIA Nsight Systems


```bash
nsys profile -o profile python your_script.py
```

NVIDIA Nsight Compute


```bash
ncu --set full -o metrics python your_script.py
```

Common Issues


See troubleshooting.md for all common issues and solutions.

Quick fixes:
  • "NoneType has no attribute contiguous": RMSNorm weight is None, create ones
  • isinstance() not matching: Use `type(module).__name__` instead
  • GEGLU not called: Model uses GELU, not GEGLU
  • Patching doesn't persist: Inject before `enable_model_cpu_offload()`
  • torch.compile fails with custom kernels: See below

torch.compile Compatibility


Custom CUDA kernels and `torch.compile` are mutually exclusive unless you register the kernel as a PyTorch custom op.

Error message:

```
torch._dynamo.exc.Unsupported: Attempted to call function marked as skipped
```

Workaround options:
  1. Use `--use-optimized-kernels` without `--compile` (6% speedup)
  2. Use `--compile` without custom kernels (34% speedup)
  3. Register kernel as custom op (advanced, requires `torch.library`)

To register as custom op (for torch.compile compatibility):

```python
import torch

@torch.library.custom_op("ltx_kernels::rmsnorm", mutates_args={"out"})
def rmsnorm(out: torch.Tensor, input: torch.Tensor, weight: torch.Tensor, eps: float) -> None:
    ops.rmsnorm_forward(out, input.contiguous(), weight.contiguous(), eps)

@rmsnorm.register_fake
def _(out, input, weight, eps):
    pass  # No shape changes
```

See Also


Scripts


  • benchmark_example.py - Benchmarking script for comparing optimized vs baseline - START HERE
  • ltx_kernel_injection_example.py - Minimal diffusers integration (~150 lines)
  • transformers_injection_example.py - Minimal transformers/LLM integration (~120 lines)
  • huggingface_kernels_example.py - HuggingFace Kernels Hub integration

Integration Guides


  • huggingface-kernels-integration.md - HuggingFace Kernels Hub (get_kernel) - load pre-compiled kernels
  • diffusers-integration.md - Complete diffusers pipeline integration
  • transformers-integration.md - Complete transformers/LLM integration

GPU Optimization Guides


  • h100-optimization-guide.md - H100 (Hopper, sm_90) deep dive
  • a100-optimization-guide.md - A100 (Ampere, sm_80) deep dive
  • t4-optimization-guide.md - T4 (Turing, sm_75) deep dive

Reference


  • troubleshooting.md - Common issues and solutions
  • kernel-templates.md - Complete kernel templates
  • examples/ltx_video/ - Full LTX-Video example directory

External Resources
