cuda-kernels


CUDA Kernels for Diffusers & Transformers


This skill provides patterns and guidance for developing optimized CUDA kernels targeting NVIDIA GPUs (H100, A100, T4) for use with HuggingFace diffusers and transformers libraries.

Quick Start


Diffusers (Video/Image Generation)


For benchmarking kernel performance:

```bash
# Benchmark with optimized kernels (6% end-to-end speedup)
python generate_video.py --use-optimized-kernels

# Benchmark baseline with torch.compile (34% speedup)
python generate_video.py --no-optimized-kernels --compile

# Compare configurations (note: --compile and --use-optimized-kernels are mutually exclusive)
python generate_video.py --use-optimized-kernels &&
python generate_video.py --no-optimized-kernels --compile
```

**For a minimal diffusers integration example (~150 lines):**

```bash
python scripts/ltx_kernel_injection_example.py
```

Transformers (LLMs)


**For a minimal transformers integration example (~120 lines):**

```bash
python scripts/transformers_injection_example.py
```

HuggingFace Kernels Hub


Load pre-compiled kernels from HuggingFace Hub (no local compilation):

```python
from kernels import get_kernel

# Load optimized activation kernels
activation = get_kernel("kernels-community/activation", version=1)

# Use the kernel
y = torch.empty_like(x)
activation.gelu_fast(y, x)
```

**For a complete HuggingFace Kernels example:**

```bash
python scripts/huggingface_kernels_example.py
```

Isolated Kernel Micro-benchmarks

```bash
python benchmark_rmsnorm.py
```

Supported Libraries & Models


| Library | Supported Models | Key Kernels |
|---------|------------------|-------------|
| diffusers | LTX-Video, Stable Diffusion, FLUX, DiT | RMSNorm, GEGLU, RoPE, AdaLN |
| transformers | LLaMA, Mistral, Qwen, Falcon | RMSNorm, Attention |

| GPU | Compute Capability | Guide |
|-----|--------------------|-------|
| H100 | sm_90 | h100-optimization-guide.md |
| A100 | sm_80 | a100-optimization-guide.md |
| T4 | sm_75 | t4-optimization-guide.md |

When This Skill Applies


Use this skill when:
  • Benchmarking kernel performance against baseline implementations
  • Writing new CUDA kernels for diffusion models or LLMs
  • Optimizing existing kernels for the H100, A100, or T4 architectures
  • Implementing custom attention, normalization, or activation layers
  • Integrating kernels with diffusers pipelines (LTX-Video, Stable Diffusion, FLUX, DiT)
  • Integrating kernels with transformers models (LLaMA, Mistral, Qwen)
  • Debugging kernel performance issues on NVIDIA GPUs

Working Example


A complete working example is available at `examples/ltx_video/`. This demonstrates:
  • Custom CUDA kernels (RMSNorm, RoPE 3D, GEGLU, AdaLN)
  • Build system setup with setup.py, build.toml, and flake.nix
  • PyTorch C++ bindings and Python API
  • Benchmarking script for comparing optimized vs baseline performance

Benchmarking Kernels


Use the benchmark script to measure kernel performance:

```bash
# Full benchmark with all options
python scripts/benchmark_example.py \
    --use-optimized-kernels \
    --compile \
    --batch-size 1 \
    --num-frames 161 \
    --height 512 \
    --width 768 \
    --steps 50 \
    --warmup-iterations 2
```

Benchmark Script Options


| Option | Default | Description |
|--------|---------|-------------|
| `--use-optimized-kernels` | auto | Use custom H100 CUDA kernels |
| `--no-optimized-kernels` | - | Use baseline implementation |
| `--compile` | false | Enable torch.compile on transformer |
| `--batch-size` | 1 | Number of videos per prompt |
| `--num-frames` | 161 | Number of frames to generate |
| `--height` | 512 | Video height in pixels |
| `--width` | 768 | Video width in pixels |
| `--steps` | 50 | Denoising steps |
| `--warmup-iterations` | 2 | Warmup runs before benchmark |

Example Benchmark Results


End-to-End Video Generation (49 frames, 30 steps, H100 80GB):

| Configuration | Time (s) | it/s | Speedup | Notes |
|---------------|----------|------|---------|-------|
| Baseline (no compile) | 2.87 | 12.58 | 1.00x | Reference |
| Optimized Kernels | 2.70 | 13.52 | 1.06x | 6% faster |
| Baseline + torch.compile | 2.14 | 19.05 | 1.34x | 34% faster |

**Important:** `--use-optimized-kernels` and `--compile` are currently mutually exclusive. Custom kernels require PyTorch custom op registration to work with torch.compile.

Key metrics to capture:
  • Device: GPU model (e.g., NVIDIA H100 80GB HBM3)
  • Precision: Data type used (e.g., bfloat16)
  • Resolution: Width x Height (e.g., 768x512)
  • Frames: Number of frames generated (e.g., 49, 161)

RMSNorm Micro-benchmarks


The vectorized RMSNorm kernel achieves a 2.67x average speedup over the PyTorch baseline:

| Shape | Custom (ms) | PyTorch (ms) | Speedup |
|-------|-------------|--------------|---------|
| [1×1024×2048] | 0.019 | 0.065 | 3.37x |
| [2×1024×2048] | 0.024 | 0.073 | 3.04x |
| [4×1024×2048] | 0.036 | 0.093 | 2.58x |
| [2×4096×3072] | 0.087 | 0.208 | 2.41x |
| [4×4096×3072] | 0.157 | 0.392 | 2.49x |

Bandwidth efficiency: 38% of H100's theoretical 3.35 TB/s.

Why end-to-end speedup is smaller: RMSNorm accounts for ~5% of total compute in LTX-Video. The remaining time is spent in attention (Flash Attention/SDPA), linear projections, and VAE decode.
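The 38% figure can be reproduced from the table with a back-of-envelope check. A pure-Python sketch for the [4×4096×3072] row, assuming bf16 (2 bytes/element) and one read plus one write per element:

```python
# Back-of-envelope bandwidth check for the [4x4096x3072] row of the table,
# assuming bf16 (2 bytes/element) and one read + one write per element.
elements = 4 * 4096 * 3072
bytes_moved = elements * 2 * 2          # (read + write) x 2 bytes
time_s = 0.157e-3                       # custom-kernel time from the table
achieved = bytes_moved / time_s         # achieved bytes/second
fraction = achieved / 3.35e12           # vs. H100's theoretical 3.35 TB/s
print(f"{achieved / 1e12:.2f} TB/s ({fraction:.0%} of peak)")  # 1.28 TB/s (38% of peak)
```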

Project Structure


.claude/skills/cuda-kernels/
├── scripts/
│   ├── benchmark_example.py              # End-to-end video generation benchmark
│   ├── benchmark_rmsnorm.py              # Isolated RMSNorm micro-benchmark
│   ├── ltx_kernel_injection_example.py   # Minimal diffusers integration (~150 lines)
│   ├── transformers_injection_example.py # Minimal transformers integration (~120 lines)
│   └── huggingface_kernels_example.py    # HuggingFace Kernels Hub integration
├── references/
│   ├── diffusers-integration.md          # Complete diffusers integration guide
│   ├── transformers-integration.md       # Complete transformers integration guide
│   ├── huggingface-kernels-integration.md # HuggingFace Kernels Hub (get_kernel) guide
│   ├── troubleshooting.md                # Common issues and solutions
│   ├── kernel-templates.md               # CUDA kernel templates (includes vectorized)
│   ├── h100-optimization-guide.md        # H100 (Hopper) optimization deep dive
│   ├── a100-optimization-guide.md        # A100 (Ampere) optimization deep dive
│   └── t4-optimization-guide.md          # T4 (Turing) optimization deep dive
└── SKILL.md                              # This file

examples/ltx_video/                  # Complete working example
├── kernel_src/
│   └── rmsnorm.cu                  # Vectorized RMSNorm kernel (2.67x faster)
├── torch-ext/                      # PyTorch bindings
├── generate_video.py               # Full benchmark script
├── benchmark_rmsnorm.py            # Isolated kernel benchmark
└── setup.py                        # pip install -e .

GPU Architecture Reference


H100 (Hopper) - Primary Target


| Spec | Value | Optimization Impact |
|------|-------|---------------------|
| SMs | 132 | Grid sizing: aim for multiples of 132 |
| Threads/SM | 2048 | Max 16 blocks of 128 threads per SM |
| Shared Memory | 192 KB/SM | Large tiles possible |
| L2 Cache | 50 MB | Reuse across blocks |
| Memory BW | 3.35 TB/s | Coalesced access critical |
| Warp Size | 32 | All reductions use warp shuffles |

Quick Comparison (H100 vs A100 vs T4)


| Spec | H100 | A100 | T4 |
|------|------|------|----|
| SMs | 132 | 108 | 40 |
| Memory BW | 3.35 TB/s | 2.0 TB/s | 320 GB/s |
| Shared Mem/SM | 192 KB | 164 KB | 64 KB |
| BF16 Support | Yes | Yes | No (FP16 only) |
| Compute Cap | sm_90 | sm_80 | sm_75 |

See detailed guides: H100 | A100 | T4

Core Kernel Patterns


Vectorized Memory Access (Critical for Performance)


BFloat16 vectorization using `__nv_bfloat162`:

```cuda
// Load 2 bfloat16 elements at once (32-bit load)
const __nv_bfloat162* vec_input = reinterpret_cast<const __nv_bfloat162*>(row_input);

#pragma unroll 4
for (int i = tid; i < vec_hidden; i += stride) {
    __nv_bfloat162 v = vec_input[i];
    float v0 = __bfloat162float(v.x);
    float v1 = __bfloat162float(v.y);
    sum_sq += v0 * v0 + v1 * v1;
}
```

FP16 vectorization using `__half2`:

```cuda
const __half2* vec_input = reinterpret_cast<const __half2*>(row_input);
__half2 v = vec_input[i];
float v0 = __half2float(v.x);
float v1 = __half2float(v.y);
```

FP32 vectorization using `float4`:

```cuda
const float4* vec_input = reinterpret_cast<const float4*>(row_input);
float4 v = vec_input[i];
sum_sq += v.x * v.x + v.y * v.y + v.z * v.z + v.w * v.w;
```

Warp Shuffle Reductions


```cuda
template <typename T>
__device__ __forceinline__ T warp_reduce_sum(T val) {
    #pragma unroll
    for (int offset = 16; offset > 0; offset >>= 1) {
        val += __shfl_xor_sync(0xffffffff, val, offset);
    }
    return val;
}
```
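To see why the loop yields a full warp sum, the xor butterfly can be simulated on the host. A pure-Python sketch of the five shuffle steps across 32 lanes:

```python
# Host-side simulation of warp_reduce_sum: at each of the five butterfly
# steps (offsets 16, 8, 4, 2, 1), lane i reads the value from lane i ^ offset.
def simulate_warp_reduce_sum(lanes):
    assert len(lanes) == 32              # one warp
    vals = list(lanes)
    for offset in (16, 8, 4, 2, 1):
        vals = [vals[i] + vals[i ^ offset] for i in range(32)]
    return vals                          # every lane now holds the full sum

result = simulate_warp_reduce_sum(list(range(32)))
print(result[0], result[31])  # 496 496
```

After five steps every lane holds the same total, which is why the kernel can read the result from any lane without a broadcast.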

Block Sizes for Attention


  • `BLOCK_SIZE_M = 128`, `BLOCK_SIZE_N = 64`, `BLOCK_SIZE_K = 64`
  • `NUM_WARPS = 8`

Thread Configuration


For element-wise ops (RoPE, GEGLU):

```cuda
constexpr int BLOCK_SIZE = 256;
int num_blocks = (total_elements + BLOCK_SIZE - 1) / BLOCK_SIZE;
```

For reduction ops (LayerNorm, RMSNorm) with vectorization:

```cuda
// Divide by 2 for bf16/fp16 vectorized access
int threads = min(hidden_size / 2, MAX_THREADS);
threads = max(threads, WARP_SIZE);
threads = (threads + 32 - 1) / 32 * 32;  // Round to warp boundary
```
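The reduction-op launch math can be sanity-checked in plain Python. A sketch mirroring the host code above (`MAX_THREADS = 1024` is an assumed per-block limit, typical for recent NVIDIA GPUs):

```python
# Mirror of the reduction-op thread-count logic above.
MAX_THREADS = 1024  # assumed max threads per block
WARP_SIZE = 32

def reduction_thread_count(hidden_size):
    threads = min(hidden_size // 2, MAX_THREADS)   # /2 for bf16/fp16 vector loads
    threads = max(threads, WARP_SIZE)              # at least one warp
    return (threads + WARP_SIZE - 1) // WARP_SIZE * WARP_SIZE  # warp-aligned

print(reduction_thread_count(2048))  # 1024
print(reduction_thread_count(100))   # 64
```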

Supported Data Types


All kernels support three precision modes:
  • `__half` (FP16) - Default for inference
  • `__nv_bfloat16` (BF16) - Preferred for training
  • `float` (FP32) - Reference/debugging

Building Kernels


With Nix (Recommended)


```bash
nix run .#build-and-copy --max-jobs 2 --cores 8 -L
```

With pip/uv


```bash
uv pip install -e .
```

build.toml Configuration


```toml
[general]
name = "ltx_kernels"
backends = ["cuda"]

[kernel.your_kernel]
backend = "cuda"
src = ["kernel_src/your_kernel.cu"]
cuda-capabilities = ["9.0"]
```

Library Integration


HuggingFace Kernels Hub (get_kernel)


See huggingface-kernels-integration.md for the complete guide.

Load pre-compiled, optimized kernels directly from HuggingFace Hub without local compilation:

```python
import torch
from kernels import get_kernel, has_kernel

# Check availability and load
if has_kernel("kernels-community/activation"):
    activation = get_kernel("kernels-community/activation", version=1)

# Use the kernel
x = torch.randn((4, 4), dtype=torch.float16, device="cuda")
y = torch.empty_like(x)
activation.gelu_fast(y, x)
```

**Key functions:**
- `get_kernel(repo_id, version=None)` - Download and load kernel from Hub
- `has_kernel(repo_id)` - Check if compatible build exists
- `get_local_kernel(path)` - Load from local directory (development)

**Popular community kernels:**
- `kernels-community/activation` - GELU, SiLU, etc.
- `kernels-community/flash-attn` - Flash Attention 2
- `kernels-community/triton-layer-norm` - LayerNorm, RMSNorm

Diffusers Integration (Video/Image Generation)


See diffusers-integration.md for the complete guide.

Transformers Integration (LLMs)


See transformers-integration.md for the complete guide.

Key differences from diffusers:
  • Transformers RMSNorm always has weights (no `elementwise_affine=False`)
  • Use `'RMSNorm' in class_name` to match LlamaRMSNorm, MistralRMSNorm, etc.
  • Check for `variance_epsilon` (LLaMA) or `eps` (others) for epsilon
  • No `set_processor()` pattern - use Flash Attention 2 instead

Minimal transformers pattern:

```python
import torch
from transformers import AutoModelForCausalLM
from ltx_kernels import rmsnorm

def patch_rmsnorm(model):
    for name, module in model.named_modules():
        if 'RMSNorm' in type(module).__name__:
            eps = getattr(module, 'variance_epsilon', None) or getattr(module, 'eps', 1e-6)
            def make_forward(mod, epsilon):
                def forward(x):
                    return rmsnorm(x, mod.weight, eps=epsilon)
                return forward
            module.forward = make_forward(module, eps)

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", torch_dtype=torch.bfloat16)
patch_rmsnorm(model)
```

Diffusers Critical Pitfalls


1. RMSNorm Weight May Be None


LTX-Video uses `elementwise_affine=False` for some RMSNorm modules:

```python
# Transformer blocks: NO WEIGHT
self.norm1 = RMSNorm(dim, elementwise_affine=False)

# Attention modules: HAS WEIGHT
self.norm_q = torch.nn.RMSNorm(..., elementwise_affine=True)
```

**Solution:** Handle both cases:

```python
has_weight = hasattr(module, 'weight') and module.weight is not None
if has_weight:
    output = rmsnorm(x, module.weight, eps=eps)
else:
    weight = torch.ones(x.shape[-1], device=x.device, dtype=x.dtype)
    output = rmsnorm(x, weight, eps=eps)
```

2. Diffusers RMSNorm != torch.nn.RMSNorm


```python
# WRONG - misses diffusers RMSNorm
if isinstance(module, torch.nn.RMSNorm):
    ...

# CORRECT - catches all RMSNorm variants
if type(module).__name__ == 'RMSNorm':
    ...
```

3. LTX-Video Uses GELU, Not GEGLU


LTX-Video uses `activation_fn="gelu-approximate"`. Don't patch GEGLU for LTX-Video.

4. Inject Kernels BEFORE CPU Offloading


```python
pipe = LTXPipeline.from_pretrained(...)
pipe.to("cuda")
inject_optimized_kernels(pipe)  # BEFORE offloading
pipe.enable_model_cpu_offload()  # Now safe
```

Minimal Integration Pattern


```python
import torch
from diffusers import LTXPipeline
from ltx_kernels import rmsnorm

def patch_rmsnorm_modules(model):
    """Patch all RMSNorm modules to use custom kernel."""
    for name, module in model.named_modules():
        if type(module).__name__ == 'RMSNorm':
            eps = getattr(module, 'eps', 1e-6)
            has_weight = hasattr(module, 'weight') and module.weight is not None

            if has_weight:
                def make_forward(mod, epsilon):
                    def forward(x):
                        return rmsnorm(x, mod.weight, eps=epsilon)
                    return forward
                module.forward = make_forward(module, eps)
            else:
                def make_forward(epsilon):
                    def forward(x):
                        w = torch.ones(x.shape[-1], device=x.device, dtype=x.dtype)
                        return rmsnorm(x, w, eps=epsilon)
                    return forward
                module.forward = make_forward(eps)
```

Usage


```python
pipe = LTXPipeline.from_pretrained("Lightricks/LTX-Video", torch_dtype=torch.bfloat16)
pipe.to("cuda")
patch_rmsnorm_modules(pipe.transformer)
pipe.enable_model_cpu_offload()
```

Kernel-Specific Guidelines


RMSNorm


  • Input layout: `[..., hidden_size]`
  • Epsilon default: 1e-6
  • Weight may be None if `elementwise_affine=False`
  • Vectorization: Use `__nv_bfloat162` for BF16, `__half2` for FP16, `float4` for FP32
  • Performance: 2.67x faster than PyTorch with vectorized implementation
  • Bandwidth: Achieves ~38% of H100's 3.35 TB/s theoretical bandwidth
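When validating the kernel on small shapes, a scalar reference of the same formula (y_i = x_i · w_i / sqrt(mean(x²) + eps)) is handy. A pure-Python sketch:

```python
import math

def rmsnorm_ref(x, weight, eps=1e-6):
    """Scalar RMSNorm reference over the hidden dimension."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]

out = rmsnorm_ref([1.0, 2.0, 3.0, 4.0], [1.0] * 4)
# mean(x^2) = 7.5, so each element is scaled by 1/sqrt(7.5 + eps)
```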

RoPE


  • 1D: `[batch, seq, heads, head_dim]` - for text
  • 3D: `[batch, t*h*w, heads, head_dim]` - for video
  • LTX-Video computes its own RoPE via `LTXVideoRotaryPosEmbed`
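For reference, RoPE rotates consecutive (even, odd) pairs along head_dim by position-dependent angles θ_i = pos · base^(−2i/d). A pure-Python sketch for a single head vector (base = 10000 is assumed here, the common default):

```python
import math

def rope_1d(x, pos, base=10000.0):
    """Rotate consecutive (even, odd) pairs of one head vector by
    position-dependent angles theta_i = pos * base**(-2i/d)."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)   # frequency decays with pair index
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = x[i], x[i + 1]
        out += [x0 * c - x1 * s, x0 * s + x1 * c]
    return out
```

Position 0 is the identity and rotations preserve the vector norm, which are two cheap invariants for checking a kernel's output.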

GEGLU vs GELU


  • GEGLU: Input `[batch, seq, 2*hidden]` -> Output `[batch, seq, hidden]`
  • GELU: Standard activation
  • LTX-Video uses GELU, NOT GEGLU
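The shape contract follows from GEGLU's definition: the projection output splits in half along the last dim, and one half gates the other through GELU. A pure-Python sketch using the tanh approximation (as in `activation_fn="gelu-approximate"`):

```python
import math

def gelu_approx(v):
    # tanh approximation of GELU, as in activation_fn="gelu-approximate"
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))

def geglu(x):
    """[..., 2*hidden] -> [..., hidden]: first half gated by gelu(second half)."""
    h = len(x) // 2
    hidden, gate = x[:h], x[h:]
    return [a * gelu_approx(g) for a, g in zip(hidden, gate)]

out = geglu([1.0, 2.0, 3.0, 4.0])   # hidden=[1, 2], gate=[3, 4]
print(len(out))  # 2
```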

AdaLN


  • Formula: `norm(x) * weight * (1 + scale) + shift`
  • Used in DiT blocks for conditioning
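A scalar sketch of the formula, taking norm to be a zero-mean/unit-variance LayerNorm over the last dimension (an assumption for illustration; the kernel's actual norm may differ):

```python
import math

def adaln(x, weight, scale, shift, eps=1e-6):
    """norm(x) * weight * (1 + scale) + shift, with norm as a plain LayerNorm."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std * w * (1.0 + scale) + shift
            for v, w in zip(x, weight)]

out = adaln([1.0, 2.0, 3.0, 4.0], [1.0] * 4, scale=0.0, shift=0.0)
# with scale=0, shift=0, and unit weight this reduces to plain LayerNorm
```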

Performance Profiling



NVIDIA Nsight Systems


```bash
nsys profile -o profile python your_script.py
```

NVIDIA Nsight Compute


```bash
ncu --set full -o metrics python your_script.py
```

Common Issues


See troubleshooting.md for all common issues and solutions.

Quick fixes:
  • "NoneType has no attribute contiguous": RMSNorm weight is None, create ones
  • isinstance() not matching: Use `type(module).__name__` instead
  • GEGLU not called: Model uses GELU, not GEGLU
  • Patching doesn't persist: Inject before `enable_model_cpu_offload()`
  • torch.compile fails with custom kernels: See below

torch.compile Compatibility


Custom CUDA kernels and `torch.compile` are mutually exclusive unless you register the kernel as a PyTorch custom op.

Error message:

```
torch._dynamo.exc.Unsupported: Attempted to call function marked as skipped
```

Workaround options:
  1. Use `--use-optimized-kernels` without `--compile` (6% speedup)
  2. Use `--compile` without custom kernels (34% speedup)
  3. Register kernel as custom op (advanced, requires `torch.library`)

To register as custom op (for torch.compile compatibility):

```python
import torch

@torch.library.custom_op("ltx_kernels::rmsnorm", mutates_args={"out"})
def rmsnorm(out: torch.Tensor, input: torch.Tensor, weight: torch.Tensor, eps: float) -> None:
    ops.rmsnorm_forward(out, input.contiguous(), weight.contiguous(), eps)

@rmsnorm.register_fake
def _(out, input, weight, eps):
    pass  # No shape changes
```

See Also


Scripts


  • benchmark_example.py - Benchmarking script for comparing optimized vs baseline - START HERE
  • ltx_kernel_injection_example.py - Minimal diffusers integration (~150 lines)
  • transformers_injection_example.py - Minimal transformers/LLM integration (~120 lines)
  • huggingface_kernels_example.py - HuggingFace Kernels Hub integration

Integration Guides


  • huggingface-kernels-integration.md - HuggingFace Kernels Hub (get_kernel) - load pre-compiled kernels
  • diffusers-integration.md - Complete diffusers pipeline integration
  • transformers-integration.md - Complete transformers/LLM integration

GPU Optimization Guides


  • h100-optimization-guide.md - H100 (Hopper, sm_90) deep dive
  • a100-optimization-guide.md - A100 (Ampere, sm_80) deep dive
  • t4-optimization-guide.md - T4 (Turing, sm_75) deep dive

Reference


  • troubleshooting.md - Common issues and solutions
  • kernel-templates.md - Complete kernel templates
  • examples/ltx_video/ - Full LTX-Video example directory

External Resources
