CUDA Kernels for Diffusers & Transformers
This skill provides patterns and guidance for developing optimized CUDA kernels targeting NVIDIA GPUs (H100, A100, T4) for use with HuggingFace diffusers and transformers libraries.
Quick Start
Diffusers (Video/Image Generation)
For benchmarking kernel performance:
```bash
# Benchmark with optimized kernels (6% end-to-end speedup)
python generate_video.py --use-optimized-kernels

# Benchmark baseline with torch.compile (34% speedup)
python generate_video.py --no-optimized-kernels --compile

# Compare configurations (note: --compile and --use-optimized-kernels are mutually exclusive)
python generate_video.py --use-optimized-kernels &&
python generate_video.py --no-optimized-kernels --compile
```
**For a minimal diffusers integration example (~150 lines):**
```bash
python scripts/ltx_kernel_injection_example.py
```

Transformers (LLMs)
For a minimal transformers integration example (~120 lines):
```bash
python scripts/transformers_injection_example.py
```

HuggingFace Kernels Hub
Load pre-compiled kernels from HuggingFace Hub (no local compilation):
```python
from kernels import get_kernel

# Load optimized activation kernels
activation = get_kernel("kernels-community/activation", version=1)

# Use the kernel
y = torch.empty_like(x)
activation.gelu_fast(y, x)
```
**For a complete HuggingFace Kernels example:**
```bash
python scripts/huggingface_kernels_example.py
```

Isolated Kernel Micro-benchmarks
```bash
python benchmark_rmsnorm.py
```

Supported Libraries & Models
| Library | Supported Models | Key Kernels |
|---|---|---|
| diffusers | LTX-Video, Stable Diffusion, FLUX, DiT | RMSNorm, GEGLU, RoPE, AdaLN |
| transformers | LLaMA, Mistral, Qwen, Falcon | RMSNorm, Attention |
| GPU | Compute Capability | Guide |
|---|---|---|
| H100 | sm_90 | h100-optimization-guide.md |
| A100 | sm_80 | a100-optimization-guide.md |
| T4 | sm_75 | t4-optimization-guide.md |
When This Skill Applies
适用场景
Use this skill when:
- Benchmarking kernel performance against baseline implementations
- Writing new CUDA kernels for diffusion models or LLMs
- Optimizing existing kernels for H100, A100, or T4 architecture
- Implementing custom attention, normalization, or activation layers
- Integrating kernels with diffusers pipelines (LTX-Video, Stable Diffusion, FLUX, DiT)
- Integrating kernels with transformers models (LLaMA, Mistral, Qwen)
- Debugging kernel performance issues on NVIDIA GPUs
Working Example
A complete working example is available at examples/ltx_video/. This demonstrates:
- Custom CUDA kernels (RMSNorm, RoPE 3D, GEGLU, AdaLN)
- Build system setup with setup.py, build.toml, and flake.nix
- PyTorch C++ bindings and Python API
- Benchmarking script for comparing optimized vs baseline performance
Benchmarking Kernels
Use the benchmark script to measure kernel performance:
```bash
# Full benchmark with all options
# (--use-optimized-kernels and --compile are mutually exclusive; pick one per run)
python scripts/benchmark_example.py \
    --use-optimized-kernels \
    --compile \
    --batch-size 1 \
    --num-frames 161 \
    --height 512 \
    --width 768 \
    --steps 50 \
    --warmup-iterations 2
```

Benchmark Script Options
| Option | Default | Description |
|---|---|---|
| --use-optimized-kernels | auto | Use custom H100 CUDA kernels |
| --no-optimized-kernels | - | Use baseline implementation |
| --compile | false | Enable torch.compile on transformer |
| --batch-size | 1 | Number of videos per prompt |
| --num-frames | 161 | Number of frames to generate |
| --height | 512 | Video height in pixels |
| --width | 768 | Video width in pixels |
| --steps | 50 | Denoising steps |
| --warmup-iterations | 2 | Warmup runs before benchmark |
Example Benchmark Results
基准测试结果示例
End-to-End Video Generation (49 frames, 30 steps, H100 80GB):
| Configuration | Time (s) | it/s | Speedup | Notes |
|---|---|---|---|---|
| Baseline (no compile) | 2.87 | 12.58 | 1.00x | Reference |
| Optimized Kernels | 2.70 | 13.52 | 1.06x | 6% faster |
| Baseline + torch.compile | 2.14 | 19.05 | 1.34x | 34% faster |
Important: --use-optimized-kernels and --compile are currently mutually exclusive. Custom kernels require PyTorch custom op registration to work with torch.compile.

Key metrics to capture:
- Device: GPU model (e.g., NVIDIA H100 80GB HBM3)
- Precision: Data type used (e.g., bfloat16)
- Resolution: Width x Height (e.g., 768x512)
- Frames: Number of frames generated (e.g., 49, 161)
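The figures in the results table below come straight from wall-clock timings; a small helper (`benchmark_summary` is an illustrative name, not part of the skill's scripts) shows the arithmetic behind the speedup column:

```python
def benchmark_summary(baseline_s: float, optimized_s: float, steps: int) -> dict:
    """Derive comparison metrics from two wall-clock timings."""
    return {
        "baseline_its": steps / baseline_s,    # approximate iterations/second
        "optimized_its": steps / optimized_s,
        "speedup": baseline_s / optimized_s,   # >1.0 means the optimized run is faster
    }

# 2.87s baseline vs 2.14s with torch.compile (30 denoising steps)
summary = benchmark_summary(2.87, 2.14, steps=30)
print(f"{summary['speedup']:.2f}x")  # → 1.34x, matching the table
```

Note that the benchmark script reports it/s from its own internal timer, so steps/time here is only an approximation of that column.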
RMSNorm Micro-benchmarks
RMSNorm微基准测试
The vectorized RMSNorm kernel achieves 2.67x average speedup over PyTorch baseline:
| Shape | Custom (ms) | PyTorch (ms) | Speedup |
|---|---|---|---|
| [1×1024×2048] | 0.019 | 0.065 | 3.37x |
| [2×1024×2048] | 0.024 | 0.073 | 3.04x |
| [4×1024×2048] | 0.036 | 0.093 | 2.58x |
| [2×4096×3072] | 0.087 | 0.208 | 2.41x |
| [4×4096×3072] | 0.157 | 0.392 | 2.49x |
Bandwidth efficiency: 38% of H100's theoretical 3.35 TB/s
Why end-to-end speedup is smaller: RMSNorm accounts for ~5% of total compute in LTX-Video. The remaining time is spent in attention (Flash Attention/SDPA), linear projections, and VAE decode.
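This is Amdahl's law; under the stated assumptions (RMSNorm ≈ 5% of runtime, 2.67x kernel speedup) the best possible end-to-end gain from RMSNorm alone is only a few percent:

```python
def end_to_end_speedup(fraction: float, kernel_speedup: float) -> float:
    """Amdahl's law: only `fraction` of total runtime benefits from the kernel."""
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)

# RMSNorm is ~5% of LTX-Video compute, so even a 2.67x kernel moves the
# end-to-end number by only ~3%; attention, projections, and VAE decode dominate.
print(f"{end_to_end_speedup(0.05, 2.67):.3f}x")  # → 1.032x
```

Even an infinitely fast RMSNorm kernel would cap out at 1/(1 - 0.05) ≈ 1.053x; the ~6% measured gain is consistent with several injected kernels (RMSNorm, RoPE, GEGLU, AdaLN) each shaving a small fraction.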
Project Structure
.claude/skills/cuda-kernels/
├── scripts/
│ ├── benchmark_example.py # End-to-end video generation benchmark
│ ├── benchmark_rmsnorm.py # Isolated RMSNorm micro-benchmark
│ ├── ltx_kernel_injection_example.py # Minimal diffusers integration (~150 lines)
│ ├── transformers_injection_example.py # Minimal transformers integration (~120 lines)
│ └── huggingface_kernels_example.py # HuggingFace Kernels Hub integration
├── references/
│ ├── diffusers-integration.md # Complete diffusers integration guide
│ ├── transformers-integration.md # Complete transformers integration guide
│ ├── huggingface-kernels-integration.md # HuggingFace Kernels Hub (get_kernel) guide
│ ├── troubleshooting.md # Common issues and solutions
│ ├── kernel-templates.md # CUDA kernel templates (includes vectorized)
│ ├── h100-optimization-guide.md # H100 (Hopper) optimization deep dive
│ ├── a100-optimization-guide.md # A100 (Ampere) optimization deep dive
│ └── t4-optimization-guide.md # T4 (Turing) optimization deep dive
└── SKILL.md # This file
examples/ltx_video/ # Complete working example
├── kernel_src/
│ └── rmsnorm.cu # Vectorized RMSNorm kernel (2.67x faster)
├── torch-ext/ # PyTorch bindings
├── generate_video.py # Full benchmark script
├── benchmark_rmsnorm.py # Isolated kernel benchmark
└── setup.py                          # pip install -e .
GPU Architecture Reference
H100 (Hopper) - Primary Target
| Spec | Value | Optimization Impact |
|---|---|---|
| SMs | 132 | Grid sizing: aim for multiples of 132 |
| Threads/SM | 2048 | Max 16 blocks of 128 threads per SM |
| Shared Memory | 192 KB/SM | Large tiles possible |
| L2 Cache | 50 MB | Reuse across blocks |
| Memory BW | 3.35 TB/s | Coalesced access critical |
| Warp Size | 32 | All reductions use warp shuffles |
Quick Comparison (H100 vs A100 vs T4)
| Spec | H100 | A100 | T4 |
|---|---|---|---|
| SMs | 132 | 108 | 40 |
| Memory BW | 3.35 TB/s | 2.0 TB/s | 320 GB/s |
| Shared Mem/SM | 192 KB | 164 KB | 64 KB |
| BF16 Support | Yes | Yes | No (FP16 only) |
| Compute Cap | sm_90 | sm_80 | sm_75 |
See detailed guides: H100 | A100 | T4
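The compute capabilities in this table map directly onto nvcc target flags when compiling for several of these GPUs at once; a small helper (illustrative, not part of the build system) might look like:

```python
# Map GPU name -> compute capability, per the comparison table above.
COMPUTE_CAPABILITY = {"H100": "9.0", "A100": "8.0", "T4": "7.5"}

def nvcc_arch_flags(gpus):
    """Build nvcc -gencode flags for a list of target GPUs."""
    flags = []
    for gpu in gpus:
        sm = COMPUTE_CAPABILITY[gpu].replace(".", "")  # "9.0" -> "90"
        flags.append(f"-gencode=arch=compute_{sm},code=sm_{sm}")
    return flags

print(nvcc_arch_flags(["H100", "T4"]))
```

The same "9.0"-style strings are what build.toml's cuda-capabilities field expects (see Building Kernels below).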
Core Kernel Patterns
Vectorized Memory Access (Critical for Performance)
BFloat16 vectorization using __nv_bfloat162:

```cuda
// Load 2 bfloat16 elements at once (32-bit load)
const __nv_bfloat162* vec_input = reinterpret_cast<const __nv_bfloat162*>(row_input);
#pragma unroll 4
for (int i = tid; i < vec_hidden; i += stride) {
  __nv_bfloat162 v = vec_input[i];
  float v0 = __bfloat162float(v.x);
  float v1 = __bfloat162float(v.y);
  sum_sq += v0 * v0 + v1 * v1;
}
```

FP16 vectorization using __half2:

```cuda
const __half2* vec_input = reinterpret_cast<const __half2*>(row_input);
__half2 v = vec_input[i];
float v0 = __half2float(v.x);
float v1 = __half2float(v.y);
```

FP32 vectorization using float4:

```cuda
const float4* vec_input = reinterpret_cast<const float4*>(row_input);
float4 v = vec_input[i];
sum_sq += v.x * v.x + v.y * v.y + v.z * v.z + v.w * v.w;
```
Warp Shuffle Reductions
```cuda
template <typename T>
__device__ __forceinline__ T warp_reduce_sum(T val) {
#pragma unroll
  for (int offset = 16; offset > 0; offset >>= 1) {
    val += __shfl_xor_sync(0xffffffff, val, offset);
  }
  return val;
}
```
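The XOR butterfly can be modeled on the host to see why every lane ends up with the full sum after log2(32) = 5 steps; a pure-Python sketch of the same dataflow:

```python
WARP_SIZE = 32

def warp_reduce_sum_model(lanes):
    """Host-side model of the XOR butterfly reduction above."""
    vals = list(lanes)
    offset = WARP_SIZE // 2
    while offset > 0:  # offsets 16, 8, 4, 2, 1 -- same as the CUDA loop
        vals = [vals[i] + vals[i ^ offset] for i in range(WARP_SIZE)]
        offset >>= 1
    return vals  # every "lane" now holds the warp-wide total

result = warp_reduce_sum_model(range(WARP_SIZE))
print(result[0], result[31])  # → 496 496 (sum of 0..31, in every lane)
```

Because every lane holds the result, no extra broadcast is needed after the reduction.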
Block Sizes for Attention
- BLOCK_SIZE_M = 128, BLOCK_SIZE_N = 64, BLOCK_SIZE_K = 64, NUM_WARPS = 8
Thread Configuration
For element-wise ops (RoPE, GEGLU):
```cuda
constexpr int BLOCK_SIZE = 256;
int num_blocks = (total_elements + BLOCK_SIZE - 1) / BLOCK_SIZE;
```

For reduction ops (LayerNorm, RMSNorm) with vectorization:

```cuda
// Divide by 2 for bf16/fp16 vectorized access
int threads = min(hidden_size / 2, MAX_THREADS);
threads = max(threads, WARP_SIZE);
threads = (threads + 32 - 1) / 32 * 32;  // Round to warp boundary
```
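The same launch geometry in plain Python, to make the rounding explicit (MAX_THREADS = 1024 is an assumed per-block limit):

```python
MAX_THREADS = 1024  # assumed CUDA per-block thread limit
WARP_SIZE = 32

def reduction_threads(hidden_size: int) -> int:
    """Pick a block size for a vectorized (2 elements/thread) reduction."""
    threads = min(hidden_size // 2, MAX_THREADS)
    threads = max(threads, WARP_SIZE)
    return (threads + WARP_SIZE - 1) // WARP_SIZE * WARP_SIZE  # round up to a warp

print(reduction_threads(2048))  # → 1024 (one thread per 2 elements)
print(reduction_threads(48))    # → 32 (24 rounds up to a full warp)
```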
Supported Data Types
All kernels support three precision modes:
- __half (FP16) - Default for inference
- __nv_bfloat16 (BF16) - Preferred for training
- float (FP32) - Reference/debugging
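One practical consequence of the GPU comparison table: T4 (sm_75) has no BF16 support, so dtype selection should fall back to FP16 there. A minimal sketch (GPU name strings and the helper itself are illustrative):

```python
BF16_GPUS = {"H100", "A100"}  # T4 (sm_75) has no BF16 support

def pick_dtype(gpu: str, training: bool = False, debug: bool = False) -> str:
    """FP16 default for inference, BF16 preferred for training, FP32 for debugging."""
    if debug:
        return "float32"
    if training and gpu in BF16_GPUS:
        return "bfloat16"
    return "float16"

print(pick_dtype("T4", training=True))    # → float16 (BF16 unavailable)
print(pick_dtype("H100", training=True))  # → bfloat16
```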
Building Kernels
With Nix (Recommended)
```bash
nix run .#build-and-copy --max-jobs 2 --cores 8 -L
```

With pip/uv
```bash
uv pip install -e .
```

build.toml Configuration
```toml
[general]
name = "ltx_kernels"
backends = ["cuda"]

[kernel.your_kernel]
backend = "cuda"
src = ["kernel_src/your_kernel.cu"]
cuda-capabilities = ["9.0"]
```

Library Integration
HuggingFace Kernels Hub (get_kernel)
See huggingface-kernels-integration.md for the complete guide.
Load pre-compiled, optimized kernels directly from HuggingFace Hub without local compilation:
```python
import torch
from kernels import get_kernel, has_kernel

# Check availability and load
if has_kernel("kernels-community/activation"):
    activation = get_kernel("kernels-community/activation", version=1)
    # Use the kernel
    x = torch.randn((4, 4), dtype=torch.float16, device="cuda")
    y = torch.empty_like(x)
    activation.gelu_fast(y, x)
```

**Key functions:**
- `get_kernel(repo_id, version=None)` - Download and load kernel from Hub
- `has_kernel(repo_id)` - Check if compatible build exists
- `get_local_kernel(path)` - Load from local directory (development)

**Popular community kernels:**
- `kernels-community/activation` - GELU, SiLU, etc.
- `kernels-community/flash-attn` - Flash Attention 2
- `kernels-community/triton-layer-norm` - LayerNorm, RMSNorm

Diffusers Integration (Video/Image Generation)
See diffusers-integration.md for the complete guide.
Transformers Integration (LLMs)
See transformers-integration.md for the complete guide.
Key differences from diffusers:
- Transformers RMSNorm always has weights (no elementwise_affine=False)
- Use 'RMSNorm' in class_name to match LlamaRMSNorm, MistralRMSNorm, etc.
- Check for variance_epsilon (LLaMA) or eps (others) for the epsilon value
- No set_processor() pattern - use Flash Attention 2 instead
Minimal transformers pattern:
```python
import torch
from transformers import AutoModelForCausalLM
from ltx_kernels import rmsnorm

def patch_rmsnorm(model):
    for name, module in model.named_modules():
        if 'RMSNorm' in type(module).__name__:
            eps = getattr(module, 'variance_epsilon', None) or getattr(module, 'eps', 1e-6)
            def make_forward(mod, epsilon):
                def forward(x):
                    return rmsnorm(x, mod.weight, eps=epsilon)
                return forward
            module.forward = make_forward(module, eps)

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", torch_dtype=torch.bfloat16)
patch_rmsnorm(model)
```

Diffusers Critical Pitfalls
1. RMSNorm Weight May Be None
LTX-Video uses elementwise_affine=False for some RMSNorm modules:

```python
# Transformer blocks: NO WEIGHT
self.norm1 = RMSNorm(dim, elementwise_affine=False)

# Attention modules: HAS WEIGHT
self.norm_q = torch.nn.RMSNorm(..., elementwise_affine=True)
```

**Solution:** Handle both cases:

```python
has_weight = hasattr(module, 'weight') and module.weight is not None
if has_weight:
    output = rmsnorm(x, module.weight, eps=eps)
else:
    weight = torch.ones(x.shape[-1], device=x.device, dtype=x.dtype)
    output = rmsnorm(x, weight, eps=eps)
```

2. Diffusers RMSNorm != torch.nn.RMSNorm
```python
# WRONG - misses diffusers RMSNorm
if isinstance(module, torch.nn.RMSNorm):
    ...

# CORRECT - catches all RMSNorm variants
if type(module).__name__ == 'RMSNorm':
    ...
```

3. LTX-Video Uses GELU, Not GEGLU
LTX-Video uses activation_fn="gelu-approximate". Don't patch GEGLU for LTX-Video.

4. Inject Kernels BEFORE CPU Offloading
```python
pipe = LTXPipeline.from_pretrained(...)
pipe.to("cuda")
inject_optimized_kernels(pipe)       # BEFORE offloading
pipe.enable_model_cpu_offload()      # Now safe
```

Minimal Integration Pattern
```python
import torch
from diffusers import LTXPipeline
from ltx_kernels import rmsnorm

def patch_rmsnorm_modules(model):
    """Patch all RMSNorm modules to use the custom kernel."""
    for name, module in model.named_modules():
        if type(module).__name__ == 'RMSNorm':
            eps = getattr(module, 'eps', 1e-6)
            has_weight = hasattr(module, 'weight') and module.weight is not None
            if has_weight:
                def make_forward(mod, epsilon):
                    def forward(x):
                        return rmsnorm(x, mod.weight, eps=epsilon)
                    return forward
                module.forward = make_forward(module, eps)
            else:
                def make_forward(epsilon):
                    def forward(x):
                        w = torch.ones(x.shape[-1], device=x.device, dtype=x.dtype)
                        return rmsnorm(x, w, eps=epsilon)
                    return forward
                module.forward = make_forward(eps)

# Usage
pipe = LTXPipeline.from_pretrained("Lightricks/LTX-Video", torch_dtype=torch.bfloat16)
pipe.to("cuda")
patch_rmsnorm_modules(pipe.transformer)
pipe.enable_model_cpu_offload()
```

Kernel-Specific Guidelines
RMSNorm
- Input layout: [..., hidden_size]
- Epsilon default: 1e-6
- Weight may be None if elementwise_affine=False
- Vectorization: Use __nv_bfloat162 for BF16, __half2 for FP16, float4 for FP32
- Performance: 2.67x faster than PyTorch with vectorized implementation
- Bandwidth: Achieves ~38% of H100's 3.35 TB/s theoretical bandwidth
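For validating kernel output, the same computation can be written as a dependency-free reference (one row as a plain list; weight=None models elementwise_affine=False):

```python
import math

def rmsnorm_ref(x, weight=None, eps=1e-6):
    """Reference RMSNorm over one row: x / sqrt(mean(x^2) + eps) * weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    if weight is None:  # elementwise_affine=False: weight is implicitly ones
        return [v * inv_rms for v in x]
    return [v * inv_rms * w for v, w in zip(x, weight)]

row = [1.0, -1.0, 2.0, -2.0]  # mean of squares = 2.5
print([round(v, 3) for v in rmsnorm_ref(row)])  # → [0.632, -0.632, 1.265, -1.265]
```

Comparing the CUDA kernel's output against this (within FP16/BF16 tolerance) is a quick correctness check before benchmarking.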
RoPE
- 1D: [batch, seq, heads, head_dim] - for text
- 3D: [batch, t*h*w, heads, head_dim] - for video
- LTX-Video computes its own RoPE via LTXVideoRotaryPosEmbed
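The 3D layout flattens the (t, h, w) video grid into a single sequence axis; assuming the standard row-major flattening (which is what [batch, t*h*w, heads, head_dim] implies), the voxel-to-sequence index is:

```python
def flatten_thw(t: int, h: int, w: int, H: int, W: int) -> int:
    """Row-major index of voxel (t, h, w) in a flattened t*h*w sequence."""
    return (t * H + h) * W + w

# A 3-frame, 4x5 grid: frame stride is H*W = 20, row stride is W = 5
print(flatten_thw(1, 2, 3, H=4, W=5))  # → 33 (1*20 + 2*5 + 3)
```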
GEGLU vs GELU
- GEGLU: Input [batch, seq, 2*hidden] -> Output [batch, seq, hidden]
- GELU: Standard activation
- LTX-Video uses GELU, NOT GEGLU
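The GEGLU shape contract can be illustrated with a pure-Python reference over one row (using the exact erf-based GELU here; a kernel may instead use the tanh approximation):

```python
import math

def gelu(v: float) -> float:
    """Exact GELU: v * Phi(v), where Phi is the standard normal CDF."""
    return v * 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))

def geglu_ref(row):
    """GEGLU: split [2*hidden] into (x, gate), return x * gelu(gate) -> [hidden]."""
    hidden = len(row) // 2
    x, gate = row[:hidden], row[hidden:]
    return [xi * gelu(gi) for xi, gi in zip(x, gate)]

out = geglu_ref([1.0, 2.0, 0.0, 3.0])  # hidden=2: x=[1, 2], gate=[0, 3]
print(len(out))  # → 2 -- output is half the input width
```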
AdaLN
- Formula: norm(x) * weight * (1 + scale) + shift
- Used in DiT blocks for conditioning
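The modulation arithmetic in pure Python (the base normalization is omitted here as an illustrative simplification; `normed` stands for an already-normalized row):

```python
def adaln_modulate(normed, weight, scale, shift):
    """AdaLN conditioning: norm(x) * weight * (1 + scale) + shift, per element."""
    return [n * w * (1.0 + sc) + sh
            for n, w, sc, sh in zip(normed, weight, scale, shift)]

# scale=0, shift=0 must reduce to plain weighted normalization
out = adaln_modulate([1.0, 2.0], [1.0, 1.0], [0.0, 0.0], [0.0, 0.0])
print(out)  # → [1.0, 2.0]
```

In DiT blocks, scale and shift come from the conditioning (e.g., timestep) embedding, so this runs once per token with per-channel parameters.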
Performance Profiling
```bash
# NVIDIA Nsight Systems
nsys profile -o profile python your_script.py

# NVIDIA Nsight Compute
ncu --set full -o metrics python your_script.py
```

Common Issues
See troubleshooting.md for all common issues and solutions.
Quick fixes:
- "NoneType has no attribute contiguous": RMSNorm weight is None - create a ones tensor
- isinstance() not matching: Use type(module).__name__ instead
- GEGLU not called: Model uses GELU, not GEGLU
- Patching doesn't persist: Inject before enable_model_cpu_offload()
- torch.compile fails with custom kernels: See below
torch.compile Compatibility
Custom CUDA kernels and torch.compile are mutually exclusive unless you register the kernel as a PyTorch custom op.

Error message:

```
torch._dynamo.exc.Unsupported: Attempted to call function marked as skipped
```

Workaround options:
- Use --use-optimized-kernels without --compile (6% speedup)
- Use --compile without custom kernels (34% speedup)
- Register kernel as custom op (advanced, requires torch.library)

To register as a custom op (for torch.compile compatibility):

```python
import torch

@torch.library.custom_op("ltx_kernels::rmsnorm", mutates_args={"out"})
def rmsnorm(out: torch.Tensor, input: torch.Tensor, weight: torch.Tensor, eps: float) -> None:
    ops.rmsnorm_forward(out, input.contiguous(), weight.contiguous(), eps)

@rmsnorm.register_fake
def _(out, input, weight, eps):
    pass  # No shape changes
```

See Also
Scripts
- benchmark_example.py - Benchmarking script for comparing optimized vs baseline - START HERE
- ltx_kernel_injection_example.py - Minimal diffusers integration (~150 lines)
- transformers_injection_example.py - Minimal transformers/LLM integration (~120 lines)
- huggingface_kernels_example.py - HuggingFace Kernels Hub integration
Integration Guides
- huggingface-kernels-integration.md - HuggingFace Kernels Hub (get_kernel) - load pre-compiled kernels
- diffusers-integration.md - Complete diffusers pipeline integration
- transformers-integration.md - Complete transformers/LLM integration
GPU Optimization Guides
- h100-optimization-guide.md - H100 (Hopper, sm_90) deep dive
- a100-optimization-guide.md - A100 (Ampere, sm_80) deep dive
- t4-optimization-guide.md - T4 (Turing, sm_75) deep dive
Reference
- troubleshooting.md - Common issues and solutions
- kernel-templates.md - Complete kernel templates
- examples/ltx_video/ - Full LTX-Video example directory