# Tutorial: Adding a New Kernel to sgl-kernel (AOT / Heavyweight)
This tutorial walks through adding a simple element-wise scale operation, `scale(x, factor) = x * factor`, as an AOT kernel to demonstrate the complete workflow.

## Goal

Add a new operation that scales each element of a tensor by a scalar factor:

- Input: tensor `x` (CUDA) and scalar `factor` (float)
- Output: `x * factor` (element-wise, in-place or into a pre-allocated `out`)
- Supported dtypes: FP16 (`torch.float16`), BF16 (`torch.bfloat16`), FP32 (`torch.float32`)
- Dispatched via the `DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16` macro (defined in `sgl-kernel/include/utils.h`)
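As a quick sanity reference, the contract above can be sketched in plain Python (semantics only, not the CUDA implementation; `scale_ref` is an illustrative name, not part of the repo):

```python
def scale_ref(xs, factor):
    """Reference semantics for scale: out[i] = xs[i] * factor.

    The CUDA kernel computes in float32 and casts back to the input
    dtype; in this plain-Python sketch everything is already a float.
    """
    return [x * factor for x in xs]


print(scale_ref([1.0, 2.0, 3.0], 0.5))  # [0.5, 1.0, 1.5]
```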
## Two rules of thumb (must follow)

- Prefer `python/sglang/jit_kernel` first when the kernel does not depend on CUTLASS or another large C++ project. This is the default path for lightweight kernels that benefit from rapid iteration.
- Prefer `sgl-kernel` when the kernel does depend on CUTLASS or another large C++ project, or when it should be part of the AOT wheel / torch op registration flow.
- Exception: if the dependency is `flashinfer`, or CUTLASS that is already provided through `flashinfer`, the kernel can still be implemented as a `jit_kernel`.

In addition, every new kernel must ship with:

- Tests (pytest)
- A benchmark script (triton.testing)
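The decision rules above can be encoded as a small helper for illustration (the function and parameter names here are hypothetical, not part of the repo):

```python
def pick_kernel_path(depends_on_cutlass_or_large_cpp: bool,
                     needs_aot_wheel_or_torch_op: bool,
                     dep_provided_by_flashinfer: bool = False) -> str:
    """Illustrative encoding of the two rules of thumb plus the exception."""
    # Heavy C++ dependency forces the AOT path, unless flashinfer already
    # ships that dependency (the documented exception).
    if depends_on_cutlass_or_large_cpp and not dep_provided_by_flashinfer:
        return "sgl-kernel"
    # Kernels that must live in the AOT wheel / torch op registration flow.
    if needs_aot_wheel_or_torch_op:
        return "sgl-kernel"
    # Default lightweight path with fast iteration.
    return "python/sglang/jit_kernel"


print(pick_kernel_path(False, False))  # python/sglang/jit_kernel
print(pick_kernel_path(True, False))   # sgl-kernel
```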
## Repository integration map

You will typically touch these files/areas:

- Implementation: `sgl-kernel/csrc/elementwise/scale.cu` (pick the right subdirectory)
- Public declarations: `sgl-kernel/include/sgl_kernel_ops.h`
- Torch extension registration: `sgl-kernel/csrc/common_extension.cc`
- Build: `sgl-kernel/CMakeLists.txt` (`set(SOURCES ...)`)
- Python API: `sgl-kernel/python/sgl_kernel/` and `sgl-kernel/python/sgl_kernel/__init__.py`
- Tests: `sgl-kernel/tests/test_scale.py`
- Benchmarks: `sgl-kernel/benchmark/bench_scale.py`
## Step 1: Implement the kernel in csrc/

Pick the right subdirectory:

- `csrc/elementwise/` — for element-wise ops (our example)
- `csrc/gemm/`, `csrc/attention/`, `csrc/moe/` — for other categories

Create `sgl-kernel/csrc/elementwise/scale.cu`:

```cpp
#include <ATen/cuda/CUDAContext.h>
#include <c10/cuda/CUDAGuard.h>
#include <torch/all.h>

#include "utils.h"  // DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16

// scale_kernel: out[i] = input[i] * factor
// Supports float, half (__half), __nv_bfloat16 via template T
template <typename T>
__global__ void scale_kernel(T* __restrict__ out,
                             const T* __restrict__ input,
                             float factor,
                             int64_t n) {
  int64_t idx = static_cast<int64_t>(blockIdx.x) * blockDim.x + threadIdx.x;
  if (idx < n) {
    out[idx] = static_cast<T>(static_cast<float>(input[idx]) * factor);
  }
}

void scale(at::Tensor& out, const at::Tensor& input, double factor) {
  TORCH_CHECK(input.is_cuda(), "input must be a CUDA tensor");
  TORCH_CHECK(input.is_contiguous(), "input must be contiguous");
  TORCH_CHECK(out.is_cuda(), "out must be a CUDA tensor");
  TORCH_CHECK(out.is_contiguous(), "out must be contiguous");
  TORCH_CHECK(out.sizes() == input.sizes(), "out and input must have the same shape");
  TORCH_CHECK(out.scalar_type() == input.scalar_type(),
              "out and input must have the same dtype");

  const int64_t n = input.numel();
  const int threads = 256;
  const int blocks = (n + threads - 1) / threads;

  // Set the device before querying the current stream.
  const at::cuda::OptionalCUDAGuard device_guard(device_of(input));
  const cudaStream_t stream = at::cuda::getCurrentCUDAStream();

  // Dispatches over float, float16, bfloat16
  DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16(input.scalar_type(), c_type, [&] {
    scale_kernel<c_type><<<blocks, threads, 0, stream>>>(
        static_cast<c_type*>(out.data_ptr()),
        static_cast<const c_type*>(input.data_ptr()),
        static_cast<float>(factor),
        n);
    cudaError_t status = cudaGetLastError();
    TORCH_CHECK(status == cudaSuccess,
                "scale_kernel launch failed: ", cudaGetErrorString(status));
    return true;
  });
}
```

Key points:

- Use `at::Tensor` (PyTorch tensors), `TORCH_CHECK` for validation, `at::cuda::getCurrentCUDAStream()` for the stream
- `DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16` covers `float` (FP32), `half` (FP16), and `__nv_bfloat16` (BF16)
- Add device error checking after every kernel launch
- If a kernel only works on certain architectures, enforce that with `TORCH_CHECK` and add skip logic in tests
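The launch configuration used above (256 threads per block, ceil-divide for the grid, `if (idx < n)` bounds guard) can be checked with a quick stdlib sketch; `launch_config` and `covered_indices` are illustrative names mirroring `(n + threads - 1) / threads` and `blockIdx.x * blockDim.x + threadIdx.x`:

```python
def launch_config(n: int, threads: int = 256) -> int:
    """blocks = ceil(n / threads), as in `(n + threads - 1) / threads`."""
    return (n + threads - 1) // threads


def covered_indices(n: int, threads: int = 256):
    """Each (block, thread) pair maps to one global index; the `if (idx < n)`
    guard drops the tail of the last, partially filled block."""
    blocks = launch_config(n, threads)
    return [b * threads + t
            for b in range(blocks) for t in range(threads)
            if b * threads + t < n]


assert launch_config(1000) == 4                     # 4 * 256 = 1024 >= 1000
assert covered_indices(1000) == list(range(1000))   # every element exactly once
```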
## Step 2: Add a C++ declaration in include/sgl_kernel_ops.h

Edit `sgl-kernel/include/sgl_kernel_ops.h` and add the declaration to the elementwise section:

```cpp
void scale(at::Tensor& out, const at::Tensor& input, double factor);
```

## Step 3: Register the op in csrc/common_extension.cc

Edit `sgl-kernel/csrc/common_extension.cc`, inside `TORCH_LIBRARY_FRAGMENT(sgl_kernel, m)`:

```cpp
// From csrc/elementwise
m.def("scale(Tensor! out, Tensor input, float factor) -> ()");
m.impl("scale", torch::kCUDA, &scale);
```

Key points:

- `Tensor!` marks an in-place / mutable output argument
- The schema is important for `torch.compile` and for consistent call signatures
- If your underlying C++ API uses `double` but the schema declares `float`, the implicit cast is fine for scalars; use shims if needed for other types
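To see what the `Tensor!` annotation means in isolation, you can register the same schema under a throwaway namespace with a CPU implementation. This sketch assumes a recent PyTorch; the `scale_demo` namespace is purely illustrative and independent of sgl-kernel:

```python
import torch

# Throwaway demo namespace -- not part of sgl-kernel.
lib = torch.library.Library("scale_demo", "DEF")
lib.define("scale(Tensor! out, Tensor input, float factor) -> ()")


def scale_cpu(out: torch.Tensor, input: torch.Tensor, factor: float) -> None:
    # The `Tensor!` argument is mutated in place, matching the CUDA launcher.
    out.copy_(input * factor)


lib.impl("scale", scale_cpu, "CPU")

x = torch.tensor([1.0, 2.0])
out = torch.empty_like(x)
torch.ops.scale_demo.scale(out, x, 3.0)
print(out)  # tensor([3., 6.])
```

The op returns nothing; the caller observes the result through the mutated `out` argument, which is exactly the contract the `scale` Python wrapper in Step 5 relies on.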
## Step 4: Add the new source file to CMakeLists.txt

Edit `sgl-kernel/CMakeLists.txt`, add to `set(SOURCES ...)`:

```cmake
csrc/elementwise/scale.cu
```

Key points:

- Keep the list alphabetically sorted (the file explicitly requires this)
- If the kernel has arch constraints, reflect that in tests/benchmarks via skip logic
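Since the sorted-order requirement is easy to violate by hand, a one-line check catches it early (the sibling entries shown here are illustrative, not a copy of the real `SOURCES` list):

```python
# Hypothetical excerpt of a SOURCES list, with our new file inserted.
sources = [
    "csrc/attention/merge_attn_states.cu",
    "csrc/elementwise/activation.cu",
    "csrc/elementwise/scale.cu",
    "csrc/gemm/awq_kernel.cu",
]

# The CMakeLists.txt rule: entries must stay alphabetically sorted.
assert sources == sorted(sources), "SOURCES must stay alphabetically sorted"
print("SOURCES ordering OK")
```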
## Step 5: Expose a Python API under sgl-kernel/python/sgl_kernel/

Prefer following the existing module organization first. For elementwise kernels, the usual pattern is:

- implement the Python wrapper in `sgl-kernel/python/sgl_kernel/elementwise.py`
- then re-export it from `sgl-kernel/python/sgl_kernel/__init__.py`

For example, in `sgl-kernel/python/sgl_kernel/elementwise.py`, add:

```python
import torch


def scale(
    input: torch.Tensor,
    factor: float,
    out: torch.Tensor | None = None,
) -> torch.Tensor:
    """
    Element-wise scale: out = input * factor.

    Supported dtypes: torch.float16, torch.bfloat16, torch.float32.

    Parameters
    ----------
    input : CUDA input tensor
    factor : scale factor (float)
    out : optional pre-allocated CUDA output tensor (same shape/dtype as input)
    """
    if out is None:
        out = torch.empty_like(input)
    torch.ops.sgl_kernel.scale.default(out, input, factor)
    return out
```

Then re-export it from `sgl-kernel/python/sgl_kernel/__init__.py` following the existing import style used by other kernels.
## Step 6: Write tests (required)
Create `sgl-kernel/tests/test_scale.py`:

```python
import pytest
import torch

import sgl_kernel


@pytest.mark.parametrize("dtype", [torch.float16, torch.bfloat16, torch.float32])
@pytest.mark.parametrize("size", [128, 1024, 4096, 65536])
@pytest.mark.parametrize("factor", [0.5, 1.0, 2.0])
def test_scale_correctness(dtype, size, factor):
    input = torch.randn(size, dtype=dtype, device="cuda")
    out = torch.empty_like(input)
    result = sgl_kernel.scale(input, factor, out=out)
    assert result is out
    expected = input * factor
    rtol, atol = (1e-5, 1e-6) if dtype == torch.float32 else (1e-2, 1e-2)
    torch.testing.assert_close(out, expected, rtol=rtol, atol=atol)


def test_scale_shape_mismatch():
    input = torch.randn(128, dtype=torch.float16, device="cuda")
    out = torch.empty(256, dtype=torch.float16, device="cuda")
    with pytest.raises(RuntimeError, match="same shape"):
        sgl_kernel.scale(input, 2.0, out=out)


def test_scale_cpu_input():
    input = torch.randn(128, dtype=torch.float16)  # CPU
    out = torch.empty_like(input)
    with pytest.raises(RuntimeError, match="CUDA"):
        sgl_kernel.scale(input, 2.0, out=out)


if __name__ == "__main__":
    pytest.main([__file__, "-q"])
```
## Step 7: Add a benchmark (required)
Create `sgl-kernel/benchmark/bench_scale.py`:

```python
import itertools
import os

import torch
import triton
import triton.testing

import sgl_kernel

IS_CI = (
    os.getenv("CI", "false").lower() == "true"
    or os.getenv("GITHUB_ACTIONS", "false").lower() == "true"
)

dtypes = [torch.float16] if IS_CI else [torch.float16, torch.bfloat16, torch.float32]
sizes = [4096] if IS_CI else [2**n for n in range(10, 20)]  # 1K ... 512K
configs = list(itertools.product(dtypes, sizes))


def torch_scale(input: torch.Tensor, factor: float) -> torch.Tensor:
    return input * factor


@triton.testing.perf_report(
    triton.testing.Benchmark(
        x_names=["dtype", "size"],
        x_vals=configs,
        line_arg="provider",
        line_vals=["sglang", "torch"],
        line_names=["SGL Kernel", "PyTorch"],
        styles=[("green", "-"), ("red", "--")],
        ylabel="µs (median)",
        plot_name="scale-performance",
        args={},
    )
)
def benchmark(dtype, size, provider):
    input = torch.randn(size, dtype=dtype, device="cuda")
    out = torch.empty_like(input)
    factor = 2.0
    if provider == "sglang":
        fn = lambda: sgl_kernel.scale(input, factor, out=out)
    else:
        fn = lambda: torch_scale(input, factor)
    ms, min_ms, max_ms = triton.testing.do_bench_cudagraph(
        fn, quantiles=[0.5, 0.2, 0.8]
    )
    return 1000 * ms, 1000 * max_ms, 1000 * min_ms


if __name__ == "__main__":
    benchmark.run(print_data=True)
```
## Step 8: Build
Build:

```bash
cd sgl-kernel
make build -j16
```

If you need to limit host resource usage:

```bash
cd sgl-kernel
make build -j1 MAX_JOBS=2 CMAKE_ARGS="-DSGL_KERNEL_COMPILE_THREADS=1"
```

## Step 9: Validate
After building successfully, run the tests and benchmark:

```bash
pytest sgl-kernel/tests/test_scale.py -q
python sgl-kernel/benchmark/bench_scale.py
```

## Troubleshooting
- Async CUDA errors: rerun with `CUDA_LAUNCH_BLOCKING=1`
- Memory errors: `compute-sanitizer --tool memcheck python ...`
- Build is too slow / OOM: reduce `MAX_JOBS` and `SGL_KERNEL_COMPILE_THREADS`
- Binary bloat: use `sgl-kernel/analyze_whl_kernel_sizes.py`
- CMake sources list: if your `.cu` file is missing from `SOURCES`, the symbol will be undefined at link time
## References

- `sgl-kernel/README.md`
- `sgl-kernel/include/sgl_kernel_ops.h`
- `sgl-kernel/csrc/common_extension.cc`
- `sgl-kernel/CMakeLists.txt`
- `sgl-kernel/include/utils.h` — `DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16` macro and friends
- `sgl-kernel/csrc/elementwise/activation.cu` — reference for the FP16/BF16/FP32 dispatch pattern
## Summary of Files Created/Modified

```
sgl-kernel/csrc/elementwise/scale.cu         # NEW: CUDA kernel + launcher
sgl-kernel/include/sgl_kernel_ops.h          # MODIFIED: C++ declaration
sgl-kernel/csrc/common_extension.cc          # MODIFIED: schema + dispatch registration
sgl-kernel/CMakeLists.txt                    # MODIFIED: add source file (alphabetical)
sgl-kernel/python/sgl_kernel/elementwise.py  # MODIFIED: Python wrapper
sgl-kernel/python/sgl_kernel/__init__.py     # MODIFIED: re-export Python API
sgl-kernel/tests/test_scale.py               # NEW: tests
sgl-kernel/benchmark/bench_scale.py          # NEW: benchmark
```