add-sgl-kernel

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

Tutorial: Adding a New Kernel to sgl-kernel (AOT / Heavyweight)


This tutorial walks through adding a simple element-wise scale operation as an AOT kernel. We'll implement scale(x, factor) = x * factor to demonstrate the complete workflow.

Goal


Add a new operation that scales each element of a tensor by a scalar factor:
  • Input: tensor x (CUDA) and scalar factor (float)
  • Output: x * factor (element-wise, in-place or into a pre-allocated out tensor)
  • Supported dtypes: FP16 (torch.float16), BF16 (torch.bfloat16), FP32 (torch.float32)
    • Dispatched via the DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16 macro (defined in sgl-kernel/include/utils.h)
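The semantics above can be pinned down with a plain-Python reference model (no CUDA required). It mirrors the compute pattern the CUDA kernel will use for FP16/BF16: upcast each element to float, multiply, cast back. The name scale_ref is hypothetical, used only for illustration.

```python
def scale_ref(x, factor):
    """Reference semantics for scale(x, factor) = x * factor.

    Each element is upcast to float, multiplied by the scalar factor,
    then cast back to the element type (here everything is already a
    Python float).
    """
    return [float(v) * factor for v in x]

print(scale_ref([1.0, -2.0, 4.0], 0.5))  # [0.5, -1.0, 2.0]
```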

Two rules of thumb (must follow)


  1. Prefer python/sglang/jit_kernel first when the kernel does not depend on CUTLASS or another large C++ project. This is the default path for lightweight kernels that benefit from rapid iteration.
  2. Prefer sgl-kernel when the kernel does depend on CUTLASS or another large C++ project, or when it should be part of the AOT wheel / torch op registration flow.
  3. Exception: if the dependency is flashinfer, or CUTLASS that is already provided through flashinfer, the kernel can still be implemented as a jit_kernel.
In addition, every new kernel must ship with:
  • Tests (pytest)
  • A benchmark script (triton.testing)

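The decision rules above can be sketched as a small helper. This is purely illustrative; the function and its parameters are hypothetical, not part of the repository.

```python
def choose_kernel_home(depends_on_large_cpp_project: bool,
                       dependency_is_flashinfer: bool = False,
                       needs_aot_wheel: bool = False) -> str:
    """Apply the two rules of thumb (plus the flashinfer exception)."""
    # Exception: flashinfer (or CUTLASS already provided through
    # flashinfer) still qualifies for the lightweight JIT path.
    if dependency_is_flashinfer:
        return "python/sglang/jit_kernel"
    # Rule 2: large C++ dependency, or AOT wheel / torch op registration.
    if depends_on_large_cpp_project or needs_aot_wheel:
        return "sgl-kernel"
    # Rule 1: default path for lightweight kernels.
    return "python/sglang/jit_kernel"

print(choose_kernel_home(depends_on_large_cpp_project=True))  # sgl-kernel
```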

Repository integration map


You will typically touch these files/areas:
  • Implementation: sgl-kernel/csrc/elementwise/scale.cu (pick the right subdirectory)
  • Public declarations: sgl-kernel/include/sgl_kernel_ops.h
  • Torch extension registration: sgl-kernel/csrc/common_extension.cc
  • Build: sgl-kernel/CMakeLists.txt (set(SOURCES ...))
  • Python API: sgl-kernel/python/sgl_kernel/ and sgl-kernel/python/sgl_kernel/__init__.py
  • Tests: sgl-kernel/tests/test_scale.py
  • Benchmarks: sgl-kernel/benchmark/bench_scale.py


Step 1: Implement the kernel in csrc/


Pick the right subdirectory:
  • csrc/elementwise/ — for element-wise ops (our example)
  • csrc/gemm/, csrc/attention/, csrc/moe/ — for other categories
Create sgl-kernel/csrc/elementwise/scale.cu:
cpp
#include <ATen/cuda/CUDAContext.h>
#include <c10/cuda/CUDAGuard.h>
#include <torch/all.h>

#include "utils.h"  // DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16

// scale_kernel: out[i] = input[i] * factor
// Supports float, half (__half), __nv_bfloat16 via template T
template <typename T>
__global__ void scale_kernel(T* __restrict__ out,
                              const T* __restrict__ input,
                              float factor,
                              int64_t n) {
  int64_t idx = static_cast<int64_t>(blockIdx.x) * blockDim.x + threadIdx.x;
  if (idx < n) {
    out[idx] = static_cast<T>(static_cast<float>(input[idx]) * factor);
  }
}

void scale(at::Tensor& out, const at::Tensor& input, double factor) {
  TORCH_CHECK(input.is_cuda(),       "input must be a CUDA tensor");
  TORCH_CHECK(input.is_contiguous(), "input must be contiguous");
  TORCH_CHECK(out.is_cuda(),         "out must be a CUDA tensor");
  TORCH_CHECK(out.is_contiguous(),   "out must be contiguous");
  TORCH_CHECK(out.sizes() == input.sizes(),  "out and input must have the same shape");
  TORCH_CHECK(out.scalar_type() == input.scalar_type(),
              "out and input must have the same dtype");

  const int64_t n = input.numel();
  const int threads = 256;
  const int blocks  = static_cast<int>((n + threads - 1) / threads);

  // Set the device before querying the current stream.
  const at::cuda::OptionalCUDAGuard device_guard(device_of(input));
  const cudaStream_t stream = at::cuda::getCurrentCUDAStream();

  // Dispatches over float, float16, bfloat16
  DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16(input.scalar_type(), c_type, [&] {
    scale_kernel<c_type><<<blocks, threads, 0, stream>>>(
        static_cast<c_type*>(out.data_ptr()),
        static_cast<const c_type*>(input.data_ptr()),
        static_cast<float>(factor),
        n);
    cudaError_t status = cudaGetLastError();
    TORCH_CHECK(status == cudaSuccess,
                "scale_kernel launch failed: ", cudaGetErrorString(status));
    return true;
  });
}
Key points:
  • Use at::Tensor (PyTorch tensors), TORCH_CHECK for validation, and at::cuda::getCurrentCUDAStream() for the stream
  • DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16 covers float, half (FP16), and __nv_bfloat16 (BF16)
  • Add device error checking after every kernel launch
  • If a kernel only works on certain architectures, enforce that with TORCH_CHECK and skip logic in tests
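The launch configuration in the launcher uses ceiling division so the grid covers every element; the arithmetic is easy to sanity-check in plain Python:

```python
def launch_config(n: int, threads: int = 256) -> tuple:
    """Ceil-divide n elements into blocks of `threads` threads each."""
    blocks = (n + threads - 1) // threads
    return blocks, threads

# Every element gets exactly one thread slot; the tail block may be
# partial, which is why the kernel guards with `if (idx < n)`.
print(launch_config(1000))  # (4, 256): 4 * 256 = 1024 >= 1000
print(launch_config(1024))  # (4, 256): exact fit, no extra block
```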


Step 2: Add a C++ declaration in include/sgl_kernel_ops.h


Edit sgl-kernel/include/sgl_kernel_ops.h, adding to the elementwise section:
cpp
void scale(at::Tensor& out, const at::Tensor& input, double factor);


Step 3: Register the op in csrc/common_extension.cc


Edit sgl-kernel/csrc/common_extension.cc, inside TORCH_LIBRARY_FRAGMENT(sgl_kernel, m):
cpp
// From csrc/elementwise
m.def("scale(Tensor! out, Tensor input, float factor) -> ()");
m.impl("scale", torch::kCUDA, &scale);
Key points:
  • Tensor! means an in-place / mutable output argument
  • The schema matters for torch.compile and for consistent call signatures
  • If your underlying C++ API uses float but the PyTorch binding expects double, the implicit cast is fine for scalars; use shims if needed for other types


Step 4: Add the new source file to CMakeLists.txt


Edit sgl-kernel/CMakeLists.txt, adding to set(SOURCES ...):
cmake
csrc/elementwise/scale.cu
Key points:
  • Keep the list alphabetically sorted (the file explicitly requires this)
  • If the kernel has arch constraints, reflect that in tests/benchmarks via skip logic
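A quick sanity check for the alphabetical-ordering rule can be scripted in a few lines. The helper name and the gemm entry below are hypothetical, used only for illustration:

```python
def is_sorted_alphabetically(sources: list) -> bool:
    """True if the CMake SOURCES entries are in alphabetical order."""
    return sources == sorted(sources)

sources = [
    "csrc/elementwise/activation.cu",
    "csrc/elementwise/scale.cu",
    "csrc/gemm/some_gemm.cu",  # hypothetical entry
]
print(is_sorted_alphabetically(sources))  # True
```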


Step 5: Expose a Python API under sgl-kernel/python/sgl_kernel/


Prefer following the existing module organization first. For elementwise kernels, the usual pattern is:
  • implement the Python wrapper in sgl-kernel/python/sgl_kernel/elementwise.py
  • then re-export it from sgl-kernel/python/sgl_kernel/__init__.py
For example, in sgl-kernel/python/sgl_kernel/elementwise.py, add:
python
import torch

def scale(
    input: torch.Tensor,
    factor: float,
    out: torch.Tensor | None = None,
) -> torch.Tensor:
    """
    Element-wise scale: out = input * factor.

    Supported dtypes: torch.float16, torch.bfloat16, torch.float32.

    Parameters
    ----------
    input  : CUDA input tensor
    factor : scale factor (float)
    out    : optional pre-allocated CUDA output tensor (same shape/dtype as input)
    """
    if out is None:
        out = torch.empty_like(input)
    torch.ops.sgl_kernel.scale.default(out, input, factor)
    return out
Then re-export it from sgl-kernel/python/sgl_kernel/__init__.py, following the existing import style used by other kernels.


Step 6: Write tests (required)


Create sgl-kernel/tests/test_scale.py:
python
import pytest

import torch
import sgl_kernel

@pytest.mark.parametrize("dtype", [torch.float16, torch.bfloat16, torch.float32])
@pytest.mark.parametrize("size", [128, 1024, 4096, 65536])
@pytest.mark.parametrize("factor", [0.5, 1.0, 2.0])
def test_scale_correctness(dtype, size, factor):
    input = torch.randn(size, dtype=dtype, device="cuda")
    out   = torch.empty_like(input)

    result = sgl_kernel.scale(input, factor, out=out)
    assert result is out

    expected = input * factor
    rtol, atol = (1e-5, 1e-6) if dtype == torch.float32 else (1e-2, 1e-2)
    torch.testing.assert_close(out, expected, rtol=rtol, atol=atol)


def test_scale_shape_mismatch():
    input = torch.randn(128, dtype=torch.float16, device="cuda")
    out   = torch.empty(256, dtype=torch.float16, device="cuda")
    with pytest.raises(RuntimeError, match="same shape"):
        sgl_kernel.scale(input, 2.0, out=out)


def test_scale_cpu_input():
    input = torch.randn(128, dtype=torch.float16)  # CPU
    out   = torch.empty_like(input)
    with pytest.raises(RuntimeError, match="CUDA"):
        sgl_kernel.scale(input, 2.0, out=out)


if __name__ == "__main__":
    pytest.main([__file__, "-q"])


Step 7: Add a benchmark (required)


Create sgl-kernel/benchmark/bench_scale.py:
python
import itertools
import os

import torch
import triton
import triton.testing

import sgl_kernel

IS_CI = (
    os.getenv("CI", "false").lower() == "true"
    or os.getenv("GITHUB_ACTIONS", "false").lower() == "true"
)

dtypes  = [torch.float16] if IS_CI else [torch.float16, torch.bfloat16, torch.float32]
sizes   = [4096] if IS_CI else [2**n for n in range(10, 20)]  # 1K … 512K

configs = list(itertools.product(dtypes, sizes))


def torch_scale(input: torch.Tensor, factor: float) -> torch.Tensor:
    return input * factor


@triton.testing.perf_report(
    triton.testing.Benchmark(
        x_names=["dtype", "size"],
        x_vals=configs,
        line_arg="provider",
        line_vals=["sglang", "torch"],
        line_names=["SGL Kernel", "PyTorch"],
        styles=[("green", "-"), ("red", "--")],
        ylabel="µs (median)",
        plot_name="scale-performance",
        args={},
    )
)
def benchmark(dtype, size, provider):
    input  = torch.randn(size, dtype=dtype, device="cuda")
    out    = torch.empty_like(input)
    factor = 2.0

    if provider == "sglang":
        fn = lambda: sgl_kernel.scale(input, factor, out=out)
    else:
        fn = lambda: torch_scale(input, factor)

    ms, min_ms, max_ms = triton.testing.do_bench_cudagraph(
        fn, quantiles=[0.5, 0.2, 0.8]
    )
    return 1000 * ms, 1000 * max_ms, 1000 * min_ms


if __name__ == "__main__":
    benchmark.run(print_data=True)


Step 8: Build


Build:
bash
cd sgl-kernel
make build -j16
If you need to limit host resource usage:
bash
cd sgl-kernel
make build -j1 MAX_JOBS=2 CMAKE_ARGS="-DSGL_KERNEL_COMPILE_THREADS=1"


Step 9: Validate


After building successfully, run the test and benchmark:
bash
pytest sgl-kernel/tests/test_scale.py -q
python sgl-kernel/benchmark/bench_scale.py


Troubleshooting


  • Async CUDA errors: rerun with CUDA_LAUNCH_BLOCKING=1
  • Memory errors: compute-sanitizer --tool memcheck python ...
  • Build too slow / OOM: reduce MAX_JOBS and SGL_KERNEL_COMPILE_THREADS
  • Binary bloat: use sgl-kernel/analyze_whl_kernel_sizes.py
  • CMake sources list: if your .cu file is missing from SOURCES, the symbol will be undefined at link time


References


  • sgl-kernel/README.md
  • sgl-kernel/include/sgl_kernel_ops.h
  • sgl-kernel/csrc/common_extension.cc
  • sgl-kernel/CMakeLists.txt
  • sgl-kernel/include/utils.h — DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16 macro and friends
  • sgl-kernel/csrc/elementwise/activation.cu — reference for the FP16/BF16/FP32 dispatch pattern

Summary of Files Created/Modified


sgl-kernel/csrc/elementwise/scale.cu          # NEW: CUDA kernel + launcher
sgl-kernel/include/sgl_kernel_ops.h           # MODIFIED: C++ declaration
sgl-kernel/csrc/common_extension.cc           # MODIFIED: schema + dispatch registration
sgl-kernel/CMakeLists.txt                     # MODIFIED: add source file (alphabetical)
sgl-kernel/python/sgl_kernel/elementwise.py   # MODIFIED: Python wrapper
sgl-kernel/python/sgl_kernel/__init__.py      # MODIFIED: re-export Python API
sgl-kernel/tests/test_scale.py                # NEW: tests
sgl-kernel/benchmark/bench_scale.py           # NEW: benchmark