exec-slurm-compile


Compile TensorRT-LLM on SLURM Cluster


Submit, monitor, and verify a TensorRT-LLM compilation job on a SLURM cluster using enroot containers.

When to Use

| Scenario | Use This Skill? |
| --- | --- |
| User wants to compile TRT-LLM on a SLURM cluster | Yes |
| User is already on a compute node and wants to compile | No; use the `exec-local-compile` skill instead |

Finding the Docker Image

The official Docker image tag for a given TensorRT-LLM version is recorded in the repo itself, in `<repo_dir>/jenkins/current_image_tags.properties`. Read this file to find the current image URL (e.g., `urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm:pytorch-25.12-py3-aarch64-ubuntu24.04-trt10.14.1.48-skip-tritondevel-202602011118-10901`).
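As a rough illustration of reading that file, the sketch below prints the value of every entry. It assumes the properties file uses plain `KEY=VALUE` lines with `#` comments; the key names are not given in this document, so none are hard-coded, and `list_image_urls` is a hypothetical helper name.

```bash
# Print the value part of every KEY=VALUE line in a properties file.
# Assumption: plain KEY=VALUE lines; lines starting with '#' are comments.
list_image_urls() {
    grep -v '^[[:space:]]*#' "$1" | grep '=' | cut -d= -f2-
}

# Usage (the path is a placeholder):
#   list_image_urls /path/to/TensorRT-LLM/jenkins/current_image_tags.properties
```

Pick the entry that matches the target platform (for example, an `aarch64` tag for Arm nodes) before passing it to the import step.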

Pre-dumping the Container Image (enroot import)

SLURM clusters using enroot/pyxis require a `.sqsh` container image. To avoid download overhead at compile time, pre-dump the image in advance using the `enroot-import` companion script:

```bash
# Basic usage: submits a SLURM job on a CPU partition to import the image
enroot-import --partition cpu_datamover --debug <docker_image_url>
```

The script submits an `sbatch` job that runs `enroot import docker://<image_url>` and produces a `.sqsh` file in the current directory. The output on stdout is the SLURM job ID.

enroot-import flags

| Flag | Description |
| --- | --- |
| `-p`, `--partition` | SLURM partition for the import job (use a CPU partition like `cpu_datamover`) |
| `-d`, `--debug` | Enable debug output and preserve the SLURM log (recommended) |
| `-o`, `--output` | Custom output path for the `.sqsh` file |
| `-A`, `--account` | SLURM account (defaults to user's first account) |
| `-t`, `--time` | Time limit for the import job (default: 1 hour) |
| `-n`, `--just-print` | Print the sbatch command without executing |
| `-J`, `--job-name` | Custom job name |

enroot-import workflow

1. Read the image tag from `jenkins/current_image_tags.properties` in the TRT-LLM repo.
2. Run `enroot-import` to submit the import job:
   ```bash
   cd <directory_where_sqsh_should_be_stored>
   <path_to>/enroot-import --partition cpu_datamover --debug <image_url>
   ```
   IMPORTANT: Convert `urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm:xxx` to `urm.nvidia.com#sw-tensorrt-docker/tensorrt-llm:xxx` to avoid credential issues. Only the first `/` (the one right after the registry host) becomes `#`; enroot uses `#` to separate the registry from the image path.
3. Wait for the import job to complete (`squeue -j <job_id>`).
4. The resulting `.sqsh` file is the `container_image` used in the compile step.
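The URL conversion in step 2 is easy to get wrong by hand, so here is a minimal sketch of it. `to_enroot_uri` is a hypothetical helper, not part of `enroot-import`, and it assumes the input has no `https://` prefix.

```bash
# Convert REGISTRY/IMAGE[:TAG] to enroot's REGISTRY#IMAGE[:TAG] form.
# to_enroot_uri is a hypothetical helper name used for illustration only.
to_enroot_uri() {
    case "$1" in
        *'#'*) printf '%s\n' "$1" ;;                 # already converted, leave as-is
        *)     printf '%s\n' "$1" | sed 's|/|#|' ;;  # replace only the first '/'
    esac
}

to_enroot_uri "urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm:xxx"
# prints: urm.nvidia.com#sw-tensorrt-docker/tensorrt-llm:xxx
```

The result can then be passed to `enroot-import` in place of `<image_url>`.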

Prerequisites

The user must provide (or you must ask for) these values:

| Parameter | Description | Example |
| --- | --- | --- |
| `container_image` | Path to `.sqsh` container image (see enroot import above) | `/path/to/pytorch.sqsh` |
| `repo_dir` | Path to the TensorRT-LLM repository | `/path/to/TensorRT-LLM` |
| `mount_dir` | Top-level directory to bind-mount into the container | `/shared/users` |
| `partition` | SLURM partition | `batch` |
| `account` | SLURM account | `my_account` |

Optional parameters:

| Parameter | Description | Default |
| --- | --- | --- |
| `jobname` | SLURM job name | `trtllm-compile.<username>` |
| `gpu_count` | Number of GPUs to request | `4` |
| `time_limit` | Job time limit | `02:00:00` |
| `arch` | GPU architecture(s) for the `-a` flag | `100-real` |
| `extra_build_args` | Extra flags for `build_wheel.py` | (none) |

Companion Scripts

This skill includes four companion scripts in `scripts/`:

| Script | Purpose |
| --- | --- |
| `enroot-import` | Pre-dump a Docker image to `.sqsh` via a SLURM batch job |
| `submit_compile.sh` | Template for submitting the SLURM job; copy and customize |
| `compile.slurm` | SLURM batch script; launches the container and calls `compile.sh` |
| `compile.sh` | Runs inside the container; executes `build_wheel.py` |

Scripts directory: `skills/exec-slurm-compile/scripts/`

Instructions


Follow these steps in order:

Step 0: Resolve the Container Image (if needed)

If the user does not already have a `.sqsh` container image:

1. Read the Docker image tag from `<repo_dir>/jenkins/current_image_tags.properties`.
2. Use `enroot-import` to pre-dump it:
   ```bash
   cd <directory_for_sqsh_files>
   <scripts_dir>/enroot-import --partition cpu_datamover --debug <image_url>
   ```
3. Monitor the import job with `squeue -j <job_id>`.
4. Once complete, the `.sqsh` file path becomes the `container_image` parameter.

If the user already has a `.sqsh` file, skip this step.

Step 1: Gather Information

Ask the user for any missing prerequisite values listed above. At minimum you need:

- `container_image` (or the Docker image URL, in which case run Step 0 first)
- `repo_dir`
- `mount_dir`
- `partition` and `account`

If the user has used this workflow before, check if previous values are stored in memory files.
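One way to decide what still needs to be asked is to check which required values are unset. The sketch below assumes the values are held in shell variables named after the Prerequisites table; `missing_params` is a hypothetical helper, not one of the companion scripts.

```bash
# Print the name of every required parameter that is unset or empty.
# Variable names mirror the Prerequisites table; missing_params is hypothetical.
missing_params() {
    for name in container_image repo_dir mount_dir partition account; do
        eval "value=\${$name:-}"          # indirect lookup, safe when unset
        [ -n "$value" ] || printf '%s\n' "$name"
    done
}
```

Any name it prints is a value to request from the user before moving on to Step 2.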

Step 2: Prepare the Scripts Directory

The compile scripts must be accessible from inside the container (i.e., under `mount_dir`). Either:

Option A: Copy companion scripts to a location under `mount_dir`:

```bash
scripts_dir=<mount_dir>/<username>/workspace/tensorrt_llm_scripts
mkdir -p ${scripts_dir}/log
cp skills/exec-slurm-compile/scripts/compile.sh ${scripts_dir}/
cp skills/exec-slurm-compile/scripts/compile.slurm ${scripts_dir}/
chmod +x ${scripts_dir}/compile.sh ${scripts_dir}/compile.slurm
```

Option B: If the user already has scripts at a known location, use those directly.

Step 3: Submit the Job

Run `sbatch` from the login node (or a node with SLURM client access):

```bash
sbatch \
    --nodes=1 --ntasks=1 --ntasks-per-node=1 \
    --gres=gpu:<gpu_count> \
    --partition=<partition> \
    --account=<account> \
    --job-name=<jobname> \
    --time=<time_limit> \
    <scripts_dir>/compile.slurm \
    <container_image> <mount_dir> <scripts_dir> <repo_dir>
```

Capture and report the job ID from the `sbatch` output.
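Capturing the job ID can be sketched as follows. This assumes sbatch's default output line, `Submitted batch job <id>`; `parse_job_id` is a hypothetical helper name.

```bash
# Extract the numeric job ID from sbatch's default output line,
# e.g. "Submitted batch job 12345". parse_job_id is a hypothetical helper.
parse_job_id() {
    printf '%s\n' "$1" | sed -n 's/^Submitted batch job \([0-9][0-9]*\).*$/\1/p'
}

parse_job_id "Submitted batch job 12345"   # prints: 12345
```

Alternatively, `sbatch --parsable` prints the job ID on its own (followed by `;<cluster>` on multi-cluster setups), which avoids the parsing step.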

Step 4: Monitor the Job (Proactive — Do NOT Wait for User)

You MUST actively poll the job until it completes. Do not submit and walk away.

```bash
# Check job status (repeat every 30-60 seconds)
squeue -j <job_id> -o "%.18i %.9P %.30j %.8u %.2t %.10M %.6D %R"

# Once running, periodically tail the log (do NOT use tail -f, use tail -30 instead)
tail -30 <scripts_dir>/log/compile_<job_id>.srun.log
```

**Monitoring loop:**
1. Poll `squeue -j <job_id>` to check state
2. If `PD` (pending): report the reason, keep polling every 30-60s
3. If `R` (running): tail the build log every 30-60s; look for `[XX%] Building`, errors, or completion
4. If the job disappears from `squeue`, it has finished; proceed to Step 5
5. If `F` (failed): immediately read the full log and report the error

**Progress indicators to look for in the log:**
- `[XX%] Building CXX object...`: compilation progress
- `Linking CXX...`: link phase
- `FAILED:`, `error:`, `fatal error:`: build failure
- `Successfully built`: success
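The monitoring loop and log indicators above can be sketched in shell roughly as follows. Both function names are hypothetical, the 45-second interval is an arbitrary pick from the 30-60s range, and the loop assumes `squeue` is on `PATH`.

```bash
# Classify a chunk of build-log text (read from stdin) using the indicators above.
# Failure markers are checked first so an error is never masked by later output.
classify_log() {
    log_text=$(cat)
    if printf '%s\n' "$log_text" | grep -Eq 'FAILED:|error:'; then
        echo failed
    elif printf '%s\n' "$log_text" | grep -q 'Successfully built'; then
        echo success
    elif printf '%s\n' "$log_text" | grep -Eq '\[ *[0-9]+%\] Building|Linking CXX'; then
        echo building
    else
        echo unknown
    fi
}

# Poll until the job leaves the queue, reporting state and log status each round.
# Arguments: <job_id> <log_file>. Uses tail -30, never tail -f.
wait_for_job() {
    job_id=$1
    log_file=$2
    while state=$(squeue -h -j "$job_id" -o %t 2>/dev/null) && [ -n "$state" ]; do
        echo "state=$state log=$(tail -30 "$log_file" 2>/dev/null | classify_log)"
        sleep 45
    done
}
```

A call such as `wait_for_job <job_id> <scripts_dir>/log/compile_<job_id>.srun.log` returns once the job is no longer listed, at which point Step 5 applies. The `error:` pattern is deliberately broad and may also match benign lines, so treat a `failed` result as a prompt to read the log rather than a verdict.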

Step 5: Verify the Build

Once the job completes, check for success:

```bash
# Check SLURM exit code
sacct -j <job_id> --format=JobID,State,ExitCode,Elapsed

# Check the build log for errors
tail -50 <scripts_dir>/log/compile_<job_id>.srun.log
```

A successful build ends with a message like `Successfully built tensorrt_llm` or completes without error.
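For scripted verification, the `sacct` check can be reduced to a single comparison. The sketch below assumes the job-level record is requested with `-X -n -P` (allocation only, no header, pipe-delimited); `job_succeeded` is a hypothetical helper name.

```bash
# Decide success from a "State|ExitCode" line as printed by:
#   sacct -j <job_id> -X -n -P --format=State,ExitCode
# job_succeeded is a hypothetical helper name.
job_succeeded() {
    [ "$1" = "COMPLETED|0:0" ]
}

if job_succeeded "COMPLETED|0:0"; then
    echo "build job finished cleanly"
fi
```

A clean SLURM exit does not by itself prove the wheel was built, so the log check above is still worth doing.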

Common Build Flags Reference

| Flag | Description |
| --- | --- |
| `--trt_root /usr/local/tensorrt` | TensorRT installation path (standard in NVIDIA containers) |
| `--benchmarks` | Build the C++ benchmarks |
| `-a "100-real"` | Target architecture: `100` for Blackwell, `90` for Hopper, etc. |
| `--nvtx` | Enable NVTX markers for profiling |
| `--no-venv` | Skip virtual environment creation |
| `--use_ccache` | Use ccache to speed up recompilation |
| `--skip_building_wheel` | Build in-place without creating a wheel file |
| `-f` | Fast build: skip some kernels for faster dev compilation |
| `-c` | Clean build: wipe build directory before building |

Common architecture values:

- `"100-real"`: Blackwell (B200, GB200)
- `"90-real"`: Hopper (H100, H200)
- `"89-real"`: Ada Lovelace (L40S)
- `"80-real"`: Ampere (A100)
- `"90;100-real"`: Multiple architectures
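If the user names a GPU model rather than an architecture, the list above can be turned into a small lookup. `arch_for_gpu` is a hypothetical helper and covers only the GPUs named in this document.

```bash
# Map a GPU model to the -a value from the architecture list above.
# arch_for_gpu is a hypothetical helper; unlisted models are rejected.
arch_for_gpu() {
    case "$1" in
        B200|GB200) echo "100-real" ;;
        H100|H200)  echo "90-real" ;;
        L40S)       echo "89-real" ;;
        A100)       echo "80-real" ;;
        *)          echo "unknown GPU: $1" >&2; return 1 ;;
    esac
}

arch_for_gpu H100   # prints: 90-real
```

The result is what would be passed as the `arch` parameter; for anything not in the list, ask the user rather than guessing.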

Troubleshooting

| Issue | Solution |
| --- | --- |
| `sbatch: error: invalid partition` | Verify partition name with `sinfo -s` |
| `sbatch: error: invalid account` | Check available accounts with `sacctmgr show assoc user=$USER` |
| Container image not found | Verify the `.sqsh` path exists and is readable |
| Build fails with missing TensorRT | Ensure `--trt_root` points to the correct path inside the container |
| Build OOM (out of memory) | Reduce parallelism with `-j <N>` flag to `build_wheel.py` |
| `srun: error: Unable to create step` | The node may lack enroot/pyxis; check with cluster admin |
| Job stuck in `PD` state | Check `squeue -j <id> -o %R` for the reason (e.g., resource limits, priority) |
| `enroot import` fails with auth error | Check `~/.config/enroot/.credentials` has the correct registry credentials |
| `enroot import` produces empty/corrupt `.sqsh` | Re-run with `--debug` and check the SLURM log; verify the image URL has no `https://` prefix |
| Weird compile issues | Retry with a clean build (`-c` flag) |
| `QOSGrpNodeLimit` shown in `NODELIST(REASON)` | Not a blocker, just wait for the job to get scheduled |

Example Interaction

User: "Compile TRT-LLM on the OCI cluster"

Agent actions:

1. Ask for container image path, repo path, mount dir (if not known)
2. Confirm partition/account for OCI cluster
3. Copy scripts to accessible location under `mount_dir`
4. Submit with `sbatch`
5. Report job ID
6. Monitor with `squeue` until complete
7. Check logs and report success/failure