exec-slurm-compile

Compile TensorRT-LLM on SLURM Cluster

Submit, monitor, and verify a TensorRT-LLM compilation job on a SLURM cluster using enroot containers.

When to Use

Scenario	Use This Skill?
User wants to compile TRT-LLM on a SLURM cluster	Yes
User is already on a compute node and wants to compile	No — use `exec-local-compile` skill instead

Finding the Docker Image

The official Docker image tag for a given TensorRT-LLM version is recorded in the repo itself:

<repo_dir>/jenkins/current_image_tags.properties

Read this file to find the current image URL (e.g., `urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm:pytorch-25.12-py3-aarch64-ubuntu24.04-trt10.14.1.48-skip-tritondevel-202602011118-10901`).
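
Pulling the URLs out of that file can be scripted. The sketch below assumes the file holds plain `KEY=value` lines; the key names are not documented here, so the hypothetical helper `list_image_urls` simply prints every value and leaves the choice of image to the reader.

```bash
# Print the value of every KEY=value line in a properties file,
# skipping blank lines and "#" comments.
# Assumption: the file uses plain KEY=value syntax.
list_image_urls() {
    local props_file="$1"
    grep -v -E '^[[:space:]]*(#|$)' "$props_file" | cut -s -d= -f2-
}
```

For example, `list_image_urls <repo_dir>/jenkins/current_image_tags.properties` would print one candidate image URL per line.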

Pre-dumping the Container Image (enroot import)

SLURM clusters using enroot/pyxis require a `.sqsh` container image. To avoid download overhead at compile time, pre-dump the image in advance using the `enroot-import` companion script:

```bash
# Basic usage: submits a SLURM job on a CPU partition to import the image
enroot-import --partition cpu_datamover --debug <docker_image_url>
```

The script submits an `sbatch` job that runs `enroot import docker://<image_url>` and produces a `.sqsh` file in the current directory. The output on stdout is the SLURM job ID.

enroot-import flags

Flag	Description
`-p, --partition`	SLURM partition for the import job (use a CPU partition like `cpu_datamover`)
`-d, --debug`	Enable debug output and preserve the SLURM log (recommended)
`-o, --output`	Custom output path for the `.sqsh` file
`-A, --account`	SLURM account (defaults to user's first account)
`-t, --time`	Time limit for the import job (default: 1 hour)
`-n, --just-print`	Print the sbatch command without executing
`-J, --job-name`	Custom job name

enroot-import workflow

1. Read the image tag from `jenkins/current_image_tags.properties` in the TRT-LLM repo.
2. Run `enroot-import` to submit the import job:

   ```bash
   cd <directory_where_sqsh_should_be_stored>
   <path_to>/enroot-import --partition cpu_datamover --debug <image_url>
   ```

   IMPORTANT: Convert `urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm:xxx` to `urm.nvidia.com#sw-tensorrt-docker/tensorrt-llm:xxx` to avoid credential issues.
3. Wait for the import job to complete (`squeue -j <job_id>`).
4. The resulting `.sqsh` file is the `container_image` used in the compile step.
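
The URL conversion described above is a single substitution: the first `/`, which separates the registry host from the image path, becomes `#`, the separator enroot uses between registry and image in a `docker://` URI. A minimal sketch, with `to_enroot_url` as a hypothetical helper name:

```bash
# Convert "registry/path/image:tag" to "registry#path/image:tag".
# A URL that already contains "#" is passed through unchanged.
to_enroot_url() {
    case "$1" in
        *'#'*) printf '%s\n' "$1" ;;
        *)     printf '%s\n' "$1" | sed 's|/|#|' ;;   # first "/" only
    esac
}
```

The result can be passed straight to the import script, for example `enroot-import --partition cpu_datamover --debug "$(to_enroot_url <image_url>)"`.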

Prerequisites

The user must provide (or you must ask for) these values:

Parameter	Description	Example
`container_image`	Path to `.sqsh` container image (see enroot import above)	`/path/to/pytorch.sqsh`
`repo_dir`	Path to the TensorRT-LLM repository	`/path/to/TensorRT-LLM`
`mount_dir`	Top-level directory to bind-mount into the container	`/shared/users`
`partition`	SLURM partition	`batch`
`account`	SLURM account	`my_account`

Optional parameters:

Parameter	Description	Default
`jobname`	SLURM job name	`trtllm-compile.<username>`
`gpu_count`	Number of GPUs to request	`4`
`time_limit`	Job time limit	`02:00:00`
`arch`	GPU architecture(s) for `-a` flag	`100-real`
`extra_build_args`	Extra flags for `build_wheel.py`	(none)
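
One way to apply these tables in a wrapper script is sketched below. It assumes the parameters are held in shell variables named exactly as in the tables and that `<username>` is taken from `$USER`; `apply_defaults` and `missing_required` are hypothetical helper names.

```bash
# Fill in optional parameters with the documented defaults,
# keeping any value the caller has already set.
apply_defaults() {
    jobname="${jobname:-trtllm-compile.${USER:-unknown}}"
    gpu_count="${gpu_count:-4}"
    time_limit="${time_limit:-02:00:00}"
    arch="${arch:-100-real}"
    extra_build_args="${extra_build_args:-}"
}

# Print the name of each required parameter that is still empty,
# one per line, so the caller knows what to ask the user for.
missing_required() {
    local name value
    for name in container_image repo_dir mount_dir partition account; do
        eval "value=\${$name:-}"
        [ -n "$value" ] || printf '%s\n' "$name"
    done
}
```

An empty result from `missing_required` means every required value is present and the job can be submitted.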

Companion Scripts

This skill includes four companion scripts in `scripts/`:

Script	Purpose
`enroot-import`	Pre-dump a Docker image to `.sqsh` via a SLURM batch job
`submit_compile.sh`	Template for submitting the SLURM job — copy and customize
`compile.slurm`	SLURM batch script — launches the container and calls `compile.sh`
`compile.sh`	Runs inside the container — executes `build_wheel.py`

Scripts directory: `skills/exec-slurm-compile/scripts/`

Instructions

Follow these steps in order:

Step 0: Resolve the Container Image (if needed)

If the user does not already have a `.sqsh` container image:

1. Read the Docker image tag from `<repo_dir>/jenkins/current_image_tags.properties`.
2. Use `enroot-import` to pre-dump it:

   ```bash
   cd <directory_for_sqsh_files>
   <scripts_dir>/enroot-import --partition cpu_datamover --debug <image_url>
   ```

3. Monitor the import job with `squeue -j <job_id>`.
4. Once complete, the `.sqsh` file path becomes the `container_image` parameter.

If the user already has a `.sqsh` file, skip this step.

Step 1: Gather Information

Ask the user for any missing prerequisite values listed above. At minimum you need:

- `container_image` (or the Docker image URL, in which case run Step 0 first)
- `repo_dir`
- `mount_dir`
- `partition` and `account`

If the user has used this workflow before, check if previous values are stored in memory files.

Step 2: Prepare the Scripts Directory

The compile scripts must be accessible from inside the container (i.e., under `mount_dir`). Either:

Option A: Copy companion scripts to a location under `mount_dir`:

```bash
scripts_dir=<mount_dir>/<username>/workspace/tensorrt_llm_scripts
# Create the target directory together with the log/ subdirectory
mkdir -p "${scripts_dir}/log"
cp skills/exec-slurm-compile/scripts/compile.sh "${scripts_dir}/"
cp skills/exec-slurm-compile/scripts/compile.slurm "${scripts_dir}/"
chmod +x "${scripts_dir}/compile.sh" "${scripts_dir}/compile.slurm"
```

Option B: If the user already has scripts at a known location, use those directly.
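
With either option it is worth confirming that the scripts directory really sits under `mount_dir` before submitting. A minimal sketch of that check follows; `is_under` is a hypothetical helper, and the comparison is purely textual, so symlinks and `..` components are not resolved.

```bash
# Succeed if path $1 is the directory $2 itself or lies somewhere below it.
is_under() {
    local path="${1%/}" root="${2%/}"   # drop one trailing slash, if any
    case "$path" in
        "$root"|"$root"/*) return 0 ;;
        *)                 return 1 ;;
    esac
}
```

For example, `is_under "$scripts_dir" "$mount_dir" || echo "scripts are not visible inside the container"`.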

Step 3: Submit the Job

Run `sbatch` from the login node (or a node with SLURM client access):

```bash
sbatch \
    --nodes=1 --ntasks=1 --ntasks-per-node=1 \
    --gres=gpu:<gpu_count> \
    --partition=<partition> \
    --account=<account> \
    --job-name=<jobname> \
    --time=<time_limit> \
    <scripts_dir>/compile.slurm \
    <container_image> <mount_dir> <scripts_dir> <repo_dir>
```

Capture and report the job ID from the `sbatch` output.
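
Capturing the job ID can be done by parsing what `sbatch` prints. The sketch below handles the default `Submitted batch job <id>` line as well as the `--parsable` form (`<id>` or `<id>;<cluster>`); `parse_job_id` is a hypothetical helper name.

```bash
# Extract the numeric job ID from sbatch output.
# Prints nothing if the text does not look like a successful submission.
parse_job_id() {
    printf '%s\n' "$1" \
        | sed -n -E 's/^(Submitted batch job )?([0-9]+)(;.*)?$/\2/p' \
        | head -n 1
}
```

A typical use would be `job_id=$(parse_job_id "$(sbatch ... )")`, followed by a check that `job_id` is non-empty before moving on to monitoring.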

Step 4: Monitor the Job (Proactive — Do NOT Wait for User)

You MUST actively poll the job until it completes. Do not submit and walk away.

```bash
# Check job status (repeat every 30-60 seconds)
squeue -j <job_id> -o "%.18i %.9P %.30j %.8u %.2t %.10M %.6D %R"

# Once running, periodically tail the log (do NOT use tail -f, use tail -30 instead)
tail -30 <scripts_dir>/log/compile_<job_id>.srun.log
```

**Monitoring loop:**
1. Poll `squeue -j <job_id>` to check state
2. If `PD` (pending) — report the reason, keep polling every 30-60s
3. If `R` (running) — tail the build log every 30-60s; look for `[XX%] Building`, errors, or completion
4. If the job disappears from `squeue`, it has finished — proceed to Step 5
5. If `F` (failed) — immediately read the full log and report the error
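
The loop above could be sketched as follows. This is an illustration rather than part of the companion scripts: `next_action` and `monitor_job` are hypothetical names, the state is read with `squeue -h -o %t` (no header, compact state code), and any state not listed above is simply polled again.

```bash
# Map a compact squeue state code to the action the loop calls for.
next_action() {
    case "$1" in
        PD) echo "wait" ;;       # pending: keep polling
        R)  echo "tail_log" ;;   # running: show recent build output
        F)  echo "read_log" ;;   # failed: read the full log
        "") echo "verify" ;;     # gone from squeue: proceed to Step 5
        *)  echo "wait" ;;       # other states (assumed transitional)
    esac
}

# Poll until the job leaves the queue (returns 0) or fails (returns 1).
# Arguments: job ID, log file, poll interval in seconds (default 30).
monitor_job() {
    local job_id="$1" log_file="$2" interval="${3:-30}" state action
    while :; do
        state=$(squeue -h -j "$job_id" -o %t 2>/dev/null)
        action=$(next_action "$state")
        case "$action" in
            tail_log) tail -n 30 "$log_file" ;;
            read_log) cat "$log_file"; return 1 ;;
            verify)   return 0 ;;
        esac
        sleep "$interval"
    done
}
```

The functions are only defined here; calling `monitor_job <job_id> <scripts_dir>/log/compile_<job_id>.srun.log 30` requires a live SLURM cluster.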

**Progress indicators to look for in the log:**
- `[XX%] Building CXX object...` — compilation progress
- `Linking CXX...` — link phase
- `FAILED:`, `error:`, `fatal error:` — build failure
- `Successfully built` — success
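
These indicators lend themselves to a simple log classifier. The sketch below is one possible reading of them; `log_status` is a hypothetical helper, and the bare `error:` match is deliberately broad, so it may flag benign lines that merely mention that text.

```bash
# Classify a build log by the indicators listed above.
# Failure markers take precedence over success and progress markers.
log_status() {
    local log_file="$1"
    if grep -q -E 'FAILED:|error:' "$log_file"; then
        echo "failed"      # "error:" also covers "fatal error:"
    elif grep -q 'Successfully built' "$log_file"; then
        echo "success"
    elif grep -q -E '\[ *[0-9]+%\] Building|Linking CXX' "$log_file"; then
        echo "building"
    else
        echo "unknown"
    fi
}
```

Running `log_status <scripts_dir>/log/compile_<job_id>.srun.log` on each poll gives a one-word summary to report alongside the tail of the log.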

Step 5: Verify the Build

Once the job completes, check for success:

```bash
# Check SLURM exit code
sacct -j <job_id> --format=JobID,State,ExitCode,Elapsed

# Check the build log for errors
tail -50 <scripts_dir>/log/compile_<job_id>.srun.log
```

A successful build ends with a message like `Successfully built tensorrt_llm` or completes without error.
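
The accounting check can also be automated. The sketch below assumes `sacct` is called with `-X -n -P` (allocation only, no header, pipe-separated) in addition to the format above, so that it prints one line shaped like `12345|COMPLETED|0:0|00:41:07`; `job_succeeded` is a hypothetical helper that inspects such a line.

```bash
# Succeed only if a "JobID|State|ExitCode|Elapsed" line reports
# State COMPLETED with exit code 0:0.
job_succeeded() {
    local state exit_code
    state=$(printf '%s\n' "$1" | cut -d'|' -f2)
    exit_code=$(printf '%s\n' "$1" | cut -d'|' -f3)
    [ "$state" = "COMPLETED" ] && [ "$exit_code" = "0:0" ]
}
```

For example, `job_succeeded "$(sacct -X -n -P -j <job_id> --format=JobID,State,ExitCode,Elapsed)"`. A passing result should still be cross-checked against the build log, as described above.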

Common Build Flags Reference

Flag	Description
`--trt_root /usr/local/tensorrt`	TensorRT installation path (standard in NVIDIA containers)
`--benchmarks`	Build the C++ benchmarks
`-a "100-real"`	Target architecture — `100` for Blackwell, `90` for Hopper, etc.
`--nvtx`	Enable NVTX markers for profiling
`--no-venv`	Skip virtual environment creation
`--use_ccache`	Use ccache to speed up recompilation
`--skip_building_wheel`	Build in-place without creating a wheel file
`-f`	Fast build — skip some kernels for faster dev compilation
`-c`	Clean build — wipe build directory before building

Common architecture values:

- `"100-real"`: Blackwell (B200, GB200)
- `"90-real"`: Hopper (H100, H200)
- `"89-real"`: Ada Lovelace (L40S)
- `"80-real"`: Ampere (A100)
- `"90;100-real"`: Multiple architectures
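
When the target GPU model is known, the `-a` value can be looked up rather than typed by hand. The sketch below covers only the models named in the list above; `arch_for_gpu` is a hypothetical helper.

```bash
# Map a GPU model name to the matching -a value.
# Prints nothing and returns 1 for models the list does not cover.
arch_for_gpu() {
    case "$1" in
        B200|GB200) echo "100-real" ;;
        H100|H200)  echo "90-real" ;;
        L40S)       echo "89-real" ;;
        A100)       echo "80-real" ;;
        *)          return 1 ;;
    esac
}
```

A caller might use `arch=$(arch_for_gpu H100) || arch=<ask the user>` so that an unknown model falls back to a question instead of a guess.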

Troubleshooting

Issue	Solution
`sbatch: error: invalid partition`	Verify partition name with `sinfo -s`
`sbatch: error: invalid account`	Check available accounts with `sacctmgr show assoc user=$USER`
Container image not found	Verify the `.sqsh` path exists and is readable
Build fails with missing TensorRT	Ensure `--trt_root` points to the correct path inside the container
Build OOM (out of memory)	Reduce parallelism with `-j <N>` flag to `build_wheel.py`
`srun: error: Unable to create step`	The node may lack enroot/pyxis — check with cluster admin
Job stuck in `PD` state	Check `squeue -j <id> -o %R` for the reason (e.g., resource limits, priority)
`enroot import` fails with auth error	Check `~/.config/enroot/.credentials` has the correct registry credentials
`enroot import` produces empty/corrupt `.sqsh`	Re-run with `--debug` and check the SLURM log; verify the image URL has no `https://` prefix
Unexplained compile failures	Retry with a clean build (`-c` flag)
`QOSGrpNodeLimit` shown in `NODELIST(REASON)`	Not a blocker, just wait for the job to get scheduled
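
Two of the issues above, an `https://` prefix on the image URL and a missing or empty `.sqsh` file, can be caught before anything is submitted. A minimal pre-flight sketch, with `check_image_url` and `check_sqsh` as hypothetical helper names:

```bash
# Reject an image URL that is empty or carries an https:// prefix.
check_image_url() {
    case "$1" in
        https://*) echo "image URL must not start with https://: $1" >&2; return 1 ;;
        "")        echo "image URL is empty" >&2; return 1 ;;
    esac
}

# Require the .sqsh file to exist, be readable, and be non-empty.
check_sqsh() {
    [ -r "$1" ] && [ -s "$1" ] || {
        echo "missing, unreadable, or empty: $1" >&2
        return 1
    }
}
```

Both helpers print a short reason on stderr and return non-zero on failure, so they can gate the `enroot-import` and `sbatch` calls respectively. A non-empty file is not proof of a valid image, so the SLURM log remains the authority for corrupt imports.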

Example Interaction

User: "Compile TRT-LLM on the OCI cluster"

Agent actions:

1. Ask for container image path, repo path, mount dir (if not known)
2. Confirm partition/account for OCI cluster
3. Copy scripts to accessible location under `mount_dir`
4. Submit with `sbatch`
5. Report job ID
6. Monitor with `squeue` until complete
7. Check logs and report success/failure
