exec-slurm-compile
Compile TensorRT-LLM on SLURM Cluster
Submit, monitor, and verify a TensorRT-LLM compilation job on a SLURM cluster using enroot containers.
When to Use
| Scenario | Use This Skill? |
|---|---|
| User wants to compile TRT-LLM on a SLURM cluster | Yes |
| User is already on a compute node and wants to compile | No — use |
Finding the Docker Image
The official Docker image tag for a given TensorRT-LLM version is recorded in the repo itself:

`<repo_dir>/jenkins/current_image_tags.properties`

Read this file to find the current image URL (e.g., `urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm:pytorch-25.12-py3-aarch64-ubuntu24.04-trt10.14.1.48-skip-tritondevel-202602011118-10901`).
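As a rough sketch of reading that file, the snippet below assumes the properties file uses plain `KEY=value` lines. The key names are not shown in this document, so none are assumed; the helper simply prints every value that looks like an image reference. The name `list_image_urls` is hypothetical.

```bash
# Hypothetical helper: print image URLs found in a KEY=value properties file.
# Assumption: one KEY=value pair per line; no particular key names are assumed.
list_image_urls() {
  local props_file="$1"
  # Drop comment lines, keep the text after the first '=', and keep only
  # values shaped like registry/repository:tag.
  grep -v '^[[:space:]]*#' "$props_file" \
    | sed -n 's/^[^=]*=//p' \
    | grep -E '^[^[:space:]]+/[^[:space:]]+:[^[:space:]]+$'
}
```

Usage would look like `list_image_urls <repo_dir>/jenkins/current_image_tags.properties`. If the file lists several images (for example per architecture), pick the one matching the target nodes.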
Pre-dumping the Container Image (enroot import)
SLURM clusters using enroot/pyxis require a `.sqsh` container image. To avoid download overhead at compile time, pre-dump the image in advance using the companion `enroot-import` script:

```bash
# Basic usage: submits a SLURM job on a CPU partition to import the image
enroot-import --partition cpu_datamover --debug <docker_image_url>
```

The script submits an `sbatch` job that runs `enroot import docker://<image_url>` and produces a `.sqsh` file in the current directory. It prints the SLURM job ID on stdout.
enroot-import flags
| Flag | Description |
|---|---|
| `--partition` | SLURM partition for the import job (use a CPU partition like `cpu_datamover`) |
| `--debug` | Enable debug output and preserve the SLURM log (recommended) |
| | Custom output path for the `.sqsh` file |
| | SLURM account (defaults to user's first account) |
| | Time limit for the import job (default: 1 hour) |
| | Print the sbatch command without executing |
| | Custom job name |
enroot-import workflow
- Read the image tag from `jenkins/current_image_tags.properties` in the TRT-LLM repo.
- Run `enroot-import` to submit the import job:
  ```bash
  cd <directory_where_sqsh_should_be_stored>
  <path_to>/enroot-import --partition cpu_datamover --debug <image_url>
  ```
  IMPORTANT: Convert `urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm:xxx` to `urm.nvidia.com#sw-tensorrt-docker/tensorrt-llm:xxx` to avoid credential issues.
- Wait for the import job to complete (`squeue -j <job_id>`).
- The resulting `.sqsh` file is the `container_image` used in the compile step.
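The URL conversion in the workflow above replaces the first `/` (the one after the registry host) with `#`. A minimal sketch of that rewrite follows; the helper name `to_enroot_url` is hypothetical, and the input is assumed to carry no scheme prefix such as `docker://`.

```bash
# Hypothetical helper: rewrite "registry/path:tag" as "registry#path:tag".
# Assumption: the input has no scheme prefix such as "docker://".
to_enroot_url() {
  local url="$1"
  case "$url" in
    */*) ;;                                   # has a registry/path separator
    *) echo "no '/' found in: $url" >&2; return 1 ;;
  esac
  local registry="${url%%/*}"   # text before the first '/'
  local path="${url#*/}"        # text after the first '/'
  printf '%s#%s\n' "$registry" "$path"
}
```

The result can then be passed as `<image_url>` to `enroot-import`.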
Prerequisites
The user must provide (or you must ask for) these values:

| Parameter | Description | Example |
|---|---|---|
| `container_image` | Path to the `.sqsh` container image | |
| `repo_dir` | Path to the TensorRT-LLM repository | |
| `mount_dir` | Top-level directory to bind-mount into the container | |
| `partition` | SLURM partition | |
| `account` | SLURM account | |

Optional parameters:

| Parameter | Description | Default |
|---|---|---|
| `jobname` | SLURM job name | |
| `gpu_count` | Number of GPUs to request | |
| `time_limit` | Job time limit | |
| | GPU architecture(s) for the build | |
| | Extra flags for the build | (none) |
Companion Scripts
This skill includes the following companion scripts in `scripts/`:

| Script | Purpose |
|---|---|
| `enroot-import` | Pre-dump a Docker image to `.sqsh` via a SLURM batch job |
| | Template for submitting the SLURM job; copy and customize |
| `compile.slurm` | SLURM batch script; launches the container and calls `compile.sh` |
| `compile.sh` | Runs inside the container and executes the build |

Scripts directory: `skills/exec-slurm-compile/scripts/`

Instructions
Follow these steps in order:
Step 0: Resolve the Container Image (if needed)
If the user does not already have a `.sqsh` container image:

- Read the Docker image tag from `<repo_dir>/jenkins/current_image_tags.properties`.
- Use `enroot-import` to pre-dump it:
  ```bash
  cd <directory_for_sqsh_files>
  <scripts_dir>/enroot-import --partition cpu_datamover --debug <image_url>
  ```
- Monitor the import job with `squeue -j <job_id>`.
- Once complete, the `.sqsh` file path becomes the `container_image` parameter.

If the user already has a `.sqsh` file, skip this step.
Step 1: Gather Information
Ask the user for any missing prerequisite values listed above. At minimum you need:
- `container_image` (or the Docker image URL; then run Step 0 first)
- `repo_dir`
- `mount_dir`
- `partition` and `account`

If the user has used this workflow before, check if previous values are stored in memory files.
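One way to see which required values are still missing before submitting is sketched below. It assumes the values are held in shell variables named after the parameters above; the helper name `missing_params` is hypothetical.

```bash
# Hypothetical helper: print the name of every required parameter that is
# unset or empty. Variable names mirror the prerequisites list.
missing_params() {
  local name value
  for name in container_image repo_dir mount_dir partition account; do
    # Indirect lookup of the variable called $name (names are fixed
    # literals above, so eval is safe here).
    eval "value=\${$name:-}"
    [ -n "$value" ] || printf '%s\n' "$name"
  done
}
```

An empty result means every required value is set; otherwise each printed name is something to ask the user for.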
Step 2: Prepare the Scripts Directory
The compile scripts must be accessible from inside the container (i.e., under `mount_dir`). Either:

Option A: Copy companion scripts to a location under `mount_dir`:

```bash
scripts_dir=<mount_dir>/<username>/workspace/tensorrt_llm_scripts
mkdir -p "${scripts_dir}/log"
cp skills/exec-slurm-compile/scripts/compile.sh "${scripts_dir}/"
cp skills/exec-slurm-compile/scripts/compile.slurm "${scripts_dir}/"
chmod +x "${scripts_dir}/compile.sh" "${scripts_dir}/compile.slurm"
```

Option B: If the user already has scripts at a known location, use those directly.
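Because the scripts are only visible inside the container when they live under `mount_dir`, a quick sanity check before submitting can save a failed job. The sketch below is a plain string comparison (it does not resolve symlinks or `..`), and the helper name `is_under_dir` is hypothetical.

```bash
# Hypothetical helper: succeed if path $1 is located under directory $2.
# Pure string comparison; symlinks and ".." are not resolved.
is_under_dir() {
  local path="${1%/}"      # strip one trailing slash, if any
  local parent="${2%/}"
  case "$path/" in
    "$parent"/*) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, `is_under_dir "$scripts_dir" "$mount_dir" || echo "scripts_dir is outside mount_dir"`.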
Step 3: Submit the Job
Run `sbatch` from the login node (or a node with SLURM client access):

```bash
sbatch \
  --nodes=1 --ntasks=1 --ntasks-per-node=1 \
  --gres=gpu:<gpu_count> \
  --partition=<partition> \
  --account=<account> \
  --job-name=<jobname> \
  --time=<time_limit> \
  <scripts_dir>/compile.slurm \
  <container_image> <mount_dir> <scripts_dir> <repo_dir>
```

Capture and report the job ID from the `sbatch` output.
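Capturing the job ID can be sketched as follows. By default `sbatch` prints a line of the form `Submitted batch job <id>`, and with `--parsable` it prints `<id>` or `<id>;<cluster>`. The helper name `parse_job_id` is hypothetical.

```bash
# Hypothetical helper: extract the numeric job ID from sbatch output.
# Handles the default "Submitted batch job <id>" line and the
# "<id>" / "<id>;<cluster>" form printed by `sbatch --parsable`.
parse_job_id() {
  printf '%s\n' "$1" \
    | sed -n -E 's/^(Submitted batch job )?([0-9]+).*$/\2/p' \
    | head -n 1
}
```

A typical use is `job_id="$(parse_job_id "$(sbatch ...)")"`; an empty result means the submission output did not contain a job ID and should be reported as an error.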
Step 4: Monitor the Job (Proactive: Do NOT Wait for User)
You MUST actively poll the job until it completes. Do not submit and walk away.

```bash
# Check job status (repeat every 30-60 seconds)
squeue -j <job_id> -o "%.18i %.9P %.30j %.8u %.2t %.10M %.6D %R"

# Once running, periodically tail the log (do NOT use tail -f, use tail -30 instead)
tail -30 <scripts_dir>/log/compile_<job_id>.srun.log
```

**Monitoring loop:**
1. Poll `squeue -j <job_id>` to check state
2. If `PD` (pending): report the reason, keep polling every 30-60s
3. If `R` (running): tail the build log every 30-60s; look for `[XX%] Building`, errors, or completion
4. If the job disappears from `squeue`, it has finished; proceed to Step 5
5. If `F` (failed): immediately read the full log and report the error

**Progress indicators to look for in the log:**
- `[XX%] Building CXX object...`: compilation progress
- `Linking CXX...`: link phase
- `FAILED:`, `error:`, `fatal error:`: build failure
- `Successfully built`: success
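The monitoring loop and log markers above can be sketched as two small helpers. Both names (`classify_build_log`, `wait_for_job`) are hypothetical, and the classification is a simple text match on the markers listed above, so a warning that happens to contain `error:` would also be reported as a failure.

```bash
# Hypothetical helper: classify a build log using the markers listed above.
classify_build_log() {
  local log_file="$1"
  # 'error:' also covers 'fatal error:'.
  if grep -qE 'FAILED:|error:' "$log_file"; then
    echo "failed"
  elif grep -q 'Successfully built' "$log_file"; then
    echo "succeeded"
  else
    echo "in-progress"
  fi
}

# Hypothetical helper: poll squeue until the job is no longer listed.
# $1 = job ID, $2 = polling interval in seconds (default 30).
wait_for_job() {
  local job_id="$1"
  local interval="${2:-30}"
  local state
  while true; do
    # -h suppresses the header; %t prints the compact state (PD, R, ...).
    state="$(squeue -h -j "$job_id" -o '%t' 2>/dev/null)"
    [ -z "$state" ] && break      # job left the queue: it has finished
    echo "job ${job_id} state: ${state}"
    sleep "$interval"
  done
}
```

In practice the loop body would also run `tail -30` on the build log while the state is `R`, and pass the log to `classify_build_log` once the job has left the queue.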
Step 5: Verify the Build
Once the job completes, check for success:

```bash
# Check SLURM exit code
sacct -j <job_id> --format=JobID,State,ExitCode,Elapsed

# Check the build log for errors
tail -50 <scripts_dir>/log/compile_<job_id>.srun.log
```

A successful build ends with a message like `Successfully built tensorrt_llm` or completes without error.
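For a scripted check, `sacct` can emit a single machine-readable line for the job allocation with `-X --noheader --parsable2 --format=State,ExitCode`, which yields text such as `COMPLETED|0:0`. The sketch below interprets one such line; the helper name `check_sacct_line` is hypothetical.

```bash
# Hypothetical helper: interpret one "State|ExitCode" line from sacct.
# Succeeds only for state COMPLETED with exit code 0:0.
check_sacct_line() {
  local line="$1"
  local state="${line%%|*}"      # text before the first '|'
  local exit_code="${line#*|}"   # text after the first '|'
  if [ "$state" = "COMPLETED" ] && [ "$exit_code" = "0:0" ]; then
    echo "build job succeeded"
  else
    echo "build job did not succeed (state=${state}, exit=${exit_code})"
    return 1
  fi
}
```

It could be called as `check_sacct_line "$(sacct -j <job_id> -X --noheader --parsable2 --format=State,ExitCode)"`. A passing SLURM state is still worth confirming against the tail of the build log.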
Common Build Flags Reference
| Flag | Description |
|---|---|
| | TensorRT installation path (standard in NVIDIA containers) |
| | Build the C++ benchmarks |
| | Target architecture |
| | Enable NVTX markers for profiling |
| | Skip virtual environment creation |
| | Use ccache to speed up recompilation |
| | Build in-place without creating a wheel file |
| | Fast build: skip some kernels for faster dev compilation |
| | Clean build: wipe build directory before building |

Common architecture values:
- `"100-real"`: Blackwell (B200, GB200)
- `"90-real"`: Hopper (H100, H200)
- `"89-real"`: Ada Lovelace (L40S)
- `"80-real"`: Ampere (A100)
- `"90;100-real"`: Multiple architectures
Troubleshooting
| Issue | Solution |
|---|---|
| | Verify the partition name |
| | Check available accounts |
| Container image not found | Verify the |
| Build fails with missing TensorRT | Ensure |
| Build OOM (out of memory) | Reduce parallelism |
| | The node may lack enroot/pyxis; check with cluster admin |
| Job stuck in | Check |
| | Check |
| | Re-run with |
| Weird compile issues | Retry with a clean build |
| | Not a blocker, just wait for the job to get scheduled |
Example Interaction
User: "Compile TRT-LLM on the OCI cluster"

Agent actions:
- Ask for container image path, repo path, mount dir (if not known)
- Confirm partition/account for OCI cluster
- Copy scripts to accessible location under `mount_dir`
- Submit with `sbatch`
- Report job ID
- Monitor with `squeue` until complete
- Check logs and report success/failure