tao-run-on-slurm
SLURM
Remote GPU compute platform for clusters managed by SLURM. Jobs are submitted
from the TAO service or SDK host to a login node over SSH, staged on a shared
filesystem, submitted with `sbatch`, and executed with `srun` container support.

Use SLURM when the user has access to a managed GPU cluster, shared Lustre
storage, and scheduler-owned GPU allocation. Do not use SLURM for local files
that exist only on the agent machine; data and outputs must be reachable from
the cluster.
Preflight

```bash
# 1. SSH to the login node works without a password prompt
SLURM_HOST="${SLURM_HOSTNAME%%,*}"
[ -n "$SLURM_USER" ] && [ -n "$SLURM_HOST" ] || {
  echo "MISSING: set SLURM_USER and SLURM_HOSTNAME (comma-separated for failover) in your env (~/.config/tao/.env)."
  exit 1
}
ssh -o BatchMode=yes -o ConnectTimeout=10 "${SLURM_USER}@${SLURM_HOST}" "true" 2>/dev/null || {
  echo "MISSING: passwordless SSH to ${SLURM_USER}@${SLURM_HOST} not working. See references/ssh-setup.md."
  exit 1
}

# 2. Optional: TAO SDK wrapper for Job handles + S3 wrapping.
#    nvidia-tao-sdk is on public PyPI; pin lives in versions.yaml (wheels.tao_sdk_slurm).
PIN=$("${TAO_SKILL_BANK_PATH:?}/scripts/resolve_versions_key.py" wheels.tao_sdk_slurm)
python -c "import tao_sdk" 2>/dev/null || {
  echo "MISSING: nvidia-tao-sdk not installed. Run:"
  echo "  pip install \"$PIN\""
  exit 1
}
```

If a check fails, the agent prompts the user to authorize the install/fix via Bash.
A third preflight step applies only for **private `nvcr.io` images**: Pyxis on
the compute nodes needs persistent enroot credentials in
`~/.config/enroot/.credentials` on the cluster (it does NOT read `NGC_KEY` from
the job env). Without them, auth-gated pulls fail with "Could not process JSON
input" at job startup. This runs once per (cluster, user). See
`references/ssh-setup.md` for the full check and the `printf | ssh` install
pattern that keeps `NGC_KEY` out of history, files, and chat output. Skip it for
public images.

Prerequisites

Before any job is submitted, the host running the TAO service or SDK must log in
to at least one host from `SLURM_HOSTNAME` over SSH without an interactive
password prompt. The handler runs `sbatch`, `squeue`, `sacct`, `scancel`, and
log tails non-interactively, so password or 2FA prompts will fail the job at
submit or status time.

Set this up once per (host, login node, user) tuple: create an SSH keypair,
install the public key on each login host, trust the host key, lock private-key
permissions with `chmod 600`, and verify with `ssh -o BatchMode=yes ...`. See
`references/ssh-setup.md` for the full step-by-step (including the
`~/.ssh/config` alias, the container key-mount note, and the 2FA /
`SSH_AUTH_SOCK` fallback). The same file holds the SSH failure remediation
prompt to show the user when passwordless SSH fails.
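
The preflight check probes only the first entry of `SLURM_HOSTNAME`. The sketch below walks the whole comma-separated list and prints the first login host that accepts non-interactive SSH. It reuses the `ssh` options from the preflight; `probe_login_host` and `first_working_login_host` are hypothetical helper names for illustration, not part of the TAO SDK.

```bash
# Hypothetical helper: succeeds when user@host accepts key-based, non-interactive SSH.
probe_login_host() {
  ssh -o BatchMode=yes -o ConnectTimeout=10 "$1@$2" true 2>/dev/null
}

# Hypothetical helper: print the first reachable host from a comma-separated list.
# Runs in a subshell so the IFS and globbing changes do not leak to the caller.
first_working_login_host() (
  user=$1
  IFS=','
  set -f                                   # no globbing while splitting on commas
  for host in $2; do
    host=$(printf '%s' "$host" | tr -d '[:space:]')   # tolerate "a, b" lists
    [ -n "$host" ] || continue
    if probe_login_host "$user" "$host"; then
      printf '%s\n' "$host"
      exit 0
    fi
  done
  exit 1                                   # no host accepted the connection
)
```

A caller could then set `SLURM_HOST="$(first_working_login_host "$SLURM_USER" "$SLURM_HOSTNAME")"` and show the remediation prompt from `references/ssh-setup.md` when the function fails.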

Credentials

- SLURM_USER (required): SSH username for the login node. In microservices
  workspace metadata this is `cloud_specific_details.slurm_user`.
- SLURM_HOSTNAME (required): Comma-separated login hostnames for failover.
  Microservices schema stores this as the list field
  `cloud_specific_details.slurm_hostname`.
- SLURM_PARTITION (required): Partition list for GPU job submission. Ask
  for this in the mandatory SLURM intake list. The packaged default is
  `polar,polar3,polar4,grizzly`, which are treated as 4-hour queues.
- SSH_KEY_PATH (preferred and expected before launch): private key path for
  non-interactive public-key auth to the login node. If passwordless SSH fails,
  ask the user for `SSH_KEY_PATH=/path/to/private_key` and show the setup steps
  in `references/ssh-setup.md`; do not bury this behind several alternate choices.
- SSH_AUTH_SOCK (advanced fallback): SSH agent socket with an accepted key
  already loaded. Prefer `SSH_KEY_PATH` in user-facing remediation prompts.
- SLURM_BASE_RESULTS_DIR (optional): Base shared filesystem path. Default
  convention from `tao-core` is `/lustre/fsw/portfolios/edgeai/<your-dir>`,
  where `<your-dir>` is your per-user directory on the cluster.
- SLURM_ACCOUNT (usually required by site policy): Account charged by
  `#SBATCH --account`.

Do not ask for `SLURM_ACCOUNT` or `SLURM_BASE_RESULTS_DIR` in the initial
intake unless the user says their site requires an account, wants a custom
results root, or the workflow cannot proceed without overriding defaults.
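
As a rough illustration of the intake rule above, the sketch below reports which of the three required variables are still unset, so that only those are requested from the user. The function name `missing_required_slurm_vars` is hypothetical, and the optional variables are deliberately left out of the check.

```bash
# Hypothetical helper: print the required SLURM variables that are unset or empty.
# Returns 0 when nothing is missing, 1 otherwise. Runs in a subshell to keep
# its working variables out of the caller's environment.
missing_required_slurm_vars() (
  missing=""
  for name in SLURM_USER SLURM_HOSTNAME SLURM_PARTITION; do
    eval "value=\${$name:-}"               # indirect lookup of the variable named in $name
    [ -n "$value" ] || missing="${missing:+$missing }$name"
  done
  printf '%s\n' "$missing"
  [ -z "$missing" ]
)
```

For example, `missing_required_slurm_vars || echo "ask the user for the variables listed above"` keeps `SLURM_ACCOUNT` and `SLURM_BASE_RESULTS_DIR` out of the initial intake.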

Backend Details

Use `backend_details.backend_type = "slurm"` when routing a job to this
platform. Supported backend details from the microservices schema:

```json
{
  "backend_type": "slurm",
  "partition": "polar,polar3,polar4,grizzly",
  "cluster_name": "optional-name"
}
```

Runtime metadata is stored under `backend_details.slurm_metadata`, especially
`slurm_job_id` and `job_dir`. Do not invent these values. They are written
after `sbatch` returns a scheduler job id.
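
For anyone driving `sbatch` by hand, the job id has to be taken from the command's output rather than guessed. The sketch below handles the default `Submitted batch job <id>` line and the `jobid[;cluster]` form printed by `sbatch --parsable`. The function name `parse_sbatch_job_id` is hypothetical, and this is not necessarily how `tao-core` extracts the id.

```bash
# Hypothetical helper: extract the numeric job id from sbatch output.
# Prints the id and returns 0, or prints nothing and returns 1 when no id is found.
parse_sbatch_job_id() (
  out=$1
  case "$out" in
    "Submitted batch job "*) out=${out#"Submitted batch job "} ;;
  esac
  out=${out%%;*}        # drop ";cluster" from --parsable output
  out=${out%% *}        # drop any trailing text after the id
  case "$out" in
    ''|*[!0-9]*) exit 1 ;;   # empty or non-numeric: not a job id
  esac
  printf '%s\n' "$out"
)
```

Anything that does not parse to a number (for example an `sbatch: error:` line) is treated as a failed submission, so no `slurm_job_id` should be recorded for it.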

Storage

SLURM jobs run on the cluster, so local paths from the API host are not valid
dataset paths. Prefer shared filesystem URIs:
- Use `lustre:///absolute/path` for user-provided datasets on Lustre.
- `slurm://` paths may appear in microservices metadata and are converted to actual Lustre paths before the container starts.
- Avoid bare `/local/path` and `file://` dataset URIs for SLURM. Validation in `tao-core` rejects local and file paths for remote backends.

Accept either dataset roots or direct spec-key paths:
- Root mode: `/lustre/.../<model>/train`, which model skills map to required files such as
  `<root>/annotations.json` and `<root>` as media path.
- Direct spec mode: exact fields such as
  `custom.train_dataset.annotation_path=/lustre/.../train.json` and
  `custom.train_dataset.media_path=/lustre/.../videos.tar.gz`.

After passwordless SSH succeeds and before generating scripts, validate each
required dataset file/path from the login host:

```bash
ssh -o BatchMode=yes <SLURM_USER>@<working-login-host> \
  'test -e /lustre/.../annotations.json && test -e /lustre/.../media_or_archive'
```

If the remote `test -e` fails, stop and ask for corrected paths or for the data
to be staged onto shared cluster storage. Do not create runner scripts that will
fail inside the first training job.

Results default to:

```text
/lustre/fsw/portfolios/edgeai/<your-dir>/results/<job_id>
```

The runner sets `TAO_API_RESULTS_DIR` to the parent results directory because
container code appends the job id when writing status and artifacts.

Use Lustre, not S3, for SLURM job inputs. SLURM's scheduler enforces a GPU-idle timeout: a long `s3://` download at the top of the script can burn the allocation before training begins, and the scheduler may kill the job. Stage training data onto Lustre first; S3 / HF / NGC pre-fetch is fine only for small auxiliary inputs (checkpoints, configs). See `references/sdk-usage.md` for the full rationale.
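
The path rules above can be approximated with a small classifier that runs before any script is generated. This is an illustrative sketch only: the function name `classify_dataset_uri` and its three labels are hypothetical, treating a bare `/lustre/` prefix as shared storage is an assumption based on the default mount, and the authoritative validation remains the one in `tao-core`.

```bash
# Hypothetical helper: classify a dataset path or URI for a SLURM job.
#   shared       -> reachable from the cluster, usable as a training input
#   object-store -> only suitable for small auxiliary inputs (checkpoints, configs)
#   local        -> not visible to compute nodes; ask for a Lustre path instead
classify_dataset_uri() {
  case "$1" in
    lustre:///*|slurm://*|/lustre/*) echo "shared" ;;
    s3://*|hf_model://*|ngc://*)     echo "object-store" ;;
    *)                               echo "local" ;;
  esac
}
```

A `local` result is the point at which to stop and ask the user for corrected paths; a `shared` result still needs the remote `test -e` check, since the prefix alone does not prove the file exists.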
Container Execution

Steps performed by `tao-core`:
- Stage compact JSON files for specs, environment, and cloud metadata under
  `<job_dir>/specs`, `<job_dir>/env`, and `<job_dir>/meta`.
- Optionally convert the Docker image to a cached SQSH image with
  `srun -n1 -p <conversion_partition> enroot import`.
- Write an sbatch script under `<job_dir>/sbatch/job_<job_id>.sbatch`.
- Submit `sbatch --export=ALL <script>`.
- Run the container with `srun --container-image=<image> --container-mounts=/lustre`.

Image formats accepted by the handler:
- `/path/to/image.sqsh`
- `registry#image:tag`
- `docker://registry#image:tag`
- ordinary `registry/image:tag`, which is converted to Pyxis form when needed

SQSH conversion is cached by image name. For `:latest` images, cached SQSH is
used unless `force_reconvert_latest` is enabled.
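
To make the last image format concrete, the sketch below rewrites an ordinary `registry/image:tag` reference into the `registry#image:tag` form and leaves the other accepted formats untouched. The function name `to_pyxis_image` is hypothetical, and the rule for spotting a registry (first path component contains a dot or colon, or is `localhost`) is an assumption borrowed from common Docker naming, not the handler's actual logic.

```bash
# Hypothetical helper: convert registry/image:tag to Pyxis form registry#image:tag.
# SQSH paths and references that already contain '#' pass through unchanged.
to_pyxis_image() (
  image=$1
  case "$image" in
    *.sqsh|*'#'*) printf '%s\n' "$image"; exit 0 ;;
  esac
  image=${image#docker://}                 # tolerate a docker:// scheme without '#'
  case "$image" in
    */*) ;;                                # has a path separator, keep going
    *) printf '%s\n' "$image"; exit 0 ;;   # bare name such as ubuntu:22.04
  esac
  first=${image%%/*}                       # candidate registry host
  case "$first" in
    *.*|*:*|localhost) printf '%s#%s\n' "$first" "${image#*/}" ;;
    *)                 printf '%s\n' "$image" ;;
  esac
)
```

The image names used when trying this out are placeholders; substitute the container image the job actually needs.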

Resource Mapping

Defaults from `tao-core`:
- `num_nodes`: 1
- `num_gpus`: 4
- `max_num_gpus_per_node`: 8
- `cpus_per_task`: 16
- `time_hours`: 4
- `timeout_hours`: 3.8
- `max_time_hours`: 4
- `container_mounts`: `/lustre`
- `use_requeue`: true
- `use_sqsh`: true

When generating launchers or wrapper scripts for SLURM, set the wall-time
defaults explicitly from the packaged platform resource defaults:

```bash
export SLURM_TIME_HOURS="${SLURM_TIME_HOURS:-4}"
export SLURM_TIMEOUT_HOURS="${SLURM_TIMEOUT_HOURS:-3.8}"
```

Do not default to 12 hours on SLURM. If the user supplies a longer
`SLURM_TIME_HOURS`, verify that the selected partition supports it before
submitting. For the packaged default partition list
`polar,polar3,polar4,grizzly`, reject requests above 4 hours and ask for a
different partition only if the user actually wants a longer wall time.

When `num_gpus` is greater than or equal to `max_num_gpus_per_node`, the
handler treats the request as exclusive per node and computes additional nodes
from total GPU count when necessary.

For multi-node jobs (`num_nodes > 1`), the sbatch script exports `WORLD_SIZE`,
`MASTER_ADDR`, `MASTER_PORT`, `NODE_RANK`, and `NUM_GPU_PER_NODE`, and Cosmos-RL
has special multi-node role handling for controller, policy, and rollout
workers. See `references/multi-node.md` for the full sbatch directives, the
rendezvous env-var table and contract, and cluster requirements.
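
The two checks described above, wall time against the partition limit and node count from the GPU request, can be sketched as follows. Both function names are hypothetical, and the ceiling division is an assumed reading of "computes additional nodes from total GPU count"; the real computation lives in `tao-core` and may differ.

```bash
# Hypothetical helper: nodes needed for a GPU request,
# computed as the ceiling of num_gpus / max_num_gpus_per_node.
nodes_for_gpus() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

# Hypothetical helper: succeeds when the requested wall time in hours
# (fractional values such as 3.8 are allowed) does not exceed the limit.
time_within_limit() {
  awk -v h="$1" -v m="$2" 'BEGIN { exit !(h + 0 <= m + 0) }'
}
```

With the packaged defaults, `nodes_for_gpus 4 8` prints `1` and `nodes_for_gpus 16 8` prints `2`, while `time_within_limit 12 4` fails, which is the cue to reject the request or ask for a different partition.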

Monitoring

- Scheduler status comes from the stored SLURM job id via `squeue` or `sacct`.
- TAO terminal status comes from `status.json` in the shared results folder.
- If the user enabled chat monitoring, continue polling at the requested
  interval while the job is `PENDING`, `RUNNING`, or otherwise non-terminal. Do not stop after a fixed elapsed time such as 30 minutes; long queue waits are normal on shared GPU partitions.
- Do not send a final response for a non-terminal SLURM job when chat monitoring is enabled. A final response is a detach action; use it only if the user asked to detach/stop or the job reached terminal state.
- Logs are read over SSH from:

```text
<job_dir>/slurm-logs/<slurm_job_name>-<slurm_job_id>/main.out
<job_dir>/slurm-logs/<slurm_job_name>-<slurm_job_id>/main.err
```

Status mapping:
- `PENDING` -> `Pending`
- `RUNNING` or `COMPLETING` -> `Running`
- `COMPLETED` -> check `status.json`
- `FAILED`, `BOOT_FAIL`, `DEADLINE`, `OUT_OF_MEMORY`, `NODE_FAIL` -> retry if logs match retriable infrastructure patterns, otherwise `Error`
- `CANCELLED`, `PREEMPTED`, `REVOKED` -> `Canceled`
- `TIMEOUT` -> `Error`
- `SUSPENDED`, `STOPPED` -> `Paused`
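
The status mapping above translates directly into a `case` statement. In this sketch the function name `map_slurm_state`, the `Unknown` fallback, and the literal strings returned for the "check `status.json`" and "retry or `Error`" rows are illustrative choices; the log-pattern check that decides between retry and `Error` is not implemented here.

```bash
# Hypothetical helper: map a SLURM job state to the TAO-side status described above.
map_slurm_state() (
  state=${1%% *}        # sacct may append detail, e.g. "CANCELLED by 1234"
  state=${state%+}      # sacct marks truncated states with a trailing "+"
  case "$state" in
    PENDING)                                           echo "Pending" ;;
    RUNNING|COMPLETING)                                echo "Running" ;;
    COMPLETED)                                         echo "check status.json" ;;
    FAILED|BOOT_FAIL|DEADLINE|OUT_OF_MEMORY|NODE_FAIL) echo "retry or Error" ;;
    CANCELLED|PREEMPTED|REVOKED)                       echo "Canceled" ;;
    TIMEOUT)                                           echo "Error" ;;
    SUSPENDED|STOPPED)                                 echo "Paused" ;;
    *)                                                 echo "Unknown" ;;  # assumption: not in the table
  esac
)
```

Only `Canceled`, `Error`, and a terminal result read from `status.json` should end chat monitoring; `Pending`, `Running`, and `Paused` mean polling continues.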
Cancellation

Cancel by looking up `backend_details.slurm_metadata.slurm_job_id` and running
`scancel <slurm_job_id>` over SSH. Treat missing or already terminated SLURM
jobs as successful cancellation.
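
One way to get the "missing or already terminated counts as success" behavior is to ask `squeue` first and call `scancel` only when the job is still queued or running. The sketch below assumes `SLURM_USER` and `SLURM_HOST` are set as in the preflight; `run_on_login` and `cancel_slurm_job` are hypothetical helper names, and the handler's own implementation may work differently.

```bash
# Hypothetical helper: run a command on the login node without prompting.
run_on_login() {
  ssh -o BatchMode=yes "${SLURM_USER}@${SLURM_HOST}" "$@"
}

# Hypothetical helper: cancel a job, treating an absent job as already cancelled.
cancel_slurm_job() (
  job_id=$1
  # squeue -h suppresses the header; -o %T prints only the job state.
  # An unknown or finished job yields an error or empty output.
  state=$(run_on_login squeue -h -j "$job_id" -o %T 2>/dev/null) || state=""
  if [ -z "$state" ]; then
    echo "job $job_id not in queue; treating as cancelled"
    exit 0
  fi
  run_on_login scancel "$job_id"
)
```

Checking the queue first avoids interpreting `scancel` error text, at the cost of one extra SSH round trip.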

Multi-node training (distributed)

SLURM is the platform of choice for large multi-node runs: pass
`num_nodes > 1` and the SDK handles the sbatch directives and PyTorch-distributed
env vars automatically. See `references/multi-node.md` for a worked `create_job`
example, the generated sbatch directives, the rendezvous env-var table
(`WORLD_SIZE`, `NUM_GPU_PER_NODE`, `NODE_RANK`, `MASTER_ADDR`, `MASTER_PORT`),
the Cosmos-RL role note, cluster requirements (Pyxis/Enroot, InfiniBand/NVLink,
Lustre), and upstream reference links.

Running via the TAO SDK

The SDK install is covered in Preflight: `pip install 'nvidia-tao-sdk[slurm]'`.
Use it when you want Job handles, the sbatch/`squeue`/`sacct` plumbing handled
for you, run-folder durability via `ActionWorkflow`, or convenient cloud-storage
I/O (`s3://`, `hf_model://`, `ngc://`). Without the SDK, drive `sbatch` and
`srun` yourself.

Auto-retry is fully automatic: a background monitor polls `squeue`/`sacct`
and re-`sbatch`'s the staged script on infrastructure-looking failures up to
`MAX_JOB_RETRIES = 10`, while plain training failures surface immediately. In
addition, `#SBATCH --requeue` is set by default (`SLURM_USE_REQUEUE`, defaults
to `true`). See `references/sdk-usage.md` for the `SlurmSDK` /
`build_entrypoint` code example, the Lustre-not-S3 rule, the retriable-failure
classification, and the full auto-retry and requeue behavior.
squeuesacctsbatchMAX_JOB_RETRIES = 10#SBATCH --requeueSLURM_USE_REQUEUEtruereferences/sdk-usage.mdSlurmSDKbuild_entrypointSDK安装在预检部分已介绍——。当您需要Job处理、自动处理sbatch//流程、通过实现运行文件夹持久性,或便捷的云存储I/O(、、)时,请使用它。如果不使用SDK,则需要自行操作和。
pip install 'nvidia-tao-sdk[slurm]'squeuesacctActionWorkflows3://hf_model://ngc://sbatchsrun自动重试完全自动化:后台监控程序会轮询/,并在出现基础设施类故障时重新提交暂存的脚本,最多重试次,而普通训练失败会立即显示。此外,默认设置(,默认值为)。有关/代码示例、Lustre而非S3规则、可重试故障分类以及完整的自动重试和重新排队行为,请查看。
squeuesacctMAX_JOB_RETRIES = 10#SBATCH --requeueSLURM_USE_REQUEUEtrueSlurmSDKbuild_entrypointreferences/sdk-usage.mdFailure Modes
故障模式
Common failures: SSH auth failure, local dataset path rejected, SQSH conversion
timeout, Pyxis/Enroot unavailable, and bad-node / transient GPU failures (which
the handler retries up to the configured limit). See
for the diagnosis and remediation of each.
references/troubleshooting.md常见故障:SSH认证失败、本地数据集路径被拒绝、SQSH转换超时、Pyxis/Enroot不可用,以及坏节点/临时GPU故障(处理器会根据配置的限制重试)。有关每种故障的诊断和修复方法,请查看。
references/troubleshooting.md