Search: slurm - AI Agent Skills

AI & Machine Learning | nvidia/skills

nemo-mbridge-multi-node-slurm

Convert single-node scripts to multi-node Slurm sbatch jobs and debug common multi-node failures. Covers srun-native vs uv run torch.distributed approaches, container setup, NCCL timeouts, OOM sizing for MoE models, and interactive allocation.

13

AI & Machine Learning | nvidia/skills

mcore-run-on-slurm

How to launch distributed Megatron-LM training jobs on a SLURM cluster. Covers a minimal sbatch skeleton, environment-variable setup for torch.distributed.run, CUDA_DEVICE_MAX_CONNECTIONS rules across hardware and parallelism modes, container conventions, monitoring, and per-rank failure diagnosis.

11

AI & Machine Learning | nvidia/skills

run-on-slurm

How to launch distributed Megatron-LM training jobs on a SLURM cluster. Covers a minimal sbatch skeleton, environment-variable setup for torch.distributed.run, CUDA_DEVICE_MAX_CONNECTIONS rules across hardware and parallelism modes, container conventions, monitoring, and per-rank failure diagnosis.

10

AI & Machine Learning | nvidia/skills

exec-slurm-compile

Compile TensorRT-LLM on a SLURM cluster. Covers submitting a batch job with a container image, monitoring the job, and verifying the build. Use when the user wants to compile TRT-LLM remotely via SLURM rather than on a local compute node.

9

4 scripts/Checked

DevOps & Cloud Services | nvidia/skills

tao-run-on-slurm

Remote SLURM GPU cluster execution over SSH with sbatch/srun, Pyxis/Enroot containers, and Lustre-backed results. Use when running TAO training/eval/inference jobs on an on-prem or DGX SLURM cluster. Trigger phrases include "run on SLURM", "submit sbatch", "DGX SLURM cluster", "Pyxis/Enroot container", "Lustre dataset".

9

AI & Machine Learning | nvidia/skills

multi-node-slurm

Convert single-node scripts to multi-node Slurm sbatch jobs and debug common multi-node failures. Covers srun-native vs uv run torch.distributed approaches, container setup, NCCL timeouts, OOM sizing for MoE models, and interactive allocation.

7

DevOps & Cloud Services | awslabs/agent-plugins

hyperpod-issue-report

Generate comprehensive issue reports from HyperPod clusters (EKS and Slurm) by collecting diagnostic logs and configurations for troubleshooting and AWS Support cases. Use when users need to collect diagnostics from HyperPod cluster nodes, generate issue reports for AWS Support, investigate node failures or performance problems, document cluster state, or create diagnostic snapshots. Triggers on requests involving issue reports, diagnostic collection, support case preparation, or cluster troubleshooting that requires gathering logs and system information from multiple nodes.

61

1 scripts/Attention

DevOps & Cloud Services | togethercomputer/skills

together-gpu-clusters

On-demand and reserved GPU clusters (H100, H200, B200) on Together AI with Kubernetes or Slurm orchestration, shared storage, credential management, and cluster scaling for ML and HPC jobs. Reach for it when the user needs multi-node compute or infrastructure control rather than a managed model endpoint.

24

3 scripts/Checked

AI & Machine Learning | nvidia/skills

nemo-automodel-launcher-config

Configure NeMo AutoModel job launches for interactive runs, Slurm clusters, and SkyPilot cloud execution.

16

AI & Machine Learning | kiterlin/intelligent-dete...

nemo-evaluator-sdk

Evaluates LLMs across 100+ benchmarks from 18+ harnesses (MMLU, HumanEval, GSM8K, safety, VLM) with multi-backend execution. Use when you need scalable evaluation on local Docker, Slurm HPC, or cloud platforms. It is NVIDIA's enterprise-grade platform, with a container-first architecture for reproducible benchmarking.

12

DevOps & Cloud Services | skypilot-org/skypilot

skypilot

Use when launching cloud VMs, Kubernetes pods, or Slurm jobs for GPU/TPU/CPU workloads, training or fine-tuning models on cloud GPUs, deploying inference servers (vllm, TGI, etc.) with autoscaling, writing or debugging SkyPilot task YAML files, using spot/preemptible instances for cost savings, comparing GPU prices across clouds, managing compute across 25+ clouds, Kubernetes, Slurm, and on-prem clusters with failover between them, troubleshooting resource availability or SkyPilot errors, or optimizing cost and GPU availability.

12

AI & Machine Learning | nvidia/skills

nemo-gym-debugging

Use when debugging a NeMo Gym run or reward profiling job. Covers rollout collection failures, empty or partial JSONL outputs, stale materialized inputs, verifier/schema errors, Ray or Slurm issues, vLLM readiness, judge failures, tool/sandbox failures, cache problems, and throughput bottlenecks.

11

1 scripts/Checked