Loading...
Loading...
Found 20 Skills
High-level PyTorch framework with Trainer class, automatic distributed training (DDP/FSDP/DeepSpeed), callbacks system, and minimal boilerplate. Scales from laptop to supercomputer with same code. Use when you want clean training loops with built-in best practices.
How to launch distributed Megatron-LM training jobs on a SLURM cluster. Covers a minimal sbatch skeleton, environment-variable setup for torch.distributed.run, CUDA_DEVICE_MAX_CONNECTIONS rules across hardware and parallelism modes, container conventions, monitoring, and per-rank failure diagnosis.
Use when "training LLM", "finetuning", "RLHF", "distributed training", "DeepSpeed", "Accelerate", "PyTorch Lightning", "Ray Train", "TRL", "Unsloth", "LoRA training", "flash attention", "gradient checkpointing"
Nsight Systems (nsys) CLI for system-level timeline profiling. Use when the user wants to run nsys profile, analyze .nsys-rep reports, use nsys stats/analyze/recipe commands, diagnose GPU idle time from timeline traces, or profile distributed training with NCCL overlap analysis. NOT for kernel-level metrics like SOL%, occupancy, or roofline (use perf-nsight-compute-analysis for ncu). NOT for writing or generating kernels. NOT for applying optimizations like CUDA Graphs.
Operational guide for enabling Megatron FSDP in Megatron-Bridge, including config knobs, code anchors, pitfalls, and verification.
Operational guide for enabling Megatron FSDP in Megatron-Bridge, including config knobs, code anchors, pitfalls, and verification.
DGX Cloud Lepton managed GPU compute platform with run/status/cancel interface. Use when submitting TAO jobs to DGX Cloud, dispatching training/eval/inference to Lepton GPU resources, or managing Lepton workspace deployments. Trigger phrases include "run on Lepton", "submit to DGX Cloud", "Lepton job", "managed GPU on DGX Cloud".
How to launch distributed Megatron-LM training jobs on a SLURM cluster. Covers a minimal sbatch skeleton, environment-variable setup for torch.distributed.run, CUDA_DEVICE_MAX_CONNECTIONS rules across hardware and parallelism modes, container conventions, monitoring, and per-rank failure diagnosis.