Submit or run an ML experiment on a compute environment (local, SLURM HPC, RunAI/Kubernetes). Use when the user wants to launch a training run, submit a job, run ablations, or execute an experiment script on any compute cluster.
```bash
npx skill4agent add a-green-hand-jack/ml-research-skills run-experiment
```

Generated job scripts live under `jobs/` in the project, and run logging pairs with `research-project-memory` (see below).

```
<installed-skill-dir>/
├── SKILL.md
├── environments.yaml        # Environment profiles (extend for new clusters)
└── templates/
    ├── slurm_job.sh         # SLURM template (Ibex, UW, any SLURM cluster)
    ├── runai_job.sh         # RunAI/Kubernetes template (EPFL)
    └── local_run.sh         # Local tmux/nohup template
```

Start by reading `<installed-skill-dir>/SKILL.md` and `<installed-skill-dir>/environments.yaml`. Ask the user for the run command (e.g. `uv run python train.py --lr 1e-3`), a short job name (e.g. `baseline-cifar10`, `ablation-no-attn`), the environment to activate (e.g. `.venv`), and an output directory (default `outputs/<job-name>/`); these can also be passed directly via `--env`, `--script`, `--name`, and `--gpus`. Determine the project root with `git rev-parse --show-toplevel 2>/dev/null || pwd` and the commit with `git rev-parse --short HEAD 2>/dev/null || echo "no-git"`.

For SLURM environments, fill `<installed-skill-dir>/templates/slurm_job.sh`, replacing each `{PLACEHOLDER}`:

| Placeholder | Value |
|---|---|
|  | project directory name |
|  | environment key (e.g. `ibex`) |
|  | display name from profile |
|  | today's date, YYYY-MM-DD |
|  | short git SHA |
|  | user-provided job name |
|  | filename of the generated script |
|  | from env profile defaults (or user override) |
|  | `cpus_per_task` from profile (or user override) |
|  | user-provided GPU count |
|  | from profile defaults (or user override) |
|  | user-provided or profile default |
|  | absolute project root path |
|  | user-provided env name |
|  | user-provided command |
|  | scratch path from env profile |
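For orientation, a filled SLURM script might come out roughly like this. This is a sketch with illustrative values borrowed from examples elsewhere in this document (job name, partition, resources, modules, run command); the real placeholder names and layout live in `templates/slurm_job.sh` and the chosen environment profile:

```bash
#!/bin/bash
#SBATCH --job-name=baseline-cifar10
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=4
#SBATCH --mem=32G
#SBATCH --time=12:00:00
#SBATCH --output=outputs/logs/baseline-cifar10/slurm-%j.out

module load cuda/12.1 python/3.11        # from the profile's common_modules
cd <project-root>                        # absolute project root
source .venv/bin/activate                # or: conda activate <env>

export GIT_COMMIT=$(git rev-parse --short HEAD 2>/dev/null || echo "no-git")
uv run python train.py --lr 1e-3         # the user's run command
```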
Add a `module load` line for each entry in `common_modules`, activate the environment (`conda activate` or `source .venv/bin/activate`), and write the finished script to `jobs/<job-name>.sh`.

For RunAI environments, fill `<installed-skill-dir>/templates/runai_job.sh`:

| Placeholder | Value |
|---|---|
|  | project directory name |
|  | today's date |
|  | short git SHA |
|  | user-provided job name |
|  | filename of generated script |
|  | from env profile |
|  | from env profile |
|  | GPU count |
|  | CPU count from profile defaults |
|  | memory from profile defaults |
|  | generated from … |
|  | user-provided command |
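The generated RunAI script is essentially a thin wrapper around `runai submit`. A minimal sketch, assuming the inputs above and a `bash -c` command override; the image name and project path here are hypothetical, the actual template is `templates/runai_job.sh`, and exact `runai submit` flags differ between Run:ai CLI versions, so check `runai submit --help`:

```bash
#!/bin/bash
# Illustrative only; the real values come from the env profile and user input.
JOB_NAME="baseline-cifar10"
IMAGE="registry.example.com/my-image:latest"   # hypothetical; taken from the env profile in practice
RUN_CMD="uv run python train.py --lr 1e-3"

runai submit "$JOB_NAME" \
  --image "$IMAGE" \
  --gpu 1 \
  -- bash -c "cd <project-root> && $RUN_CMD"

runai logs "$JOB_NAME" -f
```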
Write the filled RunAI script to `jobs/<job-name>-runai.sh`. For local runs, fill `<installed-skill-dir>/templates/local_run.sh` and write it to `jobs/<job-name>-local.sh`. All cluster-specific values come from the matching profile in `environments.yaml`; if the requested environment has no profile, extend `environments.yaml` (see below).

Create the working directories before submitting:

```bash
mkdir -p <project-root>/jobs
mkdir -p <project-root>/outputs/logs/<job-name>
mkdir -p <output-dir>
```

Then submit the generated `jobs/<job-name>.sh` (`-runai.sh` / `-local.sh`) for the chosen environment. SLURM:

```bash
# If you're already on the login node:
sbatch jobs/<job-name>.sh
# If submitting from your laptop (requires ssh access):
scp jobs/<job-name>.sh <ssh-alias>:<project-root>/jobs/
ssh <ssh-alias> "cd <project-root> && mkdir -p outputs/logs/<job-name> <output-dir> jobs && sbatch jobs/<job-name>.sh"
# Monitor:
squeue -u $USER
sacct -j <jobid> --format=JobID,State,Elapsed,AllocGRES
tail -f outputs/logs/<job-name>/slurm-<jobid>.out
```

RunAI:

```bash
bash jobs/<job-name>-runai.sh
# Monitor:
runai list
runai logs <job-name> -f
```

Local:

```bash
# Attached (output in terminal):
bash jobs/<job-name>-local.sh
# Detached in tmux:
tmux new-session -d -s <job-name> "bash jobs/<job-name>-local.sh"
tmux attach -t <job-name>
# Background with nohup:
nohup bash jobs/<job-name>-local.sh &
```

When the user is working from a laptop, `scp` the script to the cluster and `ssh` in to run `sbatch`, as in the SLURM example above. After submitting, record the run: append a row to `jobs/README.md` (or `jobs/index.md`) of the form

| {DATE} | {JOB_NAME} | {ENV_NAME} | {COMMIT} | {RUN_COMMAND_BRIEF} |

If the project was set up with `init-python-project` / `research-project-memory`, also create a run note at `docs/runs/<DATE>-<job-name>.md` and update the `memory/` files and `.agent/worktree-status.md`: give the run an `EXP-###` entry in `memory/evidence-board.md` and move its status planned → submitted → running, linking the `docs/runs/` note; mark the task doing in `memory/action-board.md`; refresh `memory/current-status.md` and `<worktree>/.agent/worktree-status.md`; and treat results as needs-verification until they are checked.

Available environment profiles in `environments.yaml`:

| Key | Type | Cluster | Notes |
|---|---|---|---|
| `local` | local | — | Current machine, tmux/nohup |
| `ibex` | slurm | KAUST Ibex | |
|  | slurm | UW HPC | Placeholder; update before use |
| `runai` | runai | EPFL RunAI | Kubernetes; update project/image in `environments.yaml` |
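To sanity-check what a profile will supply before generating a script, you can inspect it from the shell. A minimal sketch, assuming the Go `yq` binary is installed (it is not part of this skill) and that `ibex` is one of the profile keys:

```bash
# Print the defaults block of the ibex profile (partition, gpus, cpus_per_task, mem, walltime).
yq '.ibex.defaults' <installed-skill-dir>/environments.yaml
```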
To add a new cluster, add a profile to `<installed-skill-dir>/environments.yaml`:

```yaml
my-cluster:
  type: slurm                      # or runai / local
  display_name: "My University HPC"
  login_node: "login.cluster.edu"
  ssh_alias: mycluster
  scheduler: slurm
  partitions:
    gpu:
      name: gpu
      flag: "--partition=gpu"
      gpu_flag: "--gres=gpu:{count}"
      max_gpus_per_job: 4
  defaults:
    partition: gpu
    gpus: 1
    cpus_per_task: 4
    mem: "32G"
    walltime: "12:00:00"
  max_walltime: "48:00:00"
  storage:
    home: "/home/{user}"
    scratch: "/scratch/{user}"
  module_system: lmod
  common_modules:
    - "cuda/12.1"
    - "python/3.11"
  notes: "..."
```

Conventions: every generated script exports `GIT_COMMIT`, run artifacts go to `outputs/<job-name>/` and logs to `outputs/logs/<job-name>/`, and `jobs/` and `outputs/` belong in `.gitignore`.

Usage examples:

```
/run-experiment                      # interactive wizard
/run-experiment --env ibex --script train.py --gpus 2
/run-experiment --env local --script eval.py --name eval-baseline
/run-experiment --env runai --gpus 4 --name big-run
/run-experiment --env ibex --script sweep.py --name sweep --gpus 1
```

Sweeps and ablations: list the configurations in `configs/sweep.yaml` and use a SLURM job array (`#SBATCH --array=0-{N-1}%{max_concurrent}`); each array task runs the script with `--config configs/sweep.yaml --config-idx $SLURM_ARRAY_TASK_ID` and writes to `outputs/<job-name>/$SLURM_ARRAY_TASK_ID/` (a sketch follows at the end of this section).

Multi-GPU jobs: launch with `torchrun --nproc_per_node={GPUS}` or `srun python -m torch.distributed.launch`, adjusting `--ntasks-per-node={GPUS}` in the SBATCH header as needed.

Interactive sessions:

```bash
# SLURM:
srun --partition=gpu --gres=gpu:1 --cpus-per-task=4 --mem=32G --time=2:00:00 --pty bash

# RunAI:
runai submit <name> --image <image> --gpu 1 --interactive --stdin -- bash
runai bash <name>
```
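For orientation, the array-job variant of the SLURM script might look roughly like this. This is a sketch with made-up counts (16 configs, at most 4 concurrent); the real script still comes from `templates/slurm_job.sh`, with the sweep flags described above:

```bash
#!/bin/bash
#SBATCH --job-name=sweep
#SBATCH --array=0-15%4            # 16 configs, at most 4 running at once
#SBATCH --gres=gpu:1
#SBATCH --output=outputs/logs/sweep/slurm-%A_%a.out

# Each array task gets its own output directory keyed by $SLURM_ARRAY_TASK_ID.
mkdir -p outputs/sweep/$SLURM_ARRAY_TASK_ID
uv run python sweep.py \
  --config configs/sweep.yaml \
  --config-idx $SLURM_ARRAY_TASK_ID
```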