# run-experiment
Submit an ML experiment to a compute environment — local machine, SLURM HPC (Ibex, UW, etc.), or RunAI/Kubernetes (EPFL). Generates a reproducible job script in `jobs/` that is committed alongside the code, then provides the exact submit command to run.

Pair this skill with `research-project-memory` when a launched job should be linked to a planned experiment, evidence item, worktree, or project action.
## Skill Directory Layout

```text
<installed-skill-dir>/
├── SKILL.md
├── environments.yaml      # Environment profiles (extend for new clusters)
└── templates/
    ├── slurm_job.sh       # SLURM template (Ibex, UW, any SLURM cluster)
    ├── runai_job.sh       # RunAI/Kubernetes template (EPFL)
    └── local_run.sh       # Local tmux/nohup template
```

## Steps to Follow
### 1. Read the environment registry

Resolve `<installed-skill-dir>` as the directory containing this `SKILL.md`, then read `<installed-skill-dir>/environments.yaml`. List the available environments to the user with a one-line description each.
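The listing step can be sketched in shell. This assumes top-level profile keys sit at column 0 and each profile carries a `display_name` field, as in the example profile shown later in this document; the `/tmp` fixture is illustrative:

```shell
# Toy environments.yaml fixture, then print "key: display name" per profile.
cat > /tmp/environments.yaml <<'EOF'
local:
  type: local
  display_name: "Current machine (tmux/nohup)"
ibex:
  type: slurm
  display_name: "KAUST Ibex"
EOF

env_list=$(awk '
  /^[A-Za-z0-9_-]+:/ { key = $1; sub(/:$/, "", key) }
  /display_name:/    {
    name = $0
    sub(/.*display_name:[ ]*"/, "", name)
    sub(/".*/, "", name)
    printf "%s: %s\n", key, name
  }
' /tmp/environments.yaml)
echo "$env_list"
```

A real YAML parser is safer for nested files; the awk sketch only covers the flat two-level layout used here.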
### 2. Ask for experiment details

Ask the user in a single message:

- Environment: Which compute env? (show available choices from `environments.yaml` + "other")
- Script / command: What command to run? (e.g., `uv run python train.py --lr 1e-3`)
- Job name: Short identifier (e.g., `baseline-cifar10`, `ablation-no-attn`). Default: script basename + date.
- GPU count: How many GPUs? (default from env profile, 0 for CPU-only)
- Walltime / time limit: (SLURM only) How long? (default from env profile)
- Conda env or venv: Name of the conda environment, or `.venv` path (if applicable)
- Output directory: Where to save checkpoints/results? (default: `outputs/<job-name>/`)
- Anything special?: Extra env vars, array job, specific GPU type, PVC mounts (RunAI), etc.

If `--env`, `--script`, `--name`, or `--gpus` were passed as arguments, pre-fill those answers.
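The default derivation above (job name = script basename + date) can be sketched as follows; `script_cmd` and the variable names are illustrative, not part of the skill's interface:

```shell
# Illustrative: derive the default job name and output directory.
script_cmd="uv run python train.py --lr 1e-3"

# Pull the first *.py token out of the command and strip its extension.
script_file=$(echo "$script_cmd" | tr ' ' '\n' | grep '\.py$' | head -n 1)
base=$(basename "$script_file" .py)

job_name="${base}-$(date +%Y-%m-%d)"
output_dir="outputs/${job_name}/"
echo "$job_name"       # e.g. train-2025-01-30
echo "$output_dir"
```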
### 3. Locate the project root

```bash
git rev-parse --show-toplevel 2>/dev/null || pwd
```

Also capture the short git commit hash:

```bash
git rev-parse --short HEAD 2>/dev/null || echo "no-git"
```

### 4. Generate the job script
Based on the environment type:
#### type: slurm (ibex, uw, or any SLURM cluster)
Read the SLURM template from `<installed-skill-dir>/templates/slurm_job.sh`. Fill in all `{PLACEHOLDER}` variables:

| Placeholder | Value |
|---|---|
| | project directory name |
| | environment key (e.g., |
| | display name from profile |
| | today's date YYYY-MM-DD |
| | short git SHA |
| | user-provided job name |
| | filename of the generated script |
| | from env profile defaults (or user override) |
| | cpus_per_task from profile (or user override) |
| | user-provided GPU count |
| | from profile defaults (or user override) |
| | user-provided or profile default |
| | |
| | |
| | absolute project root path |
| | user-provided env name |
| | user-provided command |
| | scratch path from env profile |

Uncomment the relevant `module load` lines based on the env profile's `common_modules`. Uncomment the `conda activate` or `source .venv/activate` line based on the user's answer. If a scratch path is in the env profile, uncomment the TMPDIR block.

Output path: `jobs/<job-name>.sh`
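One way to fill the template is plain `sed`. The placeholder names `{JOB_NAME}` and `{GPUS}` here are assumptions for illustration; the authoritative names live in `slurm_job.sh` itself:

```shell
# Assumed placeholder names for illustration; use the ones the template defines.
cat > /tmp/slurm_job.tpl <<'EOF'
#SBATCH --job-name={JOB_NAME}
#SBATCH --gres=gpu:{GPUS}
EOF

job_name="baseline-cifar10"
gpus=2

sed -e "s/{JOB_NAME}/${job_name}/g" \
    -e "s/{GPUS}/${gpus}/g" \
    /tmp/slurm_job.tpl > /tmp/job.sh
cat /tmp/job.sh
```

`envsubst` or a templating tool works equally well; `sed` keeps the dependency footprint at zero, at the cost of escaping care if values contain `/`.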
#### type: runai (runai profile)
Read the RunAI template from `<installed-skill-dir>/templates/runai_job.sh`. Fill in placeholders:

| Placeholder | Value |
|---|---|
| | project directory name |
| | today's date |
| | short git SHA |
| | user-provided job name |
| | filename of generated script |
| | from env profile |
| | from env profile |
| | GPU count |
| | CPU count from profile defaults |
| | memory from profile defaults |
| | generated from |
| | user-provided command |

Output path: `jobs/<job-name>-runai.sh`
#### type: local
Read the local template from `<installed-skill-dir>/templates/local_run.sh`. Fill in placeholders similarly. Uncomment conda/venv activation as appropriate.

Output path: `jobs/<job-name>-local.sh`
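As a rough sketch of what a filled-in local script can look like under the reproducibility conventions in this document (`GIT_COMMIT` env var, timestamped log, exit-code propagation). The contents are illustrative, not the actual `local_run.sh` template:

```shell
# Hypothetical filled-in jobs/eval-baseline-local.sh (illustrative contents).
cat > /tmp/eval-baseline-local.sh <<'EOF'
#!/usr/bin/env bash
set -uo pipefail
export GIT_COMMIT="abc1234"          # short SHA baked in at generation time
LOG="outputs/logs/eval-baseline/run-$(date +%Y%m%d-%H%M%S).log"
mkdir -p "$(dirname "$LOG")" outputs/eval-baseline
# source .venv/bin/activate          # uncomment for a venv
# conda activate myenv               # uncomment for conda
echo "placeholder run" 2>&1 | tee "$LOG"
exit "${PIPESTATUS[0]}"              # propagate the command's exit code, not tee's
EOF
chmod +x /tmp/eval-baseline-local.sh
bash -n /tmp/eval-baseline-local.sh  # syntax check only; does not execute it
```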
#### type: other / unknown
If the user specifies an environment not in `environments.yaml`:

- Ask: "What scheduler does it use? (slurm / runai / other)"
- If SLURM-compatible: use the SLURM template with the info the user provides.
- If truly novel: generate a minimal generic wrapper and explain what to fill in.
- Suggest: "Want me to add this environment to `environments.yaml` for future use?"
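For the truly novel case, a minimal generic wrapper might be no more than the following sketch; the `PROJECT_ROOT` and `JOB_NAME` variables and the `your-command-here` slot are placeholders for the user to fill in, and scheduler-specific header lines go at the top:

```shell
# Minimal generic wrapper for an unknown scheduler (illustrative).
cat > /tmp/generic-wrapper.sh <<'EOF'
#!/usr/bin/env bash
set -uo pipefail
cd "${PROJECT_ROOT:?set PROJECT_ROOT to the repo root}"
export GIT_COMMIT="${GIT_COMMIT:-unknown}"
mkdir -p "outputs/logs/${JOB_NAME:?set JOB_NAME}"
your-command-here 2>&1 | tee "outputs/logs/${JOB_NAME}/run-$(date +%s).log"
exit "${PIPESTATUS[0]}"
EOF
bash -n /tmp/generic-wrapper.sh  # syntax check without executing
```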
### 5. Write the job script and preview

Create the job script directory, log directory, and output directory before previewing or submitting:

```bash
mkdir -p <project-root>/jobs
mkdir -p <project-root>/outputs/logs/<job-name>
mkdir -p <output-dir>
```

Write the filled-in script to `jobs/<job-name>.sh` (or `-runai.sh` / `-local.sh`). Show the user the full generated script for review.
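Before showing the script, a cheap sanity check catches templating mistakes; this sketch uses a throwaway `/tmp` file to stand in for the generated script:

```shell
# Throwaway stand-in for the generated script, with one slot left unfilled.
cat > /tmp/demo-job.sh <<'EOF'
#!/bin/bash
echo "hello from {JOB_NAME}"
EOF

bash -n /tmp/demo-job.sh && echo "syntax OK"
# Flag any {PLACEHOLDER} slots the fill step missed.
if grep -n '{[A-Z_]*}' /tmp/demo-job.sh; then
  echo "WARNING: unfilled placeholders remain"
fi
```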
### 6. Show the submit command and ask to launch

Print the exact command(s) to submit, tailored to the environment:

**SLURM (Ibex / UW / etc.)**

If you're already on the login node:

```bash
sbatch jobs/<job-name>.sh
```

If submitting from your laptop (requires ssh access):

```bash
scp jobs/<job-name>.sh <ssh-alias>:<project-root>/jobs/
ssh <ssh-alias> "cd <project-root> && mkdir -p outputs/logs/<job-name> <output-dir> jobs && sbatch jobs/<job-name>.sh"
```

Monitor:

```bash
squeue -u $USER
sacct -j <jobid> --format=JobID,State,Elapsed,AllocGRES
tail -f outputs/logs/<job-name>/slurm-<jobid>.out
```

**RunAI**

```bash
bash jobs/<job-name>-runai.sh
```

Monitor:

```bash
runai list
runai logs <job-name> -f
```

**Local**

Attached (output in terminal):

```bash
bash jobs/<job-name>-local.sh
```

Detached in tmux:

```bash
tmux new-session -d -s <job-name> "bash jobs/<job-name>-local.sh"
tmux attach -t <job-name>
```

Background with nohup:

```bash
nohup bash jobs/<job-name>-local.sh &
```

Ask: **"Want me to run the submit command now?"**

- If yes and local: run it directly.
- If yes and remote: run the `scp` + `ssh sbatch` command (requires ssh key auth to be set up).
- If no: remind the user that the script is saved in `jobs/` and ready to submit.

### 7. Offer to add to jobs index (optional)
If a `jobs/README.md` or `jobs/index.md` exists, offer to append a one-line entry:

```text
| {DATE} | {JOB_NAME} | {ENV_NAME} | {COMMIT} | {RUN_COMMAND_BRIEF} |
```

If the repo follows the code evidence layout from `init-python-project`, also offer to create or update a short run pointer under:

```text
docs/runs/<DATE>-<job-name>.md
```

This file should contain the command, config, commit, output path, expected metric, and monitor command. It should not contain raw logs.
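Creating the run pointer can be a single heredoc; every field value below is illustrative, and the `/tmp` prefix stands in for the repo root:

```shell
# Write a run pointer file; all field values are placeholders.
job_name="baseline-cifar10"
run_date=$(date +%Y-%m-%d)
mkdir -p /tmp/docs/runs
cat > "/tmp/docs/runs/${run_date}-${job_name}.md" <<EOF
# ${job_name} (${run_date})

- command: uv run python train.py --lr 1e-3
- commit: abc1234
- output: outputs/${job_name}/
- expected metric: val accuracy above baseline
- monitor: tail -f outputs/logs/${job_name}/slurm-<jobid>.out
EOF
```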
### 8. Update project memory when present
If the repo has `memory/` or a worktree `.agent/worktree-status.md`, update only verified run pointers:

- `memory/evidence-board.md`: add or update the linked `EXP-###` with job script path, commit, command, output directory, and status `planned`, `submitted`, or `running` only if verified
- `docs/runs/`: write a small run pointer when the code repo uses that convention
- `memory/action-board.md`: mark the launch action as `doing` or create a monitor action
- `memory/current-status.md`: record the latest known job and what must be checked next
- `<worktree>/.agent/worktree-status.md`: link the run to the worktree purpose and exit condition

Do not store queue state, job success, or final metric values as durable facts unless they were verified in this session. Use `needs-verification` for monitor tasks.
## Environment Reference
All environments are defined in `environments.yaml`. The current known environments:

| Key | Type | Cluster | Notes |
|---|---|---|---|
| | local | — | Current machine, tmux/nohup |
| | slurm | KAUST Ibex | |
| | slurm | UW HPC | Placeholder — update |
| | runai | EPFL RunAI | Kubernetes; update project/image in |
## Adding a New Environment
Edit `<installed-skill-dir>/environments.yaml` and add a block:

```yaml
my-cluster:
  type: slurm  # or runai / local
  display_name: "My University HPC"
  login_node: "login.cluster.edu"
  ssh_alias: mycluster
  scheduler: slurm
  partitions:
    gpu:
      name: gpu
      flag: "--partition=gpu"
      gpu_flag: "--gres=gpu:{count}"
      max_gpus_per_job: 4
  defaults:
    partition: gpu
    gpus: 1
    cpus_per_task: 4
    mem: "32G"
    walltime: "12:00:00"
    max_walltime: "48:00:00"
  storage:
    home: "/home/{user}"
    scratch: "/scratch/{user}"
  module_system: lmod
  common_modules:
    - "cuda/12.1"
    - "python/3.11"
  notes: "..."
```

## Reproducibility Conventions
Every generated job script includes:

- Git commit hash in the header and as an env var (`GIT_COMMIT`)
- Structured output directory: `outputs/<job-name>/` for checkpoints, `outputs/logs/<job-name>/` for logs
- Timestamped log files so reruns don't overwrite
- Exit code propagation so job arrays and downstream scripts detect failures

The `jobs/` directory should be committed to git (the scripts are small text files). Actual outputs go to `outputs/`, which is typically `.gitignore`d.
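The timestamped-log and exit-code conventions can be sketched as follows (`PIPESTATUS` is bash-specific; the failing inner command stands in for a training run):

```shell
# Timestamped log file plus exit-code propagation through tee.
log="/tmp/run-$(date +%Y%m%d-%H%M%S).log"
bash -c 'echo training step; exit 3' 2>&1 | tee "$log"
status=${PIPESTATUS[0]}   # exit code of the command, not of tee
echo "job exited with $status"
```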
## Example Invocations
```text
/run-experiment                                                    # interactive wizard
/run-experiment --env ibex --script train.py --gpus 2
/run-experiment --env local --script eval.py --name eval-baseline
/run-experiment --env runai --gpus 4 --name big-run
/run-experiment --env ibex --script sweep.py --name sweep --gpus 1
```

## Common Patterns
### Job Array (SLURM) — hyperparameter sweep

When the user says "I want to sweep over N configs":

- Ask for the sweep configs or config file (e.g., `configs/sweep.yaml` with N entries).
- Add `#SBATCH --array=0-{N-1}%{max_concurrent}` to the script.
- Add `--config configs/sweep.yaml --config-idx $SLURM_ARRAY_TASK_ID` to the run command.
- Output dir: `outputs/<job-name>/$SLURM_ARRAY_TASK_ID/`
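Rendered into a generated script, those bullets might look like this sketch, with N = 8 and at most 4 concurrent tasks (the job name `sweep` is illustrative):

```shell
# Array-job fragment; written to a temp file and syntax-checked only.
cat > /tmp/sweep-job.sh <<'EOF'
#!/bin/bash
#SBATCH --array=0-7%4
mkdir -p "outputs/sweep/${SLURM_ARRAY_TASK_ID}"
uv run python train.py --config configs/sweep.yaml --config-idx "$SLURM_ARRAY_TASK_ID"
EOF
bash -n /tmp/sweep-job.sh
```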
### Multi-GPU (DDP)

When GPUs > 1 and the env is SLURM:

- Add the `--ntasks-per-node={GPUS}` directive
- Wrap the command with `torchrun --nproc_per_node={GPUS}` or `srun python -m torch.distributed.launch`
- Ask the user which distributed launcher they use
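A sketch of the `torchrun` variant for 4 GPUs, with the directives from the bullets above filled in (`train.py` is illustrative):

```shell
# DDP fragment; ${GPUS} expands at generation time via the unquoted heredoc.
GPUS=4
cat > /tmp/ddp-job.sh <<EOF
#!/bin/bash
#SBATCH --ntasks-per-node=${GPUS}
#SBATCH --gres=gpu:${GPUS}
torchrun --nproc_per_node=${GPUS} train.py
EOF
bash -n /tmp/ddp-job.sh
```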
### Interactive Session (Debugging)

When the user wants to debug interactively (not submit a batch job):

Ibex:

```bash
srun --partition=gpu --gres=gpu:1 --cpus-per-task=4 --mem=32G --time=2:00:00 --pty bash
```

RunAI:

```bash
runai submit <name> --image <image> --gpu 1 --interactive --stdin -- bash
runai bash <name>
```

Generate this command directly without creating a script file.