<installed-skill-dir>/
├── SKILL.md
├── environments.yaml   # Environment profiles (extend for new clusters)
└── templates/
    ├── slurm_job.sh    # SLURM template (Ibex, UW, any SLURM cluster)
    ├── runai_job.sh    # RunAI/Kubernetes template (EPFL)
    └── local_run.sh    # Local tmux/nohup template

- Skill files: `<installed-skill-dir>`, `SKILL.md`, `<installed-skill-dir>/environments.yaml`
- Example run command: `uv run python train.py --lr 1e-3`
- Example job names: `baseline-cifar10`, `ablation-no-attn`
- Python environment: `.venv`
- Output directory: `outputs/<job-name>/`
- Options: `--env`, `--script`, `--name`, `--gpus`
- Project root: `git rev-parse --show-toplevel 2>/dev/null || pwd`
- Git commit: `git rev-parse --short HEAD 2>/dev/null || echo "no-git"`
- SLURM template: `<installed-skill-dir>/templates/slurm_job.sh`, with each `{PLACEHOLDER}` filled as follows:

| Placeholder | Value |
|---|---|
| | project directory name |
| | environment key (e.g., …) |
| | display name from profile |
| | today's date (YYYY-MM-DD) |
| | short git SHA |
| | user-provided job name |
| | filename of the generated script |
| | from env profile defaults (or user override) |
| | `cpus_per_task` from profile (or user override) |
| | user-provided GPU count |
| | from profile defaults (or user override) |
| | user-provided or profile default |
| | |
| | |
| | absolute project root path |
| | user-provided env name |
| | user-provided command |
| | scratch path from env profile |
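
The metadata lookup and `{PLACEHOLDER}` substitution described above can be sketched in Bash. This is a minimal sketch under stated assumptions, not the skill's actual implementation: the function names `project_root`, `git_commit`, and `fill_template` are hypothetical, and it assumes every placeholder appears in the template as a literal `{NAME}` token.

```bash
#!/usr/bin/env bash
# Sketch only: function names are hypothetical, not part of the skill.

# Project root: the git top-level directory, or the current directory
# when not inside a repository.
project_root() {
  git rev-parse --show-toplevel 2>/dev/null || pwd
}

# Short commit SHA, or the string "no-git" when no commit is available.
git_commit() {
  git rev-parse --short HEAD 2>/dev/null || echo "no-git"
}

# fill_template TEMPLATE_FILE KEY=VALUE...
# Prints the template with every literal {KEY} replaced by VALUE.
fill_template() {
  local text pair key value token
  text=$(cat "$1")
  shift
  for pair in "$@"; do
    key=${pair%%=*}      # text before the first '='
    value=${pair#*=}     # text after the first '='
    token="{$key}"
    # Quoting both sides makes the match and the replacement literal.
    text=${text//"$token"/"$value"}
  done
  printf '%s\n' "$text"
}
```

A call such as `fill_template templates/slurm_job.sh JOB_NAME=baseline-cifar10 COMMIT="$(git_commit)" > jobs/baseline-cifar10.sh` would then write the filled script. Placeholders that are not passed in are left untouched, which makes a missing value easy to spot in the generated file.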

- Environment setup: `module load` (`common_modules`), `conda activate`, `source .venv/activate`
- Generated script: `jobs/<job-name>.sh`
- RunAI template: `<installed-skill-dir>/templates/runai_job.sh`, with placeholders filled as follows:

| Placeholder | Value |
|---|---|
| | project directory name |
| | today's date |
| | short git SHA |
| | user-provided job name |
| | filename of the generated script |
| | from env profile |
| | from env profile |
| | GPU count |
| | CPU count from profile defaults |
| | memory from profile defaults |
| | generated from … |
| | user-provided command |

- Generated RunAI script: `jobs/<job-name>-runai.sh`
- Local template: `<installed-skill-dir>/templates/local_run.sh`
- Generated local script: `jobs/<job-name>-local.sh`
- Environment profiles: `environments.yaml`

mkdir -p <project-root>/jobs
mkdir -p <project-root>/outputs/logs/<job-name>
mkdir -p <output-dir>

- Script name: `jobs/<job-name>.sh`, with the suffix `-runai.sh` or `-local.sh` for RunAI and local runs
- RunAI submit command: `bash jobs/<job-name>-runai.sh`

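The directory setup and script naming above can be sketched as two small Bash helpers. The function names `prepare_job_dirs` and `script_path` are hypothetical, and the default output directory of `outputs/<job-name>/` is an assumption based on the paths this document mentions.

```bash
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not part of the skill.

# prepare_job_dirs PROJECT_ROOT JOB_NAME [OUTPUT_DIR]
# Creates the jobs/, log, and output directories for one job.
prepare_job_dirs() {
  local root=$1 job=$2
  local out=${3:-$root/outputs/$job}   # assumed default output location
  mkdir -p "$root/jobs" "$root/outputs/logs/$job" "$out"
}

# script_path PROJECT_ROOT JOB_NAME ENV_TYPE
# Prints where the generated script is saved for slurm, runai, or local.
script_path() {
  case $3 in
    slurm) echo "$1/jobs/$2.sh" ;;
    runai) echo "$1/jobs/$2-runai.sh" ;;
    local) echo "$1/jobs/$2-local.sh" ;;
    *) echo "unknown environment type: $3" >&2; return 1 ;;
  esac
}
```

Because `mkdir -p` succeeds when a directory already exists, `prepare_job_dirs` can be rerun safely for the same job name.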
Ask: **"Want me to run the submit command now?"**
- If yes and local: run it directly.
- If yes and remote: run the `scp` + `ssh sbatch` command (requires ssh key auth to be set up).
- If no: remind the user that the script is saved in `jobs/` and ready to submit.
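
The choice of submit command can be sketched as a function that only prints the command, so it can be shown to the user before anything runs. This is a hedged sketch: `submit_command` is a hypothetical name, the remote directory argument is an assumption, and the local form `bash jobs/<job-name>-local.sh` is assumed by analogy with the RunAI command.

```bash
#!/usr/bin/env bash
# Sketch only: prints the submit command instead of executing it.

# submit_command ENV_TYPE JOB_NAME [SSH_ALIAS] [REMOTE_DIR]
submit_command() {
  local env_type=$1 job=$2 host=${3:-} remote_dir=${4:-}
  case $env_type in
    local) echo "bash jobs/$job-local.sh" ;;
    runai) echo "bash jobs/$job-runai.sh" ;;
    slurm)
      # Remote submission: copy the script, then call sbatch over ssh.
      # Both steps rely on ssh key authentication being set up.
      if [ -z "$host" ] || [ -z "$remote_dir" ]; then
        echo "slurm needs an ssh alias and a remote directory" >&2
        return 1
      fi
      echo "scp jobs/$job.sh $host:$remote_dir/ && ssh $host sbatch $remote_dir/$job.sh"
      ;;
    *) echo "unknown environment type: $env_type" >&2; return 1 ;;
  esac
}
```

Printing rather than executing keeps the confirmation step explicit: the command is run only after the user answers yes.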

- Job log: `jobs/README.md`, `jobs/index.md`, one row per job: `| {DATE} | {JOB_NAME} | {ENV_NAME} | {COMMIT} | {RUN_COMMAND_BRIEF} |`
- Run record (`init-python-project`): `docs/runs/<DATE>-<job-name>.md`
- Project memory files and status values: `memory/`, `.agent/worktree-status.md`, `memory/evidence-board.md`, `EXP-###`, `planned`, `submitted`, `running`, `docs/runs/`, `memory/action-board.md`, `doing`, `memory/current-status.md`, `<worktree>/.agent/worktree-status.md`, `needs-verification`
- Environment profiles in `environments.yaml`:

| Key | Type | Cluster | Notes |
|---|---|---|---|
| | local | - | Current machine, tmux/nohup |
| | slurm | KAUST Ibex | |
| | slurm | UW HPC | Placeholder; update … |
| | runai | EPFL RunAI | Kubernetes; update project/image in … |

To add a cluster, extend `<installed-skill-dir>/environments.yaml`:

my-cluster:
  type: slurm  # or runai / local
  display_name: "My University HPC"
  login_node: "login.cluster.edu"
  ssh_alias: mycluster
  scheduler: slurm
  partitions:
    gpu:
      name: gpu
      flag: "--partition=gpu"
      gpu_flag: "--gres=gpu:{count}"
      max_gpus_per_job: 4
  defaults:
    partition: gpu
    gpus: 1
    cpus_per_task: 4
    mem: "32G"
    walltime: "12:00:00"
  max_walltime: "48:00:00"
  storage:
    home: "/home/{user}"
    scratch: "/scratch/{user}"
  module_system: lmod
  common_modules:
    - "cuda/12.1"
    - "python/3.11"
  notes: "..."

- Commit variable: `GIT_COMMIT`
- Outputs: `outputs/<job-name>/`
- Logs: `outputs/logs/<job-name>/`
- Version control: `jobs/`, `outputs/`, `.gitignore`

/run-experiment # interactive wizard
/run-experiment --env ibex --script train.py --gpus 2
/run-experiment --env local --script eval.py --name eval-baseline
/run-experiment --env runai --gpus 4 --name big-run
/run-experiment --env ibex --script sweep.py --name sweep --gpus 1

- Sweep config: `configs/sweep.yaml`
- Array directive: `#SBATCH --array=0-{N-1}%{max_concurrent}`
- Per-task arguments: `--config configs/sweep.yaml --config-idx $SLURM_ARRAY_TASK_ID`
- Per-task output: `outputs/<job-name>/$SLURM_ARRAY_TASK_ID/`
- Multi-GPU: `--ntasks-per-node={GPUS}`, `torchrun --nproc_per_node={GPUS}`, `srun python -m torch.distributed.launch`
- Interactive SLURM session: `srun --partition=gpu --gres=gpu:1 --cpus-per-task=4 --mem=32G --time=2:00:00 --pty bash`
- Interactive RunAI session: `runai submit <name> --image <image> --gpu 1 --interactive --stdin -- bash`, then `runai bash <name>`
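
The array directive `#SBATCH --array=0-{N-1}%{max_concurrent}` and the per-task output path can be computed from the number of sweep configurations. The sketch below is illustrative only: `array_directive` and `task_output_dir` are hypothetical helper names, and counting the entries in `configs/sweep.yaml` is left to the caller.

```bash
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not part of the skill.

# array_directive N_CONFIGS MAX_CONCURRENT
# Prints the directive for N configs indexed 0..N-1, with at most
# MAX_CONCURRENT array tasks running at the same time.
array_directive() {
  local n=$1 max=$2
  if [ "$n" -lt 1 ]; then
    echo "need at least one config" >&2
    return 1
  fi
  echo "#SBATCH --array=0-$((n - 1))%$max"
}

# task_output_dir JOB_NAME TASK_ID
# Prints the output directory for one array task; inside a job the
# task ID would come from $SLURM_ARRAY_TASK_ID.
task_output_dir() {
  echo "outputs/$1/$2/"
}
```

For a sweep of 12 configurations limited to 4 concurrent tasks, `array_directive 12 4` prints `#SBATCH --array=0-11%4`.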