# run-experiment
Deploy and run ML experiment: $ARGUMENTS
## Workflow
### Step 1: Detect Environment
Read the project's CLAUDE.md to determine the experiment environment:

- Local GPU: Look for local CUDA/MPS setup info
- Remote server: Look for SSH alias, conda env, code directory

If no server info is found in CLAUDE.md, ask the user.

### Step 2: Pre-flight Check
Check GPU availability on the target machine:

Remote:

```bash
ssh <server> nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader
```

Local:

```bash
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader
```

or for Mac MPS:

```bash
python -c "import torch; print('MPS available:', torch.backends.mps.is_available())"
```

Free GPU = memory.used < 500 MiB.
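Applied in code, the criterion above amounts to parsing that CSV output. A minimal sketch (the function name and the 500 MiB default are illustrative, not part of this command):

```python
def free_gpus(csv_text: str, threshold_mib: int = 500) -> list[int]:
    """Return GPU indices whose used memory is below the threshold.

    Expects `nvidia-smi --query-gpu=index,memory.used,memory.total
    --format=csv,noheader` output, e.g. "0, 312 MiB, 81920 MiB".
    """
    free = []
    for line in csv_text.strip().splitlines():
        index, used, _total = [field.strip() for field in line.split(",")]
        used_mib = int(used.split()[0])  # "312 MiB" -> 312
        if used_mib < threshold_mib:
            free.append(int(index))
    return free

sample = "0, 312 MiB, 81920 MiB\n1, 66666 MiB, 81920 MiB"
print(free_gpus(sample))  # GPU 0 is free, GPU 1 is busy
```

A busy GPU typically reports tens of GiB in use, so a few hundred MiB of driver overhead still counts as free under this threshold.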
### Step 3: Sync Code (Remote Only)
Check the project's CLAUDE.md for a `code_sync` setting. If not specified, default to `rsync`.

#### Option A: rsync (default)

Only sync necessary files — NOT data, checkpoints, or large files:

```bash
rsync -avz --include='*.py' --exclude='*' <local_src>/ <server>:<remote_dst>/
```

#### Option B: git (when `code_sync: git` is set in CLAUDE.md)
Push local changes to the remote repo, then pull on the server:

```bash
# 1. Push from local
git add -A && git commit -m "sync: experiment deployment" && git push
# 2. Pull on server
ssh <server> "cd <remote_dst> && git pull"
```

Benefits: version-tracked, multi-server sync with one push, no rsync include/exclude rules needed.

### Step 3.5: W&B Integration (when `wandb: true` in CLAUDE.md)
Skip this step entirely if `wandb` is not set or is `false` in CLAUDE.md.

Before deploying, ensure the experiment scripts have W&B logging:

- Check if wandb is already in the script — look for `import wandb` or `wandb.init`. If present, skip to Step 4.
- If not present, add W&B logging to the training script:

  ```python
  import wandb

  wandb.init(project=WANDB_PROJECT, name=EXP_NAME, config={...hyperparams...})

  # Inside training loop:
  wandb.log({"train/loss": loss, "train/lr": lr, "step": step})

  # After eval:
  wandb.log({"eval/loss": eval_loss, "eval/ppl": ppl, "eval/accuracy": acc})

  # At end:
  wandb.finish()
  ```

- Metrics to log (add whichever apply to the experiment):
  - `train/loss` — training loss per step
  - `train/lr` — learning rate
  - `eval/loss`, `eval/ppl`, `eval/accuracy` — eval metrics per epoch
  - `gpu/memory_used` — GPU memory (via `torch.cuda.max_memory_allocated()`)
  - `speed/samples_per_sec` — throughput
  - Any custom metrics the experiment already computes
- Verify wandb login on the target machine:

  ```bash
  ssh <server> "wandb status"  # should show logged in
  # If not logged in:
  ssh <server> "wandb login <WANDB_API_KEY>"
  ```

The W&B project name and API key come from CLAUDE.md (see example below). The experiment name is auto-generated from the script name + timestamp.
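One way to build that auto-generated name, assuming a `stem-timestamp` shape (the exact format is an illustration, not prescribed by this command):

```python
from datetime import datetime
from pathlib import Path


def make_exp_name(script_path, now=None):
    """Build a run name like 'train_gpt-20250101-120000' from script stem + timestamp."""
    stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
    return f"{Path(script_path).stem}-{stamp}"


print(make_exp_name("experiments/train_gpt.py"))
```

Passing the resulting string as `name=EXP_NAME` to `wandb.init()` keeps runs sortable by script and launch time in the W&B dashboard.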
### Step 4: Deploy
#### Remote (via SSH + screen)
For each experiment, create a dedicated screen session with GPU binding:

```bash
ssh <server> "screen -dmS <exp_name> bash -c '\
eval \"\$(<conda_path>/conda shell.bash hook)\" && \
conda activate <env> && \
CUDA_VISIBLE_DEVICES=<gpu_id> python <script> <args> 2>&1 | tee <log_file>'"
```

#### Local

```bash
# Linux with CUDA
CUDA_VISIBLE_DEVICES=<gpu_id> python <script> <args> 2>&1 | tee <log_file>

# Mac with MPS (PyTorch uses MPS automatically)
python <script> <args> 2>&1 | tee <log_file>
```

For local long-running jobs, use `run_in_background: true` to keep the conversation responsive.
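For reference, the local launch could also be driven from Python (hypothetical helper; note that unlike the `tee` pipeline above, output here goes only to the log file, with no console echo):

```python
import os
import subprocess
import sys


def launch_local(script, args, gpu_id, log_file):
    """Start a local run pinned to one GPU, with stdout+stderr going to a log file."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))
    log = open(log_file, "w")
    # Both streams land in the log file; nothing is echoed to the console.
    return subprocess.Popen(
        [sys.executable, script, *args],
        env=env, stdout=log, stderr=subprocess.STDOUT,
    )
```

The returned `Popen` handle gives you the PID for the launch verification in Step 5.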
### Step 5: Verify Launch
Remote:

```bash
ssh <server> "screen -ls"
```

Local:

Check process is running and GPU is allocated.
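The local process check can be sketched with a signal-0 probe (`is_running` is a hypothetical helper; POSIX-only):

```python
import os


def is_running(pid: int) -> bool:
    """True if a process with this PID exists (signal 0 probes without killing)."""
    try:
        os.kill(pid, 0)
        return True
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but owned by another user
```

For the GPU side, re-run the `nvidia-smi` query from Step 2 and confirm the target GPU's used memory has risen.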
### Step 6: Feishu Notification (if configured)
After deployment is verified, check `~/.claude/feishu.json`:

- Send an `experiment_done` notification: which experiments launched, which GPUs, estimated time
- If config absent or mode `"off"`: skip entirely (no-op)
## Key Rules
- ALWAYS check GPU availability first — never blindly assign GPUs
- Each experiment gets its own screen session + GPU (remote) or background process (local)
- Use `tee` to save logs for later inspection
- Run deployment commands with `run_in_background: true` to keep conversation responsive
- Report back: which GPU, which screen/process, what command, estimated time
- If multiple experiments, launch them in parallel on different GPUs
## CLAUDE.md Example
Users should add their server info to their project's CLAUDE.md:

```markdown
## Remote Server
- SSH: ssh my-gpu-server
- GPU: 4x A100 (80GB each)
- Conda: eval "$(/opt/conda/bin/conda shell.bash hook)" && conda activate research
- Code dir: /home/user/experiments/
- code_sync: rsync  # default. Or set to "git" for git push/pull workflow
- wandb: false  # set to "true" to auto-add W&B logging to experiment scripts
- wandb_project: my-project  # W&B project name (required if wandb: true)
- wandb_entity: my-team  # W&B team/user (optional, uses default if omitted)

## Local Environment
- Mac MPS / Linux CUDA
- Conda env: ml (Python 3.10 + PyTorch)
```

> **W&B setup**: Run `wandb login` on your server once (or set `WANDB_API_KEY` env var). The skill reads project/entity from CLAUDE.md and adds `wandb.init()` + `wandb.log()` to your training scripts automatically. Dashboard: `https://wandb.ai/<entity>/<project>`.
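A consumer of this file might pull the settings out of the bullet list with a small parser; a sketch assuming the exact `- key: value  # comment` bullet shape shown above:

```python
import re


def read_settings(claude_md: str) -> dict:
    """Extract `- key: value` bullets (comments stripped) from CLAUDE.md text."""
    settings = {}
    for m in re.finditer(r"^- (\w+):\s*([^#\n]+)", claude_md, re.MULTILINE):
        settings[m.group(1)] = m.group(2).strip()
    return settings


example = """\
- code_sync: rsync  # default
- wandb: false
- wandb_project: my-project
"""
print(read_settings(example))
```

Multi-word keys like `Code dir` would need a looser pattern; the sketch covers only the single-word settings this command reads (`code_sync`, `wandb`, `wandb_project`, `wandb_entity`).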