run-experiment

# Run Experiment

Deploy and run ML experiment: $ARGUMENTS

## Workflow

### Step 1: Detect Environment

Read the project's `CLAUDE.md` to determine the experiment environment:

- **Local GPU**: look for local CUDA/MPS setup info
- **Remote server**: look for the SSH alias, conda env, and code directory

If no server info is found in `CLAUDE.md`, ask the user.
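The settings lookup described above can be sketched in Python. This is a heuristic sketch, not the skill's actual implementation: CLAUDE.md is free-form markdown, and the key names (`code_sync`, `wandb`, `wandb_project`, `wandb_entity`) are taken from the example at the end of this document:

```python
import re

def read_claude_md_settings(text: str) -> dict:
    """Extract simple `key: value` settings from a CLAUDE.md bullet list.

    CLAUDE.md is free-form, so this is a best-effort heuristic: it matches
    lines like "- code_sync: rsync  # comment" and ignores everything else.
    """
    settings = {}
    for key in ("code_sync", "wandb", "wandb_project", "wandb_entity"):
        # Optional bullet marker, the key, a colon, then the value up to any "#" comment.
        m = re.search(rf"^\s*[-*]?\s*{key}:\s*([^#\n]+)", text, re.MULTILINE)
        if m:
            settings[key] = m.group(1).strip()
    return settings
```

If a key is absent, the caller falls back to the defaults described in this doc (`rsync` for `code_sync`, skip for `wandb`).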

### Step 2: Pre-flight Check

Check GPU availability on the target machine.

Remote:

```bash
ssh <server> nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader
```

Local:

```bash
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader
```

Or, for Mac MPS:

```bash
python -c "import torch; print('MPS available:', torch.backends.mps.is_available())"
```

A GPU counts as free when `memory.used` is below 500 MiB.

### Step 3: Sync Code (Remote Only)

Check the project's `CLAUDE.md` for a `code_sync` setting. If not specified, default to `rsync`.

#### Option A: rsync (default)

Only sync the necessary files, not data, checkpoints, or large files. Note that `--include='*/'` and `--prune-empty-dirs` are needed so rsync descends into subdirectories while still matching only `*.py`:

```bash
rsync -avz --include='*/' --include='*.py' --exclude='*' --prune-empty-dirs <local_src>/ <server>:<remote_dst>/
```

#### Option B: git (when `code_sync: git` is set in CLAUDE.md)

Push local changes to the remote repo, then pull on the server:

```bash
# 1. Push from local
git add -A && git commit -m "sync: experiment deployment" && git push

# 2. Pull on server
ssh <server> "cd <remote_dst> && git pull"
```

Benefits: version-tracked, multi-server sync with one push, and no rsync include/exclude rules needed.

### Step 3.5: W&B Integration (when `wandb: true` in CLAUDE.md)

Skip this step entirely if `wandb` is not set, or is set to `false`, in CLAUDE.md.
Before deploying, ensure the experiment scripts have W&B logging:

1. Check whether wandb is already in the script: look for `import wandb` or `wandb.init`. If present, skip to Step 4.
2. If not present, add W&B logging to the training script:

   ```python
   import wandb

   wandb.init(project=WANDB_PROJECT, name=EXP_NAME, config={...hyperparams...})

   # Inside training loop:
   wandb.log({"train/loss": loss, "train/lr": lr, "step": step})

   # After eval:
   wandb.log({"eval/loss": eval_loss, "eval/ppl": ppl, "eval/accuracy": acc})

   # At end:
   wandb.finish()
   ```

3. Metrics to log (add whichever apply to the experiment):
   - `train/loss`: training loss per step
   - `train/lr`: learning rate
   - `eval/loss`, `eval/ppl`, `eval/accuracy`: eval metrics per epoch
   - `gpu/memory_used`: GPU memory (via `torch.cuda.max_memory_allocated()`)
   - `speed/samples_per_sec`: throughput
   - Any custom metrics the experiment already computes
4. Verify wandb login on the target machine:

   ```bash
   ssh <server> "wandb status"   # should show logged in
   # If not logged in:
   ssh <server> "wandb login <WANDB_API_KEY>"
   ```

The W&B project name and API key come from `CLAUDE.md` (see the example below). The experiment name is auto-generated from the script name plus a timestamp.

### Step 4: Deploy

#### Remote (via SSH + screen)

For each experiment, create a dedicated screen session with GPU binding:

```bash
ssh <server> "screen -dmS <exp_name> bash -c '\
  eval \"\$(<conda_path>/conda shell.bash hook)\" && \
  conda activate <env> && \
  CUDA_VISIBLE_DEVICES=<gpu_id> python <script> <args> 2>&1 | tee <log_file>'"
```
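Quoting the nested ssh/screen/bash layers by hand is error-prone; a helper that assembles the exact command shown above can be sketched like this (the function and parameter names are illustrative):

```python
def build_remote_launch(server: str, exp_name: str, conda_path: str, env: str,
                        gpu_id: int, script: str, args: str, log_file: str) -> str:
    """Assemble the ssh + screen launch command from Step 4.

    Inner double quotes and `$` are backslash-escaped so they survive the
    outer double-quoted ssh argument and expand on the remote side.
    """
    inner = (
        f'eval \\"\\$({conda_path}/conda shell.bash hook)\\" && '
        f"conda activate {env} && "
        f"CUDA_VISIBLE_DEVICES={gpu_id} python {script} {args} 2>&1 | tee {log_file}"
    )
    return f"ssh {server} \"screen -dmS {exp_name} bash -c '{inner}'\""
```

This only formats the command string; it assumes `script`, `args`, and `log_file` contain no single quotes of their own.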

#### Local

Linux with CUDA:

```bash
CUDA_VISIBLE_DEVICES=<gpu_id> python <script> <args> 2>&1 | tee <log_file>
```

Mac with MPS (PyTorch uses MPS automatically):

```bash
python <script> <args> 2>&1 | tee <log_file>
```

For local long-running jobs, use `run_in_background: true` to keep the conversation responsive.

### Step 5: Verify Launch

Remote:

```bash
ssh <server> "screen -ls"
```

Local: check that the process is running and the GPU is allocated.

### Step 6: Feishu Notification (if configured)

After deployment is verified, check `~/.claude/feishu.json`:

- Send an `experiment_done` notification: which experiments launched, which GPUs, and the estimated time
- If the config is absent or the mode is `"off"`, skip entirely (no-op)

## Key Rules

- ALWAYS check GPU availability first; never blindly assign GPUs
- Each experiment gets its own screen session + GPU (remote) or its own background process (local)
- Use `tee` to save logs for later inspection
- Run deployment commands with `run_in_background: true` to keep the conversation responsive
- Report back: which GPU, which screen/process, what command, and the estimated time
- If there are multiple experiments, launch them in parallel on different GPUs

## CLAUDE.md Example

Users should add their server info to their project's `CLAUDE.md`:

```markdown
## Remote Server

- SSH: `ssh my-gpu-server`
- GPU: 4x A100 (80GB each)
- Conda: `eval "$(/opt/conda/bin/conda shell.bash hook)" && conda activate research`
- Code dir: `/home/user/experiments/`
- code_sync: rsync          # default; or set to "git" for the git push/pull workflow
- wandb: false              # set to "true" to auto-add W&B logging to experiment scripts
- wandb_project: my-project # W&B project name (required if wandb: true)
- wandb_entity: my-team     # W&B team/user (optional; uses the default if omitted)

## Local Environment

- Mac MPS / Linux CUDA
- Conda env: `ml` (Python 3.10 + PyTorch)
```

> **W&B setup**: Run `wandb login` on your server once (or set the `WANDB_API_KEY` env var). The skill reads the project/entity from CLAUDE.md and adds `wandb.init()` + `wandb.log()` to your training scripts automatically. Dashboard: `https://wandb.ai/<entity>/<project>`.
