run-experiment

# Run Experiment

Deploy and run ML experiment: $ARGUMENTS

## Workflow

### Step 1: Detect Environment

Read the project's `CLAUDE.md` to determine the experiment environment:

- **Local GPU**: look for local CUDA/MPS setup info
- **Remote server**: look for the SSH alias, conda env, and code directory

If no server info is found in `CLAUDE.md`, ask the user.
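The settings lookup described above can be sketched in Python. This is a heuristic sketch, not the skill's actual implementation: CLAUDE.md is free-form markdown, and the key names (`code_sync`, `wandb`, `wandb_project`, `wandb_entity`) are taken from the example at the end of this document:

```python
import re

def read_claude_md_settings(text: str) -> dict:
    """Extract simple `key: value` settings from a CLAUDE.md bullet list.

    CLAUDE.md is free-form, so this is a best-effort heuristic: it matches
    lines like "- code_sync: rsync  # comment" and ignores everything else.
    """
    settings = {}
    for key in ("code_sync", "wandb", "wandb_project", "wandb_entity"):
        # Optional bullet marker, the key, a colon, then the value up to any "#" comment.
        m = re.search(rf"^\s*[-*]?\s*{key}:\s*([^#\n]+)", text, re.MULTILINE)
        if m:
            settings[key] = m.group(1).strip()
    return settings
```

If a key is absent, the caller falls back to the defaults described in this doc (`rsync` for `code_sync`, skip for `wandb`).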

### Step 2: Pre-flight Check

Check GPU availability on the target machine.

Remote:

```bash
ssh <server> nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader
```

Local:

```bash
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader
```

Or, for Mac MPS:

```bash
python -c "import torch; print('MPS available:', torch.backends.mps.is_available())"
```

A GPU counts as free when `memory.used` is below 500 MiB.

### Step 3: Sync Code (Remote Only)

Check the project's `CLAUDE.md` for a `code_sync` setting. If not specified, default to `rsync`.

#### Option A: rsync (default)

Only sync the necessary files, not data, checkpoints, or large files. Note that `--include='*/'` and `--prune-empty-dirs` are needed so rsync descends into subdirectories while still matching only `*.py`:

```bash
rsync -avz --include='*/' --include='*.py' --exclude='*' --prune-empty-dirs <local_src>/ <server>:<remote_dst>/
```

#### Option B: git (when `code_sync: git` is set in CLAUDE.md)

Push local changes to the remote repo, then pull on the server:

```bash
# 1. Push from local
git add -A && git commit -m "sync: experiment deployment" && git push

# 2. Pull on server
ssh <server> "cd <remote_dst> && git pull"
```

Benefits: version-tracked, multi-server sync with one push, and no rsync include/exclude rules needed.

### Step 3.5: W&B Integration (when `wandb: true` in CLAUDE.md)

Skip this step entirely if `wandb` is not set, or is set to `false`, in CLAUDE.md.
Before deploying, ensure the experiment scripts have W&B logging:

1. Check whether wandb is already in the script: look for `import wandb` or `wandb.init`. If present, skip to Step 4.
2. If not present, add W&B logging to the training script:

   ```python
   import wandb

   wandb.init(project=WANDB_PROJECT, name=EXP_NAME, config={...hyperparams...})

   # Inside training loop:
   wandb.log({"train/loss": loss, "train/lr": lr, "step": step})

   # After eval:
   wandb.log({"eval/loss": eval_loss, "eval/ppl": ppl, "eval/accuracy": acc})

   # At end:
   wandb.finish()
   ```

3. Metrics to log (add whichever apply to the experiment):
   - `train/loss`: training loss per step
   - `train/lr`: learning rate
   - `eval/loss`, `eval/ppl`, `eval/accuracy`: eval metrics per epoch
   - `gpu/memory_used`: GPU memory (via `torch.cuda.max_memory_allocated()`)
   - `speed/samples_per_sec`: throughput
   - Any custom metrics the experiment already computes
4. Verify wandb login on the target machine:

   ```bash
   ssh <server> "wandb status"   # should show logged in
   # If not logged in:
   ssh <server> "wandb login <WANDB_API_KEY>"
   ```

The W&B project name and API key come from `CLAUDE.md` (see the example below). The experiment name is auto-generated from the script name plus a timestamp.

### Step 4: Deploy

#### Remote (via SSH + screen)

For each experiment, create a dedicated screen session with GPU binding:

```bash
ssh <server> "screen -dmS <exp_name> bash -c '\
  eval \"\$(<conda_path>/conda shell.bash hook)\" && \
  conda activate <env> && \
  CUDA_VISIBLE_DEVICES=<gpu_id> python <script> <args> 2>&1 | tee <log_file>'"
```
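Quoting the nested ssh/screen/bash layers by hand is error-prone; a helper that assembles the exact command shown above can be sketched like this (the function and parameter names are illustrative):

```python
def build_remote_launch(server: str, exp_name: str, conda_path: str, env: str,
                        gpu_id: int, script: str, args: str, log_file: str) -> str:
    """Assemble the ssh + screen launch command from Step 4.

    Inner double quotes and `$` are backslash-escaped so they survive the
    outer double-quoted ssh argument and expand on the remote side.
    """
    inner = (
        f'eval \\"\\$({conda_path}/conda shell.bash hook)\\" && '
        f"conda activate {env} && "
        f"CUDA_VISIBLE_DEVICES={gpu_id} python {script} {args} 2>&1 | tee {log_file}"
    )
    return f"ssh {server} \"screen -dmS {exp_name} bash -c '{inner}'\""
```

This only formats the command string; it assumes `script`, `args`, and `log_file` contain no single quotes of their own.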

#### Local

Linux with CUDA:

```bash
CUDA_VISIBLE_DEVICES=<gpu_id> python <script> <args> 2>&1 | tee <log_file>
```

Mac with MPS (PyTorch uses MPS automatically):

```bash
python <script> <args> 2>&1 | tee <log_file>
```

For local long-running jobs, use `run_in_background: true` to keep the conversation responsive.

### Step 5: Verify Launch

Remote:

```bash
ssh <server> "screen -ls"
```

Local: check that the process is running and the GPU is allocated.

### Step 6: Feishu Notification (if configured)

After deployment is verified, check `~/.claude/feishu.json`:

- Send an `experiment_done` notification: which experiments launched, which GPUs, and the estimated time
- If the config is absent or the mode is `"off"`, skip entirely (no-op)

## Key Rules

- ALWAYS check GPU availability first; never blindly assign GPUs
- Each experiment gets its own screen session + GPU (remote) or its own background process (local)
- Use `tee` to save logs for later inspection
- Run deployment commands with `run_in_background: true` to keep the conversation responsive
- Report back: which GPU, which screen/process, what command, and the estimated time
- If there are multiple experiments, launch them in parallel on different GPUs

## CLAUDE.md Example

Users should add their server info to their project's `CLAUDE.md`:

```markdown
## Remote Server

- SSH: `ssh my-gpu-server`
- GPU: 4x A100 (80GB each)
- Conda: `eval "$(/opt/conda/bin/conda shell.bash hook)" && conda activate research`
- Code dir: `/home/user/experiments/`
- code_sync: rsync          # default; or set to "git" for the git push/pull workflow
- wandb: false              # set to "true" to auto-add W&B logging to experiment scripts
- wandb_project: my-project # W&B project name (required if wandb: true)
- wandb_entity: my-team     # W&B team/user (optional; uses the default if omitted)

## Local Environment

- Mac MPS / Linux CUDA
- Conda env: `ml` (Python 3.10 + PyTorch)
```

> **W&B setup**: Run `wandb login` on your server once (or set the `WANDB_API_KEY` env var). The skill reads the project/entity from CLAUDE.md and adds `wandb.init()` + `wandb.log()` to your training scripts automatically. Dashboard: `https://wandb.ai/<entity>/<project>`.
