exec-remote

Remote Execution Skill

This skill handles running code on remote GPU or TPU clusters via SkyPilot.

1. Determine Target Device

Identify the target device from the user's request:
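| Target | Cluster name file | Launch script | UV extra | Env prefix |
|--------|-------------------|---------------|----------|------------|
| GPU | `.cluster_name_gpu` | `launch_gpu.sh` | `gpu` | `export CUDA_VISIBLE_DEVICES=0;` |
| TPU | `.cluster_name_tpu` | `launch_tpu.sh` | `tpu` | (none) |
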
Execution Instructions: Before running the launch script, you must find its absolute path. It is located in the `scripts/` directory alongside this skill definition. Use your file search tools (e.g., `glob` or `find`) to locate `launch_gpu.sh` or `launch_tpu.sh` before executing it.
If the user does not specify a device, ask them which one to use.
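
The lookup described above can be sketched as a small shell helper. This is only an illustration: the function name `find_launch_script` and its `search_root` argument are hypothetical, and it assumes the launch script lives in a `scripts/` directory somewhere under the search root.

```shell
# Hypothetical helper: print the absolute path of launch_<device>.sh.
# Assumes the script sits in a scripts/ directory below search_root.
find_launch_script() {
  local device="$1"            # "gpu" or "tpu"
  local search_root="${2:-.}"  # assumed default: current directory
  local match

  case "$device" in
    gpu|tpu) ;;
    *) echo "unknown device '$device' (expected gpu or tpu)" >&2; return 1 ;;
  esac

  # Take the first match; adjust if several copies of the skill exist.
  match="$(find "$search_root" -type f -path "*/scripts/launch_${device}.sh" 2>/dev/null | head -n 1)"
  if [ -z "$match" ]; then
    echo "launch_${device}.sh not found under $search_root" >&2
    return 1
  fi

  # Resolve to an absolute path without relying on realpath being installed.
  echo "$(cd "$(dirname "$match")" && pwd)/$(basename "$match")"
}
```

For example, `find_launch_script gpu` prints the absolute path that can then be passed to `bash` in the provisioning step.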

2. Prerequisites

  • The cluster must already be provisioned. Check that the corresponding cluster name file (`.cluster_name_gpu` or `.cluster_name_tpu`) exists and is non-empty in the project root.
  • If the file does not exist or is empty, ask the user to provision a cluster first using the appropriate launch script.
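
One way to express this check in shell is the `-s` test, which is true only when a file exists and is non-empty. The helper below is a minimal sketch; the name `read_cluster_name` and the optional project-root argument are hypothetical.

```shell
# Hypothetical helper: print the cluster name for a device, or fail if the
# cluster name file is missing or empty.
read_cluster_name() {
  local device="$1"      # "gpu" or "tpu"
  local root="${2:-.}"   # assumed default: current directory is the project root
  local file="$root/.cluster_name_${device}"

  # -s is true only if the file exists and has a size greater than zero.
  if [ ! -s "$file" ]; then
    echo "No provisioned ${device} cluster: ${file} is missing or empty." >&2
    return 1
  fi

  # Strip the trailing newline and any stray whitespace.
  tr -d '[:space:]' < "$file"
}
```

A failing call is the signal to ask the user to provision a cluster before going further.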

3. Cluster Management

Provisioning

Note: First locate the scripts as instructed above, then run them.

GPU — common accelerator types: H100:1, A100:1, L4:1

bash <absolute_path_to_launch_gpu.sh> <accelerator_type> <experiment_name>

TPU — common accelerator types: tpu-v4-8, tpu-v4-16, tpu-v6e-1, tpu-v6e-4

bash <absolute_path_to_launch_tpu.sh> <accelerator_type> <experiment_name>

The launch script automatically updates the corresponding `.cluster_name_*` file.

Teardown

GPU

sky down $(cat .cluster_name_gpu) -y

TPU

sky down $(cat .cluster_name_tpu) -y

4. Execution Command

GPU

sky exec $(cat .cluster_name_gpu) --workdir . "export CUDA_VISIBLE_DEVICES=0; uv run --extra gpu python <PATH_TO_SCRIPT> [ARGS]"
  • `export CUDA_VISIBLE_DEVICES=0;` restricts the job to GPU 0, so it always runs on a single, known device. Adjust for multi-GPU jobs.
  • `--extra gpu` activates GPU optional dependencies (e.g. `jax[cuda]`).

TPU

sky exec $(cat .cluster_name_tpu) --workdir . "uv run --extra tpu python <PATH_TO_SCRIPT> [ARGS]"
  • `--extra tpu` activates TPU optional dependencies (e.g. `jax[tpu]`).

Common flags

  • `--workdir .` syncs the current local directory to the remote instance before running.
  • For pytest, use `python -m pytest <test_path>` instead of calling pytest directly.
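
The GPU and TPU commands differ only in the cluster name file, the UV extra, and the environment prefix, so they can be folded into one wrapper. The sketch below is illustrative: `remote_run` is a hypothetical name, it must be run from the project root, and it joins its arguments with spaces, so arguments that themselves contain spaces would need extra quoting.

```shell
# Hypothetical wrapper around the sky exec commands shown above.
# Usage: remote_run <gpu|tpu> <python arguments...>
remote_run() {
  local device="$1"
  shift
  local prefix cluster
  local file=".cluster_name_${device}"

  # Per-device environment prefix, taken from the target table in section 1.
  case "$device" in
    gpu) prefix="export CUDA_VISIBLE_DEVICES=0; " ;;
    tpu) prefix="" ;;
    *) echo "unknown device '$device' (expected gpu or tpu)" >&2; return 1 ;;
  esac

  # Refuse to run if the cluster has not been provisioned.
  if [ ! -s "$file" ]; then
    echo "$file is missing or empty; provision a cluster first." >&2
    return 1
  fi
  cluster="$(tr -d '[:space:]' < "$file")"

  # "$*" joins the remaining arguments with single spaces.
  sky exec "$cluster" --workdir . "${prefix}uv run --extra ${device} python $*"
}
```

With this in place, `remote_run tpu -m pytest src/lynx/test/` issues the same command as the TPU form above.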

5. Usage Examples

Run a benchmark on GPU:
sky exec $(cat .cluster_name_gpu) --workdir . "export CUDA_VISIBLE_DEVICES=0; uv run --extra gpu python src/lynx/perf/benchmark_train.py"

Run tests on TPU:
sky exec $(cat .cluster_name_tpu) --workdir . "uv run --extra tpu python -m pytest src/lynx/test/"

6. Operational Notes

  • Logs: SkyPilot streams `stdout` and `stderr` directly to the terminal.
  • Interruption: `Ctrl+C` may not kill the remote process; check the SkyPilot docs for cleanup if needed.
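
For the interruption case, one possible cleanup path is to list the jobs on the cluster and cancel the stray one. The sketch below assumes the `sky queue` and `sky cancel` subcommands of the SkyPilot CLI behave as documented for your installed version; the function names are hypothetical, and both functions expect to be run from the project root.

```shell
# Hypothetical helpers for cleaning up after an interrupted sky exec.

# List the jobs on the device's cluster, including their job IDs.
remote_jobs() {
  local device="$1"   # "gpu" or "tpu"
  sky queue "$(tr -d '[:space:]' < ".cluster_name_${device}")"
}

# Cancel one job by ID (the ID is read from the remote_jobs output).
cancel_remote_job() {
  local device="$1"
  local job_id="$2"
  sky cancel "$(tr -d '[:space:]' < ".cluster_name_${device}")" "$job_id"
}
```

A typical sequence would be `remote_jobs gpu` to find the job ID, then `cancel_remote_job gpu <job_id>`.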