skypilot

SkyPilot Skill

SkyPilot is a unified framework to run AI workloads on any cloud, Slurm or Kubernetes. It provides a single interface to launch clusters, run jobs, and serve models across 25+ clouds (AWS, GCP, Azure, Coreweave, Nebius, Lambda, Together AI, RunPod, and more), Kubernetes clusters, and Slurm clusters.

When to Use SkyPilot

Use SkyPilot when you need to:
  • Manage compute resources on any cloud, Slurm, or Kubernetes cluster
  • Launch CPU/GPU/TPU (GB300, GB200, B200, H200, H100, etc.) on any cloud, Kubernetes or Slurm
  • Run training, fine-tuning, or batch inference jobs
  • Serve models with autoscaling and multi-cloud replicas (SkyServe)
  • Run long-running jobs with automatic lifecycle management and recovery (managed jobs)
  • Find the cheapest or most available GPU across clouds
Don't use SkyPilot for:
  • Local-only workloads (use Docker/conda directly)

Capabilities: When to Use What

SkyPilot has three core abstractions. Use the right one for each stage of your workflow:
1. SkyPilot Clusters (`sky launch` / `sky exec`): interactive development and debugging
  • Use during initial development, debugging, and experimentation
  • Launch a cluster, SSH in or connect VSCode/Cursor (`code --remote ssh-remote+CLUSTER`), iterate quickly
  • Cluster stays up until you stop/down it or autostop triggers
  • Best for: prototyping, debugging, short experiments
2. Managed Jobs (`sky jobs launch`): long-running training and batch jobs
  • Use when submitting long-running jobs that should run unattended
  • Manages the full lifecycle: provisioning, execution, recovery, and teardown
  • Automatically recovers from spot preemptions, quota limits, and transient failures
  • Works across clouds, Kubernetes, and Slurm (handles preemptions and quota)
  • Best for: training runs, fine-tuning, hyperparameter sweeps, batch inference
3. SkyServe (`sky serve up`): production model serving
  • Use when serving models at scale with autoscaling
  • Start with `sky launch` + open port to test your serving setup, then use `sky serve up` to scale
  • Provides load balancing, autoscaling, and multi-cloud replicas
  • Best for: model serving endpoints, API services

Before You Start (Agent Bootstrap)

Bootstrap to confirm SkyPilot is installed, connected to an API server, and has cloud credentials. Once confirmed, skip straight to the user's task.

Step 1: Check installation and API server connectivity
```bash
sky api info
```

| Output contains | Meaning | Next action |
|---|---|---|
| Server version and status | Server is running and connected | Bootstrap done. Skip to user's task. |
| `No SkyPilot API server is connected` | No server connected | Go to "Start or connect a server" below. |
| `Could not connect to SkyPilot API server` | Remote server unreachable or auth expired | Tell the user and suggest `sky api login --relogin -e <endpoint>` to reconnect. |
| `command not found: sky` | SkyPilot not installed | Go to "Install SkyPilot" below. |

Install SkyPilot (only if the `sky` command is not found):
```bash
pip install "skypilot[aws,gcp,kubernetes]"  # Pick clouds the user needs
```
Ask the user which clouds they need if unclear, then re-run `sky api info`.

Start or connect a server (only if no server is connected):
Ask the user: "Do you have an existing SkyPilot API server to connect to, or should I start one locally?"
  • Connect to existing server: `sky api login -e <API_SERVER_URL>` (get the URL from the user).
  • Start locally: `sky api start`
After either path, re-run `sky api info` to confirm the server is reachable.

Step 2: Check cloud credentials (only for fresh setups; skip if the server was already running)
```bash
sky check -o json
```
This shows which clouds are enabled or disabled. If the user's target cloud is not enabled, guide them through credential setup (see Troubleshooting).
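
The Step 1 decision table can be expressed as a small dispatch function. This is a minimal sketch, assuming `sky api info` exits with status 0 when a server is connected and that a missing binary yields status 127 (common shell behavior, not stated above). The function name and the action labels (`ready`, `start_or_connect`, `relogin`, `install`, `unknown`) are hypothetical.

```bash
# Map the exit status ($1) and captured output ($2) of `sky api info`
# to the next bootstrap action. Labels are hypothetical, not SkyPilot output.
next_bootstrap_action() {
  status="$1"
  output="$2"
  case "$output" in
    *"command not found"*)                        echo install; return 0 ;;
    *"No SkyPilot API server is connected"*)      echo start_or_connect; return 0 ;;
    *"Could not connect to SkyPilot API server"*) echo relogin; return 0 ;;
  esac
  if [ "$status" -eq 127 ]; then
    echo install            # assumed: 127 means the shell could not find `sky`
  elif [ "$status" -eq 0 ]; then
    echo ready              # assumed: status 0 means a server is connected
  else
    echo unknown            # unrecognized failure: show the output to the user
  fi
}

# Usage (not executed here):
#   output=$(sky api info 2>&1); next_bootstrap_action "$?" "$output"
```

Matching on the message text first keeps the function aligned with the table, and the exit status is only a fallback for output that the table does not cover.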

Essential Commands

Use `-o json` with status/query commands to get structured JSON output instead of tables.

Clusters (interactive development and debugging):

| Command | Description |
|---|---|
| `sky launch -c NAME task.yaml` | Launch a cluster or run a task |
| `sky exec NAME task.yaml` | Run task on existing cluster (skips provisioning); syncs workdir each time |
| `sky exec NAME task.yaml -d` | Same, but detach immediately (don't stream logs) |
| `sky status -o json` | Show all clusters |
| `sky logs NAME` | Stream job logs from a cluster |
| `sky logs NAME --no-follow` | Print existing logs and exit immediately |
| `sky logs NAME --tail 50` | Print last 50 lines of logs and exit |
| `sky logs NAME --status` | Exit with a status code: 0=succeeded, 100=failed, 101=not finished, 102=not found, 103=cancelled |
| `sky queue NAME -o json` | List jobs on a cluster with status (structured JSON) |
| `sky stop NAME` / `sky start NAME` | Stop/restart to save costs (preserves disk) |
| `sky down NAME` | Tear down a cluster completely |
| `sky gpus list -o json` | List available GPU types across clouds |

Managed Jobs (long-running unattended workloads):

| Command | Description |
|---|---|
| `sky jobs launch task.yaml` | Launch a managed job (auto lifecycle + recovery) |
| `sky jobs queue -o json` | Show all managed jobs and their status |
| `sky jobs logs JOB_ID` | Stream logs from a managed job |
| `sky jobs cancel JOB_ID` | Cancel a managed job |

SkyServe (model serving with autoscaling):

| Command | Description |
|---|---|
| `sky serve up serve.yaml -n NAME` | Start a model serving service |
| `sky serve status NAME` | Show service status and endpoint URL |
| `sky serve update NAME new.yaml` | Update a running service (rolling) |
| `sky serve down NAME` | Tear down a service |

For complete CLI reference, see CLI Reference.
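
The exit codes listed for `sky logs NAME --status` lend themselves to a lookup helper in scripts. The following is a small sketch: the function name and the label spellings are hypothetical, while the code-to-meaning mapping is taken from the table above.

```bash
# Translate the exit code of `sky logs CLUSTER JOB_ID --status` into a label.
job_status_label() {
  case "$1" in
    0)   echo succeeded ;;
    100) echo failed ;;
    101) echo not_finished ;;
    102) echo not_found ;;
    103) echo cancelled ;;
    *)   echo unknown ;;    # any code not listed in the table
  esac
}

# Usage (not executed here):
#   sky logs mycluster 1 --status; job_status_label "$?"
```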

Quick Start

```bash
# Launch a GPU cluster
sky launch -c mycluster --gpus H100 -- nvidia-smi

# Run a task from YAML
sky launch -c mycluster task.yaml

# SSH into cluster
ssh mycluster

# Connect VSCode or Cursor to the cluster for interactive development
code --remote ssh-remote+mycluster /home/user/sky_workdir
# or: cursor --remote ssh-remote+mycluster /home/user/sky_workdir

# Tear down
sky down mycluster
```

Task YAML Structure

The task YAML is SkyPilot's primary interface. All fields are optional.
```yaml
# task.yaml
name: my-training-job

# Local directory to sync to remote ~/sky_workdir
workdir: .

# Number of nodes (for distributed training)
num_nodes: 1

resources:
  # GPU/TPU accelerators (SkyPilot auto-selects the cheapest cloud/region)
  accelerators: H200:8
  # Optional: pin to a specific cloud/region/infra
  # infra: aws  # or aws/us-east-1, k8s, ssh/my-pool
  # If infra is left out, SkyPilot automatically fails over across all
  # enabled clouds/regions to find the cheapest available option.
  # Use spot instances for cost savings
  use_spot: false
  # Disk size in GB
  disk_size: 256
  # Open ports for serving
  ports: 8080

# Environment variables (accessible in file_mounts, setup, and run)
envs:
  MODEL_NAME: my-model
  BATCH_SIZE: 32

# Setup: runs once on cluster creation, cached on reuse
setup: |
  pip install torch transformers

# Run: the main command
run: |
  python train.py --model $MODEL_NAME --batch-size $BATCH_SIZE
```

For complete YAML schema including file mounts, environment variables set by SkyPilot, and advanced fields, see [YAML Specification](references/yaml-spec.md).

GPU and Cloud Selection

IMPORTANT: Let SkyPilot choose the cloud and region. Do NOT manually pick a cloud/region/instance by parsing `sky gpus list` output. SkyPilot's optimizer automatically selects the cheapest available option across all enabled clouds. Only specify `infra:` when the user explicitly requests a specific cloud or region.

Default behavior (recommended): Just specify the GPU type. SkyPilot finds the cheapest cloud/region automatically:
```yaml
resources:
  accelerators: H200:8  # SkyPilot picks the cheapest cloud/region with H200:8
```

If the user doesn't specify a GPU type, ask them what GPU they need (or what model/workload they're running so you can recommend one). Do NOT run `sky gpus list` and pick for them. Instead, present options and let the user decide, or use `any_of` to let SkyPilot maximize availability:
```yaml
# Let SkyPilot choose from multiple acceptable GPU types (cheapest wins)
resources:
  any_of:
    - accelerators: H100:8
    - accelerators: A100-80GB:8
    - accelerators: A100:8
```

Use `ordered` only when the user has a strict preference:

```yaml
# Try H100 on AWS first, fall back to H100 on GCP, then A100-80GB on AWS
resources:
  ordered:
    - infra: aws/us-east-1
      accelerators: H100:8
    - infra: gcp/us-central1
      accelerators: H100:8
    - infra: aws/us-west-2
      accelerators: A100-80GB:8
```

Only set `infra:` when the user explicitly says something like "use AWS" or "run on GCP us-central1":

```yaml
resources:
  infra: aws             # User asked for AWS specifically
  accelerators: H100:8
```

Cluster Lifecycle

```bash
# Launch and run a task
sky launch -c mycluster task.yaml

# Launch with autostop at launch time (preferred: saves cost, no follow-up command needed)
sky launch -c mycluster task.yaml -i 30         # stop after 30 min idle
sky launch -c mycluster task.yaml -i 30 --down  # tear down after 30 min idle

# Override or pass environment variables via CLI
sky launch -c mycluster task.yaml --env MODEL_NAME=llama3 --env BATCH_SIZE=64

# Re-run a different task on the same cluster (fast, skips provisioning)
sky exec mycluster another_task.yaml

# Run an inline command
sky exec mycluster -- python train.py --epochs 10

# Set autostop after launch (use if you forgot to set -i at launch time)
sky autostop mycluster -i 30         # stop after 30 min idle, preserving disk (can restart with sky start)
sky autostop mycluster -i 30 --down  # tear down after 30 min idle (disk is deleted, cannot restart)

# Stop to save costs, restart later
sky stop mycluster
sky start mycluster

# Tear down completely
sky down mycluster
```

Workdir Sync Behavior

`workdir:` is synced to `~/sky_workdir` on the remote via `rsync` before every `sky exec`. rsync is additive: deleted local files are NOT removed from the remote. This can cause experiments to run against stale build artifacts or old configs.

To ensure a clean slate, SSH and wipe before `sky exec`:
```bash
ssh mycluster "rm -rf ~/sky_workdir"
sky exec mycluster task.yaml
```
Or clean inside `run:` if only specific artifacts need removal:
```yaml
run: |
  find ~/sky_workdir/build -name '*.o' -delete 2>/dev/null || true
  cd ~/sky_workdir && make
```
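
Before wiping the whole directory, it can help to see which remote files are actually stale. The sketch below compares two plain-text listings (one relative path per line) and prints the paths that exist only on the remote. The function name and the listing files are hypothetical, and how the listings are produced is left to the caller.

```bash
# Print paths that appear in the remote listing ($2) but not in the local
# listing ($1). Each file holds one relative path per line.
stale_remote_paths() {
  # -F fixed strings, -x whole-line match, -v invert, -f patterns from file
  grep -vxFf "$1" "$2"
}

# Usage (not executed here; local.txt and remote.txt are hypothetical names):
#   find . -type f > local.txt
#   ssh mycluster 'cd ~/sky_workdir && find . -type f' > remote.txt
#   stale_remote_paths local.txt remote.txt
```

Note that `grep` exits with status 1 when it prints nothing, so a non-zero status here can simply mean there are no stale paths.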

Managed Jobs

Use `sky jobs launch` for long-running jobs that should run unattended. SkyPilot manages the full lifecycle (provisioning, execution, recovery from preemptions/quota/failures, and teardown):
```yaml
# managed-job.yaml
name: training-job

resources:
  accelerators: A100:8

run: |
  python train.py --resume-from-checkpoint
```

```bash
# Launch as managed job
sky jobs launch managed-job.yaml

# Check status
sky jobs queue -o json

# Stream logs
sky jobs logs <job_id>

# Cancel
sky jobs cancel <job_id>
```

**Checkpoint pattern**: Your training script should save checkpoints to persistent storage (cloud bucket or volume) and resume from the latest checkpoint on restart. SkyPilot handles the cluster recovery; your script handles the state recovery.
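
The state-recovery half of the checkpoint pattern usually starts with finding the newest checkpoint on the persistent mount. Here is a minimal sketch of that lookup, assuming checkpoints are named `ckpt-<step>.pt`. The naming scheme and the function name are hypothetical, not something SkyPilot prescribes.

```bash
# Print the checkpoint with the highest step number in directory $1,
# or print nothing if no checkpoint exists.
# Assumes files are named ckpt-<step>.pt (a hypothetical convention).
latest_checkpoint() {
  best=""
  best_step=-1
  for f in "$1"/ckpt-*.pt; do
    [ -e "$f" ] || continue                  # glob matched nothing
    step=${f##*/ckpt-}                       # strip directory and prefix
    step=${step%.pt}                         # strip suffix
    case "$step" in
      ''|*[!0-9]*) continue ;;               # skip non-numeric steps
    esac
    if [ "$step" -gt "$best_step" ]; then
      best=$f
      best_step=$step
    fi
  done
  if [ -n "$best" ]; then
    echo "$best"
  fi
}
```

A `run:` section could call this against the mounted bucket path and start from scratch when the result is empty. The comparison is numeric, so `ckpt-100.pt` correctly wins over `ckpt-20.pt`.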

SkyServe: Model Serving

```yaml
# serve.yaml
resources:
  accelerators: A100:1
  ports: 8080

run: |
  python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --port 8080

service:
  readiness_probe: /v1/models
  replica_policy:
    min_replicas: 1
    max_replicas: 3
    target_qps_per_replica: 5
```

```bash
# Start service
sky serve up serve.yaml -n my-llm

# Check status / get endpoint
sky serve status my-llm
sky serve status my-llm --endpoint

# Update (rolling)
sky serve update my-llm new-serve.yaml

# Tear down
sky serve down my-llm
```
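
Once the service is up, a quick check is to request the readiness path on the service endpoint. The sketch below only builds the URL. It assumes `sky serve status NAME --endpoint` prints either `host:port` or a full `http(s)://` URL, which is an assumption about the output format, and the function name is hypothetical.

```bash
# Build a probe URL from a service endpoint ($1) and a path ($2,
# defaulting to the readiness_probe path used in serve.yaml above).
readiness_url() {
  endpoint="${1%/}"                 # drop a trailing slash if present
  path="${2:-/v1/models}"
  case "$endpoint" in
    http://*|https://*) echo "$endpoint$path" ;;
    *)                  echo "http://$endpoint$path" ;;   # assumed plain HTTP
  esac
}

# Usage (not executed here):
#   curl "$(readiness_url "$(sky serve status my-llm --endpoint)")"
```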

Common Workflows

Fine-Tuning Workflow

  1. Write task YAML with `setup` (install deps) and `run` (training command)
  2. Use `file_mounts` or `workdir` to sync code
  3. `sky launch -c train task.yaml` to launch
  4. `sky logs train` to monitor
  5. `sky exec train -- python eval.py` to evaluate on same cluster
  6. `sky down train` when done

Hyperparameter Sweep

  1. Create parameterized YAML with `envs`
  2. Launch multiple managed jobs:
    ```bash
    for lr in 1e-4 1e-5 1e-6; do
      sky jobs launch sweep.yaml --env LR=$lr --name sweep-lr-$lr
    done
    ```
  3. Monitor with `sky jobs queue -o json`
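
The same loop extends to a grid over more than one parameter. The sketch below prints the commands instead of running them, so the sweep can be reviewed first. The second variable `BS`, its values, and the `sweep-lr-*-bs-*` naming are hypothetical, and `sweep.yaml` is assumed to declare both `LR` and `BS` under `envs`.

```bash
# Print one `sky jobs launch` command per (learning rate, batch size) pair.
# Nothing is launched; the output is meant to be reviewed before running.
print_sweep_commands() {
  for lr in 1e-4 1e-5; do
    for bs in 16 32; do
      echo "sky jobs launch sweep.yaml --env LR=$lr --env BS=$bs --name sweep-lr-$lr-bs-$bs"
    done
  done
}

# Usage (not executed here): review the output, then pipe it to a shell.
#   print_sweep_commands
#   print_sweep_commands | sh
```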

Model Serving Deployment

  1. Write serve YAML with `service:` section
  2. `sky serve up serve.yaml -n my-service`
  3. Get endpoint: `sky serve status my-service --endpoint`
  4. Update model: `sky serve update my-service updated.yaml`

Parallel Experiment Submission

Use `sky exec -d` to submit jobs to multiple VMs without blocking, then collect results:
```bash
# Submit all experiments (detached, returns after job is queued)
for i in 1 2 3 4; do
  sky exec exp-vm-0$i task.yaml --env LR=1e-$i -d
done

# Get the latest job ID from a cluster
job_id=$(sky queue exp-vm-01 -o json \
  | python3 -c "import sys, json; jobs = json.load(sys.stdin).get('exp-vm-01', []); print(max(j['job_id'] for j in jobs) if jobs else '')")

# Wait for a specific job and fetch last 50 lines
sky logs exp-vm-01 $job_id --status && sky logs exp-vm-01 $job_id --tail 50

# Check all jobs across a cluster at once
sky queue exp-vm-01 -o json
```

Agent Feedback Loop

When using SkyPilot programmatically, follow this loop:
  1. Validate: `sky launch --dryrun task.yaml` (check resource availability/cost)
  2. Launch: `sky launch -c mycluster task.yaml`
  3. Monitor: `sky status -o json` and `sky queue mycluster -o json`
  4. Wait for completion: `sky logs mycluster <JOB_ID>` (streams logs so you can observe progress and react to stalls; blocks until job finishes; get JOB_ID from `sky queue mycluster -o json`). For long-running jobs where you don't need intermediate output, use `sky logs mycluster <JOB_ID> --status` instead (blocks silently, exits 0 on success).
  5. Inspect output: `sky logs mycluster <JOB_ID> --no-follow` or `sky logs mycluster <JOB_ID> --tail 100`
  6. Debug: `ssh mycluster` (interactive)
  7. Iterate: `sky exec mycluster updated_task.yaml` (run on existing cluster)
  8. Cleanup: `sky down mycluster`
Never poll with `sleep` + `sky queue`. Use `sky logs CLUSTER JOB_ID` to stream logs and block until done. Use `--status` if you only need the exit code, or `--tail N` to fetch recent output after completion.
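
Steps 4 and 5 of the loop can be combined into one helper that blocks on `--status` and then prints the tail of the log regardless of outcome. This is a sketch using only the commands shown above. The function name and its default of 50 lines are hypothetical.

```bash
# Block until job $2 on cluster $1 finishes, print the last $3 log lines
# (default 50), and return the exit code reported by `--status`.
wait_and_tail() {
  cluster="$1"
  job_id="$2"
  lines="${3:-50}"
  sky logs "$cluster" "$job_id" --status
  rc=$?                                        # 0=succeeded, 100=failed, ...
  sky logs "$cluster" "$job_id" --tail "$lines"
  return "$rc"
}

# Usage (not executed here):
#   wait_and_tail mycluster 3 100 || echo "job did not succeed"
```

Capturing the status before fetching the tail matters, since the second `sky logs` call would otherwise overwrite `$?`.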

Common Agent Mistakes

| Mistake | Why it's wrong | Do this instead |
|---|---|---|
| Manually picking cloud/region from `sky gpus list` output | SkyPilot optimizer does this automatically and better | Just set `accelerators:` and let SkyPilot choose |
| Using `sky launch` for long-running unattended jobs | No recovery if preempted or interrupted | Use `sky jobs launch` for unattended work |
| Forgetting `sky down` or autostop after work is done | Wastes money on idle clusters | Always clean up, or use `-i <minutes> --down` at launch |
| Hardcoding `infra: aws` without user asking | Limits availability and increases cost | Only set `infra:` when user explicitly requests a cloud |
| Not using `envs:` for configurable values | Hard to reuse or override from CLI | Use `envs:` in YAML + `--env KEY=VAL` for parameterization |
| Running `sky launch` without `-c <name>` | Creates randomly-named cluster, hard to reference | Always name clusters with `-c` |
| Parsing table output from status commands | Table formatting is for humans, fragile to parse | Use `-o json` for structured output |
| Using deprecated `cloud:` / `region:` / `zone:` fields | Deprecated in favor of `infra:` | Use `infra: aws/us-east-1` instead |
| Polling job status with `sleep` + `sky queue` | Wastes tokens, introduces timing bugs, fragile | Use `sky logs CLUSTER JOB_ID --status` to block until done |
| Assuming workdir sync removes remote files | rsync is additive; old remote files persist across `sky exec` calls | SSH and manually clean `~/sky_workdir`, or clean in `run:` script |
| Not using `--tail` when only last output matters | Streaming full logs wastes tokens for long jobs | Use `sky logs CLUSTER JOB_ID --tail 50` for last N lines |

Common Issues Quick Reference

| Issue | Solution |
|---|---|
| GPU not available | Use `any_of` for fallback, or try different regions/clouds |
| Setup takes too long | SkyPilot caches setup; use `sky exec` to skip it on reruns |
| Task fails silently | Check `sky logs <cluster>` or `ssh <cluster>` to debug |
| Cluster stuck in INIT | `sky down <cluster>` and relaunch |
| Preemption/quota | Use `sky jobs launch` for automatic recovery and lifecycle management |
| Port not accessible | Ensure `ports:` is set in resources and security groups allow traffic |
| File sync slow | Use cloud bucket mounts instead of `workdir` for large datasets |
| Credentials error | Run `sky check -o json` and inspect which clouds are disabled |

References

For detailed reference documentation:
  • CLI Reference — All commands and flags
  • YAML Specification — Complete task YAML schema, file mounts, environment variables
  • Python SDK — Programmatic API and SDK usage
  • Advanced Patterns — Multi-cloud, distributed training, production patterns
  • Troubleshooting — Error diagnosis and solutions
  • Examples — Copy-paste task YAML examples