SkyPilot Skill

SkyPilot is a unified framework for running AI workloads on any cloud, Slurm, or Kubernetes. It provides a single interface to launch clusters, run jobs, and serve models across 25+ clouds (AWS, GCP, Azure, CoreWeave, Nebius, Lambda, Together AI, RunPod, and more), Kubernetes clusters, and Slurm clusters.

When to Use SkyPilot

Use SkyPilot when you need to:

Manage compute resources on any cloud, Slurm, or Kubernetes cluster
Launch CPU/GPU/TPU instances (GB300, GB200, B200, H200, H100, etc.) on any cloud, Kubernetes, or Slurm
Run training, fine-tuning, or batch inference jobs
Serve models with autoscaling and multi-cloud replicas (SkyServe)
Run long-running jobs with automatic lifecycle management and recovery (managed jobs)
Find the cheapest or most available GPU across clouds

Don't use SkyPilot for:

Local-only workloads (use Docker/conda directly)

Capabilities: When to Use What

SkyPilot has three core abstractions. Use the right one for each stage of your workflow:

1. SkyPilot Clusters (`sky launch`, `sky exec`): Interactive development and debugging

Use during initial development, debugging, and experimentation
Launch a cluster, SSH in or connect VSCode/Cursor (`code --remote ssh-remote+CLUSTER`), and iterate quickly
The cluster stays up until you stop it or tear it down, or until autostop triggers
Best for: prototyping, debugging, short experiments

2. Managed Jobs (`sky jobs launch`): Long-running training and batch jobs

Use when submitting long-running jobs that should run unattended
Manages the full lifecycle: provisioning, execution, recovery, and teardown
Automatically recovers from spot preemptions, quota limits, and transient failures
Works across clouds, Kubernetes, and Slurm (handles preemptions and quota)
Best for: training runs, fine-tuning, hyperparameter sweeps, batch inference

3. SkyServe (`sky serve up`): Production model serving

Use when serving models at scale with autoscaling
Start with `sky launch` plus an open port to test your serving setup, then use `sky serve up` to scale
Provides load balancing, autoscaling, and multi-cloud replicas
Best for: model serving endpoints, API services

Before You Start (Agent Bootstrap)

Bootstrap to confirm SkyPilot is installed, connected to an API server, and has cloud credentials. Once confirmed, skip straight to the user's task.

Step 1: Check installation and API server connectivity

```bash
sky api info
```

Output contains	Meaning	Next action
Server version and status	Server is running and connected	Bootstrap done. Skip to user's task.
`No SkyPilot API server is connected`	No server connected	Go to "Start or connect a server" below.
`Could not connect to SkyPilot API server`	Remote server unreachable or auth expired	Tell the user and suggest `sky api login --relogin -e <endpoint>` to reconnect.
`command not found: sky`	SkyPilot not installed	Go to "Install SkyPilot" below.

Install SkyPilot (only if the `sky` command is not found):

```bash
pip install "skypilot[aws,gcp,kubernetes]"  # Pick clouds the user needs
```

Ask the user which clouds they need if unclear, then re-run `sky api info`.

Start or connect a server (only if `sky api info` reports that no API server is connected):

Ask the user:

Do you have an existing SkyPilot API server to connect to, or should I start one locally?

Connect to existing server: `sky api login -e <API_SERVER_URL>` (get the URL from the user).
Start locally: `sky api start`

After either path, re-run `sky api info` to confirm the server is reachable.

Step 2: Check cloud credentials (only for fresh setups — skip if the server was already running)

```bash
sky check -o json
```

This shows which clouds are enabled or disabled. If the user's target cloud is not enabled, guide them through credential setup (see Troubleshooting).
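
The Step 1 decision table can be sketched as a small classifier. This is a minimal illustration, not part of SkyPilot: the `NextAction` enum and `classify_api_info` function are hypothetical names, the substrings are copied from the table, and treating any output without a known error marker as a healthy server is an assumption.

```python
from enum import Enum


class NextAction(Enum):
    """Hypothetical labels for the 'Next action' column of the Step 1 table."""
    DONE = "bootstrap done; go to the user's task"
    START_OR_CONNECT = "start a local server or connect to an existing one"
    RELOGIN = "suggest `sky api login --relogin -e <endpoint>`"
    INSTALL = "install SkyPilot, then re-run `sky api info`"


# Substrings taken from the Step 1 table, checked in order.
_MARKERS = [
    ("command not found: sky", NextAction.INSTALL),
    ("No SkyPilot API server is connected", NextAction.START_OR_CONNECT),
    ("Could not connect to SkyPilot API server", NextAction.RELOGIN),
]


def classify_api_info(output: str) -> NextAction:
    """Map the combined stdout/stderr of `sky api info` to a next action."""
    for marker, action in _MARKERS:
        if marker in output:
            return action
    # Assumption: no known error marker means the server reported its
    # version and status, so bootstrap is complete.
    return NextAction.DONE
```

An agent would capture the output of `sky api info` (stdout and stderr together) and pass it to `classify_api_info` before deciding whether to install, connect, or proceed.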

Essential Commands

Use `-o json` with status/query commands to get structured JSON output instead of tables.

Clusters — interactive development and debugging:

Command	Description
`sky launch -c NAME task.yaml`	Launch a cluster or run a task
`sky exec NAME task.yaml`	Run task on existing cluster (skips provisioning); syncs workdir each time
`sky exec NAME task.yaml -d`	Same, but detach immediately (don't stream logs)
`sky status -o json`	Show all clusters
`sky logs NAME`	Stream job logs from a cluster
`sky logs NAME --no-follow`	Print existing logs and exit immediately
`sky logs NAME --tail 50`	Print last 50 lines of logs and exit
`sky logs NAME --status`	Exit with code 0=succeeded, 100=failed, 101=not finished, 102=not found, 103=cancelled
`sky queue NAME -o json`	List jobs on a cluster with status (structured JSON)
`sky stop NAME` / `sky start NAME`	Stop/restart to save costs (preserves disk)
`sky down NAME`	Tear down a cluster completely
`sky gpus list -o json`	List available GPU types across clouds

Managed Jobs — long-running unattended workloads:

Command	Description
`sky jobs launch task.yaml`	Launch a managed job (auto lifecycle + recovery)
`sky jobs queue -o json`	Show all managed jobs and their status
`sky jobs logs JOB_ID`	Stream logs from a managed job
`sky jobs cancel JOB_ID`	Cancel a managed job

SkyServe — model serving with autoscaling:

Command	Description
`sky serve up serve.yaml -n NAME`	Start a model serving service
`sky serve status NAME`	Show service status and endpoint URL
`sky serve update NAME new.yaml`	Update a running service (rolling)
`sky serve down NAME`	Tear down a service

For complete CLI reference, see CLI Reference.
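
The exit codes listed for `sky logs NAME --status` in the Clusters table lend themselves to a lookup. The sketch below is illustrative only: `STATUS_BY_EXIT_CODE`, `describe_exit_code`, and `job_status` are hypothetical helper names, and the codes are exactly those given in the table.

```python
import subprocess

# Exit codes of `sky logs NAME --status`, as listed in the Clusters table.
STATUS_BY_EXIT_CODE = {
    0: "succeeded",
    100: "failed",
    101: "not finished",
    102: "not found",
    103: "cancelled",
}


def describe_exit_code(code: int) -> str:
    """Translate an exit code into a job status label."""
    return STATUS_BY_EXIT_CODE.get(code, f"unknown exit code {code}")


def job_status(cluster: str, job_id: int) -> str:
    """Run `sky logs CLUSTER JOB_ID --status` and return the status label.

    Requires the `sky` CLI on PATH; blocks until the command returns.
    """
    result = subprocess.run(
        ["sky", "logs", cluster, str(job_id), "--status"],
        capture_output=True,
        text=True,
    )
    return describe_exit_code(result.returncode)
```

Keeping the mapping in one place lets calling code branch on a label rather than on bare numbers.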

Quick Start

```bash
# Launch a GPU cluster
sky launch -c mycluster --gpus H100 -- nvidia-smi

# Run a task from YAML
sky launch -c mycluster task.yaml

# SSH into cluster
ssh mycluster

# Connect VSCode or Cursor to the cluster for interactive development
code --remote ssh-remote+mycluster /home/user/sky_workdir
# or: cursor --remote ssh-remote+mycluster /home/user/sky_workdir

# Tear down
sky down mycluster
```

Task YAML Structure

The task YAML is SkyPilot's primary interface. All fields are optional.

```yaml
# task.yaml

# Local directory to sync to remote ~/sky_workdir
workdir: .

# Number of nodes (for distributed training)
num_nodes: 1

resources:
  # GPU/TPU accelerators (SkyPilot auto-selects the cheapest cloud/region)
  accelerators: H200:8
  # Optional: pin to a specific cloud/region/infra
  infra: aws  # or aws/us-east-1, k8s, ssh/my-pool
  # If infra is left out, SkyPilot automatically fails over across all
  # enabled clouds/regions to find the cheapest available option.
  # Use spot instances for cost savings
  use_spot: false
  # Disk size in GB
  disk_size: 256
  # Open ports for serving
  ports: 8080

# Environment variables (accessible in file_mounts, setup, and run)
envs:
  MODEL_NAME: my-model
  BATCH_SIZE: 32

# Setup: runs once on cluster creation, cached on reuse
setup: |
  pip install torch transformers

# Run: the main command
run: |
  python train.py --model $MODEL_NAME --batch-size $BATCH_SIZE
```


For complete YAML schema including file mounts, environment variables set by SkyPilot, and advanced fields, see [YAML Specification](references/yaml-spec.md).
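
The relationship between `envs:` defaults in the task YAML and `--env KEY=VAL` values passed on the command line can be pictured as a dictionary merge. The sketch below is only a mental model, not SkyPilot's implementation: `parse_env_flags` and `effective_envs` are hypothetical helpers, and it assumes CLI values take precedence over YAML defaults and that all values reach the task as strings.

```python
def parse_env_flags(flags: list[str]) -> dict[str, str]:
    """Parse KEY=VAL strings in the form they would be passed to `--env`."""
    parsed = {}
    for flag in flags:
        key, sep, value = flag.partition("=")
        if not sep or not key:
            raise ValueError(f"expected KEY=VAL, got {flag!r}")
        parsed[key] = value
    return parsed


def effective_envs(yaml_envs: dict, cli_flags: list[str]) -> dict[str, str]:
    """Merge YAML defaults with CLI overrides (assumed: CLI wins)."""
    merged = {key: str(value) for key, value in yaml_envs.items()}
    merged.update(parse_env_flags(cli_flags))
    return merged
```

For example, with the YAML defaults `MODEL_NAME: my-model` and `BATCH_SIZE: 32`, passing `MODEL_NAME=llama3` and `BATCH_SIZE=64` would leave the task seeing `llama3` and `64` under this model.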

GPU and Cloud Selection

IMPORTANT: Let SkyPilot choose the cloud and region. Do NOT manually pick a cloud/region/instance by parsing `sky gpus list` output. SkyPilot's optimizer automatically selects the cheapest available option across all enabled clouds. Only specify `infra:` when the user explicitly requests a specific cloud or region.

Default behavior (recommended): Just specify the GPU type. SkyPilot finds the cheapest cloud/region automatically:

```yaml
resources:
  accelerators: H200:8  # SkyPilot picks the cheapest cloud/region with H200:8
```

If the user doesn't specify a GPU type, ask them what GPU they need (or what model/workload they're running so you can recommend one). Do NOT run `sky gpus list` and pick for them. Present options and let the user decide, or use `any_of` to let SkyPilot maximize availability:

```yaml
# Let SkyPilot choose from multiple acceptable GPU types (cheapest wins)
resources:
  any_of:
    - accelerators: H100:8
    - accelerators: A100-80GB:8
    - accelerators: A100:8
```

Use `ordered` only when the user has a strict preference:

```yaml
# Try H100 on AWS first, fall back to H100 on GCP, then A100-80GB on AWS
resources:
  ordered:
    - infra: aws/us-east-1
      accelerators: H100:8
    - infra: gcp/us-central1
      accelerators: H100:8
    - infra: aws/us-west-2
      accelerators: A100-80GB:8
```

Only set `infra:` when the user explicitly says something like "use AWS" or "run on GCP us-central1":

```yaml
resources:
  infra: aws             # User asked for AWS specifically
  accelerators: H100:8
```

Cluster Lifecycle

```bash
# Launch and run a task
sky launch -c mycluster task.yaml

# Launch with autostop at launch time (preferred: saves cost, no follow-up command needed)
sky launch -c mycluster task.yaml -i 30          # stop after 30 min idle
sky launch -c mycluster task.yaml -i 30 --down   # tear down after 30 min idle

# Override or pass environment variables via CLI
sky launch -c mycluster task.yaml --env MODEL_NAME=llama3 --env BATCH_SIZE=64

# Re-run a different task on the same cluster (fast, skips provisioning)
sky exec mycluster another_task.yaml

# Run an inline command
sky exec mycluster -- python train.py --epochs 10

# Set autostop after launch (use if you forgot to set -i at launch time)
sky autostop mycluster -i 30          # stop after 30 min idle, preserving disk (can restart with sky start)
sky autostop mycluster -i 30 --down   # tear down after 30 min idle (disk is deleted, cannot restart)

# Stop to save costs, restart later
sky stop mycluster
sky start mycluster

# Tear down completely
sky down mycluster
```

Workdir Sync Behavior

`workdir:` is synced to `~/sky_workdir` on the remote via `rsync` before every `sky exec`. rsync is additive: deleted local files are NOT removed from the remote. This can cause experiments to run against stale build artifacts or old configs.

To ensure a clean slate, SSH in and wipe the directory before `sky exec`:

```bash
ssh mycluster "rm -rf ~/sky_workdir"
sky exec mycluster task.yaml
```

Or clean inside `run:` if only specific artifacts need removal:

```yaml
run: |
  find ~/sky_workdir/build -name '*.o' -delete 2>/dev/null || true
  cd ~/sky_workdir && make
```

Managed Jobs

Use `sky jobs launch` for long-running jobs that should run unattended. SkyPilot manages the full lifecycle: provisioning, execution, recovery from preemptions/quota/failures, and teardown.

```yaml
# managed-job.yaml
resources:
  accelerators: A100:8

run: |
  python train.py --resume-from-checkpoint
```

```bash
# Launch as managed job
sky jobs launch managed-job.yaml

# Check status
sky jobs queue -o json

# Stream logs
sky jobs logs <job_id>

# Cancel
sky jobs cancel <job_id>
```


**Checkpoint pattern**: Your training script should save checkpoints to persistent storage (cloud bucket or volume) and resume from the latest checkpoint on restart. SkyPilot handles the cluster recovery; your script handles the state recovery.
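
The script side of the checkpoint pattern might look like the following minimal sketch. Everything in it is illustrative: the `step_<N>.json` naming scheme, the helper names, and the toy "loss" update are assumptions standing in for a real training loop, and `ckpt_dir` is assumed to point at persistent storage (a mounted bucket or volume) that survives cluster recovery.

```python
import json
import re
from pathlib import Path

CKPT_PATTERN = re.compile(r"^step_(\d+)\.json$")  # hypothetical naming scheme


def latest_checkpoint(ckpt_dir: Path):
    """Return (step, path) of the newest checkpoint, or None if there is none."""
    found = []
    for path in ckpt_dir.glob("step_*.json"):
        match = CKPT_PATTERN.match(path.name)
        if match:
            found.append((int(match.group(1)), path))
    return max(found, default=None)


def save_checkpoint(ckpt_dir: Path, step: int, state: dict) -> Path:
    """Write to a temp file, then rename, so a partial write never
    appears under the final checkpoint name."""
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    tmp = ckpt_dir / f"step_{step}.json.tmp"
    tmp.write_text(json.dumps(state))
    final = ckpt_dir / f"step_{step}.json"
    tmp.replace(final)
    return final


def train(ckpt_dir: Path, total_steps: int, save_every: int = 2) -> dict:
    """Toy training loop that resumes from the latest checkpoint if present."""
    resume = latest_checkpoint(ckpt_dir)
    if resume is None:
        step, state = 0, {"loss": 1.0}
    else:
        step = resume[0]
        state = json.loads(resume[1].read_text())
    while step < total_steps:
        step += 1
        state["loss"] *= 0.9  # stand-in for a real optimizer step
        if step % save_every == 0 or step == total_steps:
            save_checkpoint(ckpt_dir, step, state)
    return {"step": step, **state}
```

Calling `train` again after an interruption picks up from the highest saved step rather than from zero. The write-then-rename step is a local-filesystem habit; whether a rename is atomic on a given bucket mount is not something this sketch guarantees.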

SkyServe: Model Serving

```yaml
# serve.yaml
resources:
  accelerators: A100:1
  ports: 8080

run: |
  python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --port 8080

service:
  readiness_probe: /v1/models
  replica_policy:
    min_replicas: 1
    max_replicas: 3
    target_qps_per_replica: 5
```

```bash
# Start service
sky serve up serve.yaml -n my-llm

# Check status / get endpoint
sky serve status my-llm
sky serve status my-llm --endpoint

# Update (rolling)
sky serve update my-llm new-serve.yaml

# Tear down
sky serve down my-llm
```

Common Workflows

Fine-Tuning Workflow

1. Write task YAML with `setup` (install deps) and `run` (training command)
2. Use `file_mounts` or `workdir` to sync code
3. `sky launch -c train task.yaml` to launch
4. `sky logs train` to monitor
5. `sky exec train -- python eval.py` to evaluate on the same cluster
6. `sky down train` when done

Hyperparameter Sweep

1. Create parameterized YAML with `envs`
2. Launch multiple managed jobs:

```bash
for lr in 1e-4 1e-5 1e-6; do
  sky jobs launch sweep.yaml --env LR=$lr --name sweep-lr-$lr
done
```

3. Monitor with `sky jobs queue -o json`

Model Serving Deployment

1. Write serve YAML with a `service:` section
2. `sky serve up serve.yaml -n my-service`
3. Get endpoint: `sky serve status my-service --endpoint`
4. Update model: `sky serve update my-service updated.yaml`

Parallel Experiment Submission

Use `sky exec -d` to submit jobs to multiple VMs without blocking, then collect results:

```bash
# Submit all experiments (detached, returns after job is queued)
for i in 1 2 3 4; do
  sky exec exp-vm-0$i task.yaml --env LR=1e-$i -d
done

# Get the latest job ID from a cluster
job_id=$(sky queue exp-vm-01 -o json \
  | python3 -c "import sys, json; jobs = json.load(sys.stdin).get('exp-vm-01', []); print(max(j['job_id'] for j in jobs) if jobs else '')")

# Wait for a specific job and fetch last 50 lines
sky logs exp-vm-01 $job_id --status && sky logs exp-vm-01 $job_id --tail 50

# Check all jobs across a cluster at once
sky queue exp-vm-01 -o json
```

Agent Feedback Loop

When using SkyPilot programmatically, follow this loop:

1. Validate: `sky launch --dryrun task.yaml` (check resource availability/cost)
2. Launch: `sky launch -c mycluster task.yaml`
3. Monitor: `sky status -o json` and `sky queue mycluster -o json`
4. Wait for completion: `sky logs mycluster <JOB_ID>` (streams logs so you can observe progress and react to stalls; blocks until the job finishes; get JOB_ID from `sky queue mycluster -o json`). For long-running jobs where you don't need intermediate output, use `sky logs mycluster <JOB_ID> --status` instead (blocks silently, exits 0 on success).
5. Inspect output: `sky logs mycluster <JOB_ID> --no-follow` or `sky logs mycluster <JOB_ID> --tail 100`
6. Debug: `ssh mycluster` (interactive)
7. Iterate: `sky exec mycluster updated_task.yaml` (run on existing cluster)
8. Cleanup: `sky down mycluster`

Never poll with `sleep` + `sky queue`. Use `sky logs CLUSTER JOB_ID` to stream logs and block until done. Use `--status` if you only need the exit code, or `--tail N` to fetch recent output after completion.

Common Agent Mistakes

Mistake	Why it's wrong	Do this instead
Manually picking cloud/region from `sky gpus list` output	SkyPilot optimizer does this automatically and better	Just set `accelerators:` and let SkyPilot choose
Using `sky launch` for long-running unattended jobs	No recovery if preempted or interrupted	Use `sky jobs launch` for unattended work
Forgetting `sky down` or autostop after work is done	Wastes money on idle clusters	Always clean up, or use `-i <minutes> --down` at launch
Hardcoding `infra: aws` without user asking	Limits availability and increases cost	Only set `infra:` when user explicitly requests a cloud
Not using `envs:` for configurable values	Hard to reuse or override from CLI	Use `envs:` in YAML + `--env KEY=VAL` for parameterization
Running `sky launch` without `-c <name>`	Creates randomly-named cluster, hard to reference	Always name clusters with `-c`
Parsing table output from status commands	Table formatting is for humans, fragile to parse	Use `-o json` for structured output
Using deprecated `cloud:` / `region:` / `zone:` fields	Deprecated in favor of `infra:`	Use `infra: aws/us-east-1` instead
Polling job status with `sleep` + `sky queue`	Wastes tokens, introduces timing bugs, fragile	Use `sky logs CLUSTER JOB_ID --status` to block until done
Assuming workdir sync removes remote files	rsync is additive; old remote files persist across `sky exec` calls	SSH in and manually clean `~/sky_workdir`, or clean in the `run:` script
Not using `--tail` when only last output matters	Streaming full logs wastes tokens for long jobs	Use `sky logs CLUSTER JOB_ID --tail 50` for last N lines
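
For the deprecated `cloud:` / `region:` / `zone:` row, the migration to `infra:` amounts to joining the parts with slashes, as in `infra: aws/us-east-1`. The helper below is a hypothetical sketch: `to_infra` is not a SkyPilot function, and appending the zone as a third path segment is an assumption that goes beyond the `cloud/region` form shown in this document.

```python
from typing import Optional


def to_infra(cloud: str, region: Optional[str] = None, zone: Optional[str] = None) -> str:
    """Build an `infra:` value from legacy cloud/region/zone fields.

    The cloud/region form matches the document's `aws/us-east-1` example;
    the trailing zone segment is an assumption.
    """
    if zone and not region:
        raise ValueError("a zone requires a region")
    parts = [cloud] + [part for part in (region, zone) if part]
    return "/".join(parts)
```

A task that previously set `cloud: aws` and `region: us-east-1` would then use `infra: aws/us-east-1`, and only when the user asked for that cloud and region explicitly.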

Common Issues Quick Reference

Issue	Solution
GPU not available	Use `any_of` for fallback, or try different regions/clouds
Setup takes too long	SkyPilot caches setup; use `sky exec` to skip it on reruns
Task fails silently	Check `sky logs <cluster>` or `ssh <cluster>` to debug
Cluster stuck in INIT	`sky down <cluster>` and relaunch
Preemption/quota	Use `sky jobs launch` for automatic recovery and lifecycle management
Port not accessible	Ensure `ports:` is set in resources and security groups allow traffic
File sync slow	Use cloud bucket mounts instead of `workdir` for large datasets
Credentials error	Run `sky check -o json` and inspect which clouds are disabled

References

For detailed reference documentation:

CLI Reference — All commands and flags
YAML Specification — Complete task YAML schema, file mounts, environment variables
Python SDK — Programmatic API and SDK usage
Advanced Patterns — Multi-cloud, distributed training, production patterns
Troubleshooting — Error diagnosis and solutions
Examples — Copy-paste task YAML examples
