skypilot
SkyPilot Skill
SkyPilot is a unified framework to run AI workloads on any cloud, Slurm or Kubernetes. It provides a single interface to launch clusters, run jobs, and serve models across 25+ clouds (AWS, GCP, Azure, Coreweave, Nebius, Lambda, Together AI, RunPod, and more), Kubernetes clusters, and Slurm clusters.
When to Use SkyPilot
Use SkyPilot when you need to:
- Manage compute resources on any cloud, Slurm, or Kubernetes cluster
- Launch CPU, GPU, or TPU instances (GB300, GB200, B200, H200, H100, etc.) on any cloud, Kubernetes, or Slurm
- Run training, fine-tuning, or batch inference jobs
- Serve models with autoscaling and multi-cloud replicas (SkyServe)
- Run long-running jobs with automatic lifecycle management and recovery (managed jobs)
- Find the cheapest or most available GPU across clouds
Don't use SkyPilot for:
- Local-only workloads (use Docker/conda directly)
Capabilities: When to Use What
SkyPilot has three core abstractions. Use the right one for each stage of your workflow:
1. SkyPilot Clusters (`sky launch` / `sky exec`): Interactive development and debugging
- Use during initial development, debugging, and experimentation
- Launch a cluster, SSH in or connect VSCode/Cursor (`code --remote ssh-remote+CLUSTER`), and iterate quickly
- Cluster stays up until you stop/down it or autostop triggers
- Best for: prototyping, debugging, short experiments
2. Managed Jobs (`sky jobs launch`): Long-running training and batch jobs
- Use when submitting long-running jobs that should run unattended
- Manages the full lifecycle: provisioning, execution, recovery, and teardown
- Automatically recovers from spot preemptions, quota limits, and transient failures
- Works across clouds, Kubernetes, and Slurm (handles preemptions and quota)
- Best for: training runs, fine-tuning, hyperparameter sweeps, batch inference
3. SkyServe (`sky serve up`): Production model serving
- Use when serving models at scale with autoscaling
- Start with `sky launch` + an open port to test your serving setup, then use `sky serve up` to scale
- Provides load balancing, autoscaling, and multi-cloud replicas
- Best for: model serving endpoints, API services
Before You Start (Agent Bootstrap)
Bootstrap to confirm SkyPilot is installed, connected to an API server, and has cloud credentials. Once confirmed, skip straight to the user's task.
Step 1: Check installation and API server connectivity
```bash
sky api info
```
| Output contains | Meaning | Next action |
|---|---|---|
| Server version and status | Server is running and connected | Bootstrap done. Skip to user's task. |
| | No server connected | Go to "Start or connect a server" below. |
| | Remote server unreachable or auth expired | Tell the user and suggest |
| | SkyPilot not installed | Go to "Install SkyPilot" below. |
Install SkyPilot (only if `sky` command not found):
```bash
pip install "skypilot[aws,gcp,kubernetes]"  # Pick clouds the user needs
```
Ask the user which clouds they need if unclear, then re-run `sky api info`.
Start or connect a server (only if "not running"):
Ask the user:
Do you have an existing SkyPilot API server to connect to, or should I start one locally?
- Connect to existing server: `sky api login -e <API_SERVER_URL>` (get the URL from the user).
- Start locally: `sky api start`
After either path, re-run `sky api info` to confirm the server is reachable.
Step 2: Check cloud credentials (only for fresh setups; skip if the server was already running)
```bash
sky check -o json
```
This shows which clouds are enabled or disabled. If the user's target cloud is not enabled, guide them through credential setup (see Troubleshooting).
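The install step above can be sketched in Python for an agent that drives the bootstrap itself. This is a minimal sketch, not part of SkyPilot: `sky_is_installed` and `pip_install_command` are hypothetical helper names, and the extras string simply mirrors the `skypilot[aws,gcp,kubernetes]` form shown above without validating the cloud names.

```python
import shutil


def sky_is_installed() -> bool:
    """Return True if a `sky` executable is found on PATH."""
    return shutil.which("sky") is not None


def pip_install_command(clouds: list[str]) -> str:
    """Format the pip command for the clouds the user asked for.

    Hypothetical helper: it only builds the extras string and does not
    check that each name is a real SkyPilot extra.
    """
    if not clouds:
        raise ValueError("ask the user which clouds they need first")
    extras = ",".join(clouds)
    return f'pip install "skypilot[{extras}]"'


if __name__ == "__main__":
    if not sky_is_installed():
        # Print the command rather than running it, so the user can confirm.
        print(pip_install_command(["aws", "gcp", "kubernetes"]))
```

Printing the command instead of executing it keeps the user in control of what gets installed.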
Essential Commands
Use `-o json` with status/query commands to get structured JSON output instead of tables.
Clusters (interactive development and debugging):
| Command | Description |
|---|---|
| `sky launch -c CLUSTER task.yaml` | Launch a cluster or run a task |
| `sky exec CLUSTER task.yaml` | Run task on existing cluster (skips provisioning); syncs workdir each time |
| `sky exec CLUSTER task.yaml -d` | Same, but detach immediately (don't stream logs) |
| `sky status` | Show all clusters |
| `sky logs CLUSTER JOB_ID` | Stream job logs from a cluster |
| `sky logs CLUSTER JOB_ID --no-follow` | Print existing logs and exit immediately |
| `sky logs CLUSTER JOB_ID --tail 50` | Print last 50 lines of logs and exit |
| `sky logs CLUSTER JOB_ID --status` | Exit with code 0=succeeded, 100=failed, 101=not finished, 102=not found, 103=cancelled |
| `sky queue CLUSTER -o json` | List jobs on a cluster with status (structured JSON) |
| `sky stop CLUSTER` / `sky start CLUSTER` | Stop/restart to save costs (preserves disk) |
| `sky down CLUSTER` | Tear down a cluster completely |
| `sky gpus list` | List available GPU types across clouds |
Managed Jobs (long-running unattended workloads):
| Command | Description |
|---|---|
| `sky jobs launch task.yaml` | Launch a managed job (auto lifecycle + recovery) |
| `sky jobs queue` | Show all managed jobs and their status |
| `sky jobs logs JOB_ID` | Stream logs from a managed job |
| `sky jobs cancel JOB_ID` | Cancel a managed job |
SkyServe (model serving with autoscaling):
| Command | Description |
|---|---|
| `sky serve up serve.yaml -n NAME` | Start a model serving service |
| `sky serve status NAME` | Show service status and endpoint URL |
| `sky serve update NAME new-serve.yaml` | Update a running service (rolling) |
| `sky serve down NAME` | Tear down a service |
For complete CLI reference, see CLI Reference.
Quick Start
```bash
# Launch a GPU cluster
sky launch -c mycluster --gpus H100 -- nvidia-smi

# Run a task from YAML
sky launch -c mycluster task.yaml

# SSH into cluster
ssh mycluster

# Connect VSCode or Cursor to the cluster for interactive development
code --remote ssh-remote+mycluster /home/user/sky_workdir
# or: cursor --remote ssh-remote+mycluster /home/user/sky_workdir

# Tear down
sky down mycluster
```
Task YAML Structure
The task YAML is SkyPilot's primary interface. All fields are optional.
```yaml
# task.yaml
name: my-training-job

# Local directory to sync to remote ~/sky_workdir
workdir: .

# Number of nodes (for distributed training)
num_nodes: 1

resources:
  # GPU/TPU accelerators (SkyPilot auto-selects the cheapest cloud/region)
  accelerators: H200:8
  # Optional: pin to a specific cloud/region/infra
  # infra: aws  # or aws/us-east-1, k8s, ssh/my-pool
  # If infra is left out, SkyPilot automatically fails over across all
  # enabled clouds/regions to find the cheapest available option.
  # Use spot instances for cost savings
  use_spot: false
  # Disk size in GB
  disk_size: 256
  # Open ports for serving
  ports: 8080

# Environment variables (accessible in file_mounts, setup, and run)
envs:
  MODEL_NAME: my-model
  BATCH_SIZE: 32

# Setup: runs once on cluster creation, cached on reuse
setup: |
  pip install torch transformers

# Run: the main command
run: |
  python train.py --model $MODEL_NAME --batch-size $BATCH_SIZE
```
For the complete YAML schema, including file mounts, environment variables set by SkyPilot, and advanced fields, see [YAML Specification](references/yaml-spec.md).
GPU and Cloud Selection
IMPORTANT: Let SkyPilot choose the cloud and region. Do NOT manually pick a cloud/region/instance by parsing `sky gpus list` output. SkyPilot's optimizer automatically selects the cheapest available option across all enabled clouds. Only specify `infra:` when the user explicitly requests a specific cloud or region.
Default behavior (recommended): Just specify the GPU type. SkyPilot finds the cheapest cloud/region automatically:
```yaml
resources:
  accelerators: H200:8  # SkyPilot picks the cheapest cloud/region with H200:8
```
If the user doesn't specify a GPU type, ask them what GPU they need (or what model/workload they're running so you can recommend one). Do NOT run `sky gpus list` and pick for them. Present options and let the user decide, or use `any_of` to let SkyPilot maximize availability:
```yaml
# Let SkyPilot choose from multiple acceptable GPU types (cheapest wins)
resources:
  any_of:
    - accelerators: H100:8
    - accelerators: A100-80GB:8
    - accelerators: A100:8
```
Use `ordered` only when the user has a strict preference:
```yaml
# Try H100 on AWS first, fall back to H100 on GCP, then A100-80GB on AWS
resources:
  ordered:
    - infra: aws/us-east-1
      accelerators: H100:8
    - infra: gcp/us-central1
      accelerators: H100:8
    - infra: aws/us-west-2
      accelerators: A100-80GB:8
```
Only set `infra:` when the user explicitly says something like "use AWS" or "run on GCP us-central1":
```yaml
resources:
  infra: aws  # User asked for AWS specifically
  accelerators: H100:8
```
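The difference between `any_of` (cheapest wins) and `ordered` (strict preference) can be illustrated with a toy model. This is only a sketch of the selection rule as described above, not SkyPilot's optimizer: `pick_any_of`, `pick_ordered`, and the `OFFERS` table are hypothetical, and the prices and availability flags are made up.

```python
from typing import Optional

# Hypothetical offers: (infra, accelerators) -> (hourly price in USD, available?)
# The numbers are invented for illustration and are not real cloud prices.
OFFERS = {
    ("aws/us-east-1", "H100:8"): (98.0, False),
    ("gcp/us-central1", "H100:8"): (88.0, True),
    ("aws/us-west-2", "A100-80GB:8"): (40.0, True),
}

Candidate = tuple[str, str]


def pick_any_of(candidates: list[Candidate]) -> Optional[Candidate]:
    """any_of: among the candidates that are available, the cheapest wins."""
    available = [c for c in candidates if OFFERS[c][1]]
    return min(available, key=lambda c: OFFERS[c][0]) if available else None


def pick_ordered(candidates: list[Candidate]) -> Optional[Candidate]:
    """ordered: take the first available candidate, ignoring price."""
    for candidate in candidates:
        if OFFERS[candidate][1]:
            return candidate
    return None


if __name__ == "__main__":
    candidates = list(OFFERS)  # same order as the `ordered` example above
    print("any_of  ->", pick_any_of(candidates))
    print("ordered ->", pick_ordered(candidates))
```

With these made-up numbers, `any_of` lands on the cheaper A100-80GB offer while `ordered` stops at the first available H100, which is why `ordered` should be reserved for users with a strict preference.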
Cluster Lifecycle
```bash
# Launch and run a task
sky launch -c mycluster task.yaml

# Launch with autostop at launch time (preferred: saves cost, no follow-up command needed)
sky launch -c mycluster task.yaml -i 30         # stop after 30 min idle
sky launch -c mycluster task.yaml -i 30 --down  # tear down after 30 min idle

# Override or pass environment variables via CLI
sky launch -c mycluster task.yaml --env MODEL_NAME=llama3 --env BATCH_SIZE=64

# Re-run a different task on the same cluster (fast, skips provisioning)
sky exec mycluster another_task.yaml

# Run an inline command
sky exec mycluster -- python train.py --epochs 10

# Set autostop after launch (use if you forgot to set -i at launch time)
sky autostop mycluster -i 30         # stop after 30 min idle, preserving disk (can restart with sky start)
sky autostop mycluster -i 30 --down  # tear down after 30 min idle (disk is deleted, cannot restart)

# Stop to save costs, restart later
sky stop mycluster
sky start mycluster

# Tear down completely
sky down mycluster
```
Workdir Sync Behavior
`workdir:` is synced to `~/sky_workdir` with `rsync` on every `sky exec`. The sync is additive: files removed locally are not deleted on the remote, so stale files persist across runs.
To ensure a clean slate, SSH in and wipe the directory before `sky exec`:
```bash
ssh mycluster "rm -rf ~/sky_workdir"
sky exec mycluster task.yaml
```
Or clean inside `run:` if only specific artifacts need removal:
```yaml
run: |
  find ~/sky_workdir/build -name '*.o' -delete 2>/dev/null || true
  cd ~/sky_workdir && make
```
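To see why stale files survive an additive sync, here is a small local simulation. It is a sketch under stated assumptions: `additive_sync` is a hypothetical helper that copies files with `shutil.copytree` and never deletes anything on the destination. It mimics the additive behavior only and is not how SkyPilot invokes `rsync`.

```python
import shutil
import tempfile
from pathlib import Path


def additive_sync(src: Path, dst: Path) -> None:
    """Copy everything from src into dst without deleting extra files in dst."""
    shutil.copytree(src, dst, dirs_exist_ok=True)


def demo() -> list[str]:
    """Sync twice, deleting a local file in between, and list the remote files."""
    with tempfile.TemporaryDirectory() as tmp:
        local = Path(tmp) / "local"
        remote = Path(tmp) / "remote"
        local.mkdir()
        (local / "train.py").write_text("print('v1')\n")
        (local / "old_config.yaml").write_text("lr: 1e-4\n")
        additive_sync(local, remote)  # first sync: both files arrive

        (local / "old_config.yaml").unlink()  # delete the file locally
        additive_sync(local, remote)  # second sync does not remove it remotely

        return sorted(p.name for p in remote.iterdir())


if __name__ == "__main__":
    print(demo())  # old_config.yaml is still listed on the simulated remote
```

The deleted `old_config.yaml` is still present after the second sync, which is the situation the cleanup commands above are meant to handle.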
Managed Jobs
Use `sky jobs launch` for long-running jobs that should run unattended. SkyPilot manages the full lifecycle: provisioning, execution, recovery from preemptions/quota/failures, and teardown:
```yaml
# managed-job.yaml
name: training-job
resources:
  accelerators: A100:8
run: |
  python train.py --resume-from-checkpoint
```
```bash
# Launch as managed job
sky jobs launch managed-job.yaml

# Check status
sky jobs queue -o json

# Stream logs
sky jobs logs <job_id>

# Cancel
sky jobs cancel <job_id>
```
**Checkpoint pattern**: Your training script should save checkpoints to persistent storage (cloud bucket or volume) and resume from the latest checkpoint on restart. SkyPilot handles the cluster recovery; your script handles the state recovery.
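The script side of the checkpoint pattern might look like the sketch below. It is a toy loop using only the standard library, not a real trainer: `latest_checkpoint`, `train`, and the `step_N.json` naming are hypothetical, and `ckpt_dir` is assumed to point at persistent storage such as a mounted bucket or volume.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def latest_checkpoint(ckpt_dir: Path) -> Optional[Path]:
    """Return the checkpoint with the highest step number, or None."""
    checkpoints = sorted(
        ckpt_dir.glob("step_*.json"),
        key=lambda p: int(p.stem.split("_")[1]),
    )
    return checkpoints[-1] if checkpoints else None


def train(ckpt_dir: Path, total_steps: int, save_every: int = 2) -> int:
    """Run a toy training loop; return the step it started (resumed) from."""
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    last = latest_checkpoint(ckpt_dir)
    start = json.loads(last.read_text())["step"] if last else 0
    for step in range(start + 1, total_steps + 1):
        # ... one real training step would go here ...
        if step % save_every == 0:
            # Write to a temporary name first so an interruption mid-write
            # does not leave a truncated checkpoint under the final name.
            tmp = ckpt_dir / f"step_{step}.json.tmp"
            tmp.write_text(json.dumps({"step": step}))
            tmp.replace(ckpt_dir / f"step_{step}.json")
    return start


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        ckpts = Path(d) / "ckpts"
        print(train(ckpts, total_steps=5))   # fresh start
        print(train(ckpts, total_steps=10))  # resumes from the last saved step
```

The write-then-rename step assumes a filesystem where rename is atomic; bucket mounts may behave differently, so treat that detail as an assumption to verify for your storage.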
SkyServe: Model Serving
```yaml
# serve.yaml
resources:
  accelerators: A100:1
  ports: 8080
run: |
  python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --port 8080
service:
  readiness_probe: /v1/models
  replica_policy:
    min_replicas: 1
    max_replicas: 3
    target_qps_per_replica: 5
```
```bash
# Start service
sky serve up serve.yaml -n my-llm

# Check status / get endpoint
sky serve status my-llm
sky serve status my-llm --endpoint

# Update (rolling)
sky serve update my-llm new-serve.yaml

# Tear down
sky serve down my-llm
```
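As a rough mental model of the `replica_policy` above, the replica count can be thought of as the request rate divided by `target_qps_per_replica`, rounded up and clamped to the range from `min_replicas` to `max_replicas`. The sketch below is an assumption for illustration only: `desired_replicas` is a hypothetical function, and the real autoscaler may apply smoothing or delays that this ignores.

```python
import math


def desired_replicas(qps: float, target_qps_per_replica: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Simplified autoscaling rule: ceil(qps / target), clamped to [min, max]."""
    needed = math.ceil(qps / target_qps_per_replica)
    return max(min_replicas, min(max_replicas, needed))


if __name__ == "__main__":
    # Values from the serve.yaml above: min 1, max 3, target 5 QPS per replica.
    for qps in (0, 4, 7, 12, 100):
        print(qps, "QPS ->", desired_replicas(qps, 5, 1, 3), "replica(s)")
```

Under this simplified rule, the service never drops below one replica when idle and never grows past three, however high the load.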
Common Workflows
Fine-Tuning Workflow
1. Write task YAML with `setup` (install deps) and `run` (training command)
2. Use `file_mounts` or `workdir` to sync code
3. `sky launch -c train task.yaml` to launch
4. `sky logs train` to monitor
5. `sky exec train -- python eval.py` to evaluate on same cluster
6. `sky down train` when done
Hyperparameter Sweep
1. Create parameterized YAML with `envs`
2. Launch multiple managed jobs:
```bash
for lr in 1e-4 1e-5 1e-6; do
  sky jobs launch sweep.yaml --env LR=$lr --name sweep-lr-$lr
done
```
3. Monitor with `sky jobs queue -o json`
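When the sweep is driven from a script rather than a shell loop, the same commands can be generated in Python. This is a minimal sketch: `sweep_commands` is a hypothetical helper, and it only builds the argument lists for the `sky jobs launch` invocation shown above without running them.

```python
import shlex


def sweep_commands(yaml_path: str, learning_rates: list[str]) -> list[list[str]]:
    """Build one `sky jobs launch` argv per learning rate."""
    return [
        ["sky", "jobs", "launch", yaml_path,
         "--env", f"LR={lr}", "--name", f"sweep-lr-{lr}"]
        for lr in learning_rates
    ]


if __name__ == "__main__":
    for argv in sweep_commands("sweep.yaml", ["1e-4", "1e-5", "1e-6"]):
        # Print instead of executing; pass argv to
        # subprocess.run(argv, check=True) to actually submit the jobs.
        print(shlex.join(argv))
```

Keeping each command as a list avoids shell-quoting problems if a parameter value ever contains spaces or special characters.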
Model Serving Deployment
1. Write serve YAML with a `service:` section
2. `sky serve up serve.yaml -n my-service`
3. Get endpoint: `sky serve status my-service --endpoint`
4. Update model: `sky serve update my-service updated.yaml`
Parallel Experiment Submission
Use `sky exec -d` to submit jobs to multiple VMs without blocking, then collect results:
```bash
# Submit all experiments (detached, returns after job is queued)
for i in 1 2 3 4; do
  sky exec exp-vm-0$i task.yaml --env LR=1e-$i -d
done

# Get the latest job ID from a cluster
job_id=$(sky queue exp-vm-01 -o json \
  | python3 -c "import sys, json; jobs = json.load(sys.stdin).get('exp-vm-01', []); print(max(j['job_id'] for j in jobs) if jobs else '')")

# Wait for a specific job and fetch last 50 lines
sky logs exp-vm-01 $job_id --status && sky logs exp-vm-01 $job_id --tail 50

# Check all jobs across a cluster at once
sky queue exp-vm-01 -o json
```
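The inline `python3 -c` one-liner above is easier to read and test as a small function. This sketch assumes the same JSON shape the one-liner relies on (an object keyed by cluster name whose values are lists of job records with a `job_id` field); `latest_job_id` is a hypothetical helper and the sample payload is made up, not captured from a real `sky queue` run.

```python
import json
from typing import Optional


def latest_job_id(queue_json: str, cluster: str) -> Optional[int]:
    """Return the highest job_id for `cluster`, or None if it has no jobs.

    Assumes the payload is an object keyed by cluster name whose values are
    lists of job records with a `job_id` field.
    """
    jobs = json.loads(queue_json).get(cluster, [])
    return max((job["job_id"] for job in jobs), default=None)


if __name__ == "__main__":
    # Made-up sample payload for illustration.
    sample = json.dumps({
        "exp-vm-01": [
            {"job_id": 1, "status": "SUCCEEDED"},
            {"job_id": 2, "status": "RUNNING"},
        ]
    })
    print(latest_job_id(sample, "exp-vm-01"))
    print(latest_job_id(sample, "exp-vm-02"))
```

Returning `None` for an unknown or empty cluster makes the "no jobs yet" case explicit instead of relying on an empty string.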
Agent Feedback Loop
When using SkyPilot programmatically, follow this loop:
1. Validate: `sky launch --dryrun task.yaml` (check resource availability/cost)
2. Launch: `sky launch -c mycluster task.yaml`
3. Monitor: `sky status -o json` and `sky queue mycluster -o json`
4. Wait for completion: `sky logs mycluster <JOB_ID>` (streams logs so you can observe progress and react to stalls; blocks until job finishes; get JOB_ID from `sky queue mycluster -o json`). For long-running jobs where you don't need intermediate output, use `sky logs mycluster <JOB_ID> --status` instead (blocks silently, exits 0 on success).
5. Inspect output: `sky logs mycluster <JOB_ID> --no-follow` or `sky logs mycluster <JOB_ID> --tail 100`
6. Debug: `ssh mycluster` (interactive)
7. Iterate: `sky exec mycluster updated_task.yaml` (run on existing cluster)
8. Cleanup: `sky down mycluster`
Never poll with `sky queue` + `sleep`. Use `sky logs CLUSTER JOB_ID` to stream logs and block until done. Use `--status` if you only need the exit code, or `--tail N` to fetch recent output after completion.
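The wait step can be wrapped so that an agent gets a status name back instead of a bare exit code. This is a sketch, not SkyPilot's SDK: `job_status` and `_run_exit_code` are hypothetical helpers, and the mapping assumes the exit codes listed in this document (0, 100, 101, 102, 103) belong to `sky logs CLUSTER JOB_ID --status`.

```python
import subprocess
from typing import Callable, Sequence

# Exit codes as listed in this document for the status check.
STATUS_BY_EXIT_CODE = {
    0: "succeeded",
    100: "failed",
    101: "not finished",
    102: "not found",
    103: "cancelled",
}


def _run_exit_code(argv: Sequence[str]) -> int:
    """Default runner: execute the command and return its exit code."""
    return subprocess.run(list(argv), check=False).returncode


def job_status(cluster: str, job_id: int,
               run: Callable[[Sequence[str]], int] = _run_exit_code) -> str:
    """Map the exit code of `sky logs ... --status` to a status name."""
    code = run(["sky", "logs", cluster, str(job_id), "--status"])
    return STATUS_BY_EXIT_CODE.get(code, f"unknown exit code {code}")


if __name__ == "__main__":
    # Stub runner so the example runs without a cluster; drop the `run`
    # argument to call the real CLI.
    print(job_status("mycluster", 1, run=lambda argv: 100))
```

The injectable `run` argument keeps the helper testable without a live cluster; in real use the default runner blocks on the CLI call, so no `sleep` loop is needed.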
Common Agent Mistakes
| Mistake | Why it's wrong | Do this instead |
|---|---|---|
| Manually picking cloud/region from `sky gpus list` output | SkyPilot optimizer does this automatically and better | Just set `accelerators:` and let SkyPilot choose |
| Using `sky launch` for long-running unattended jobs | No recovery if preempted or interrupted | Use `sky jobs launch` |
| Forgetting `sky down` | Wastes money on idle clusters | Always clean up, or use `-i` with `--down` at launch |
| Hardcoding `infra:` | Limits availability and increases cost | Only set `infra:` when the user explicitly requests a cloud or region |
| Not using `envs` | Hard to reuse or override from CLI | Use `envs` in the YAML and override with `--env` |
| Running `sky launch` without `-c` | Creates randomly-named cluster, hard to reference | Always name clusters with `-c` |
| Parsing table output from status commands | Table formatting is for humans, fragile to parse | Use `-o json` |
| Using deprecated | Deprecated in favor of | Use |
| Polling job status with `sky queue` + `sleep` | Wastes tokens, introduces timing bugs, fragile | Use `sky logs CLUSTER JOB_ID` |
| Assuming workdir sync removes remote files | rsync is additive; old remote files persist across `sky exec` runs | SSH and manually clean `~/sky_workdir` |
| Not using `--tail` | Streaming full logs wastes tokens for long jobs | Use `--tail N` |
Common Issues Quick Reference
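| Issue | Solution |
|---|---|
| GPU not available | Use `any_of` to list several acceptable GPU types |
| Setup takes too long | SkyPilot caches setup; use `sky exec` to reuse an existing cluster |
| Task fails silently | Check `sky logs CLUSTER JOB_ID` |
| Cluster stuck in INIT | |
| Preemption/quota | Use `sky jobs launch` |
| Port not accessible | Ensure `ports:` is set in `resources` |
| File sync slow | Use cloud bucket mounts instead of `workdir` for large datasets |
| Credentials error | Run `sky check` |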
References
For detailed reference documentation:
- CLI Reference — All commands and flags
- YAML Specification — Complete task YAML schema, file mounts, environment variables
- Python SDK — Programmatic API and SDK usage
- Advanced Patterns — Multi-cloud, distributed training, production patterns
- Troubleshooting — Error diagnosis and solutions
- Examples — Copy-paste task YAML examples