# SkyPilot Multi-Cloud Orchestration
Comprehensive guide to running ML workloads across clouds with automatic cost optimization using SkyPilot.
## When to use SkyPilot
Use SkyPilot when you:
- Run ML workloads across multiple clouds (AWS, GCP, Azure, etc.)
- Need cost optimization with automatic cloud/region selection
- Run long jobs on spot instances with auto-recovery
- Manage distributed multi-node training
- Want a unified interface for 20+ cloud providers
- Need to avoid vendor lock-in
Key features:
- Multi-cloud: AWS, GCP, Azure, Kubernetes, Lambda, RunPod, 20+ providers
- Cost optimization: Automatic cheapest cloud/region selection
- Spot instances: 3-6x cost savings with automatic recovery
- Distributed training: Multi-node jobs with gang scheduling
- Managed jobs: Auto-recovery, checkpointing, fault tolerance
- Sky Serve: Model serving with autoscaling
Consider alternatives instead:
- Modal: For simpler serverless GPU with Python-native API
- RunPod: For single-cloud persistent pods
- Kubernetes: For existing K8s infrastructure
- Ray: For pure Ray-based orchestration
## Quick start
### Installation
```bash
pip install "skypilot[aws,gcp,azure,kubernetes]"

# Verify cloud credentials
sky check
```
### Hello World
Create `hello.yaml`:

```yaml
resources:
  accelerators: T4:1

run: |
  nvidia-smi
  echo "Hello from SkyPilot!"
```

Launch, connect, and tear down:

```bash
sky launch -c hello hello.yaml

# SSH to cluster
ssh hello

# Terminate
sky down hello
```
## Core concepts
### Task YAML structure
```yaml
# Task name (optional)
name: my-task

# Resource requirements
resources:
  cloud: aws            # Optional: auto-select if omitted
  region: us-west-2     # Optional: auto-select if omitted
  accelerators: A100:4  # GPU type and count
  cpus: 8+              # Minimum CPUs
  memory: 32+           # Minimum memory (GB)
  use_spot: true        # Use spot instances
  disk_size: 256        # Disk size (GB)

# Number of nodes for distributed training
num_nodes: 2

# Working directory (synced to ~/sky_workdir)
workdir: .

# Setup commands (run once)
setup: |
  pip install -r requirements.txt

# Run commands
run: |
  python train.py
```
### Key commands
| Command | Purpose |
|---|---|
| `sky launch` | Launch cluster and run task |
| `sky exec` | Run task on existing cluster |
| `sky status` | Show cluster status |
| `sky stop` | Stop cluster (preserve state) |
| `sky down` | Terminate cluster |
| `sky logs` | View task logs |
| `sky queue` | Show job queue |
| `sky jobs launch` | Launch managed job |
| `sky serve up` | Deploy serving endpoint |
## GPU configuration
### Available accelerators
```yaml
# NVIDIA GPUs
accelerators: T4:1
accelerators: L4:1
accelerators: A10G:1
accelerators: L40S:1
accelerators: A100:4
accelerators: A100-80GB:8
accelerators: H100:8

# Cloud-specific
accelerators: V100:4    # AWS/GCP
accelerators: TPU-v4-8  # GCP TPUs
```
### GPU fallbacks
```yaml
resources:
  accelerators:
    H100: 8
    A100-80GB: 8
    A100: 8
  any_of:
    - cloud: gcp
    - cloud: aws
    - cloud: azure
```

### Spot instances
```yaml
resources:
  accelerators: A100:8
  use_spot: true
  spot_recovery: FAILOVER  # Auto-recover on preemption
```

## Cluster management
### Launch and execute
```bash
# Launch new cluster
sky launch -c mycluster task.yaml

# Run on existing cluster (skip setup)
sky exec mycluster another_task.yaml

# Interactive SSH
ssh mycluster

# Stream logs
sky logs mycluster
```
### Autostop
```yaml
resources:
  accelerators: A100:4
  autostop:
    idle_minutes: 30
    down: true  # Terminate instead of stop
```

```bash
# Set autostop via CLI
sky autostop mycluster -i 30 --down
```
### Cluster status
```bash
# All clusters
sky status

# Detailed view
sky status -a
```
## Distributed training
### Multi-node setup
```yaml
resources:
  accelerators: A100:8

num_nodes: 4  # 4 nodes × 8 GPUs = 32 GPUs total

setup: |
  pip install torch torchvision

run: |
  torchrun \
    --nnodes=$SKYPILOT_NUM_NODES \
    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    --node_rank=$SKYPILOT_NODE_RANK \
    --master_addr=$(echo "$SKYPILOT_NODE_IPS" | head -n1) \
    --master_port=12355 \
    train.py
```

### Environment variables
| Variable | Description |
|---|---|
| `SKYPILOT_NODE_RANK` | Node index (0 to num_nodes-1) |
| `SKYPILOT_NODE_IPS` | Newline-separated IP addresses |
| `SKYPILOT_NUM_NODES` | Total number of nodes |
| `SKYPILOT_NUM_GPUS_PER_NODE` | GPUs per node |
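Inside a training script these variables can be consumed directly; a minimal sketch (the helper name and the simulated values are illustrative, not part of SkyPilot):

```python
import os

def skypilot_dist_info(env=None):
    """Derive torchrun-style settings from SkyPilot's per-node env vars."""
    env = os.environ if env is None else env
    node_ips = env["SKYPILOT_NODE_IPS"].splitlines()
    return {
        "master_addr": node_ips[0],  # head node is the first IP in the list
        "node_rank": int(env["SKYPILOT_NODE_RANK"]),
        "num_nodes": int(env["SKYPILOT_NUM_NODES"]),
        "gpus_per_node": int(env["SKYPILOT_NUM_GPUS_PER_NODE"]),
    }

if __name__ == "__main__":
    # Simulated values, as SkyPilot would populate them on node 1 of 2
    fake = {
        "SKYPILOT_NODE_IPS": "10.0.0.1\n10.0.0.2",
        "SKYPILOT_NODE_RANK": "1",
        "SKYPILOT_NUM_NODES": "2",
        "SKYPILOT_NUM_GPUS_PER_NODE": "8",
    }
    print(skypilot_dist_info(fake)["master_addr"])  # 10.0.0.1
```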
### Head-node-only execution
```yaml
run: |
  if [ "${SKYPILOT_NODE_RANK}" == "0" ]; then
    python orchestrate.py
  fi
```

## Managed jobs
### Spot recovery
```bash
# Launch managed job with spot recovery
sky jobs launch -n my-job train.yaml
```
### Checkpointing
```yaml
name: training-job

file_mounts:
  /checkpoints:
    name: my-checkpoints
    store: s3
    mode: MOUNT

resources:
  accelerators: A100:8
  use_spot: true

run: |
  python train.py \
    --checkpoint-dir /checkpoints \
    --resume-from-latest
```
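Note that `--resume-from-latest` is a convention your own `train.py` implements, not a SkyPilot flag. A hedged sketch of the checkpoint-discovery half, assuming checkpoints are written as `step-<N>.pt`:

```python
from pathlib import Path

def latest_checkpoint(ckpt_dir):
    """Return the highest-numbered checkpoint, or None on a fresh start.

    Assumes files named step-<N>.pt; adapt the glob to your own naming.
    """
    ckpts = sorted(
        Path(ckpt_dir).glob("step-*.pt"),
        key=lambda p: int(p.stem.split("-")[1]),  # numeric sort on <N>
    )
    return ckpts[-1] if ckpts else None
```

After a spot preemption the managed job relaunches the task and training resumes from this file instead of step 0; the bucket mount is what lets checkpoints survive the instance.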
### Job management
```bash
# List jobs
sky jobs queue

# View logs
sky jobs logs my-job

# Cancel job
sky jobs cancel my-job
```
## File mounts and storage
### Local file sync
```yaml
workdir: ./my-project  # Synced to ~/sky_workdir

file_mounts:
  /data/config.yaml: ./config.yaml
  ~/.vimrc: ~/.vimrc
```

### Cloud storage
```yaml
file_mounts:
  # Mount S3 bucket
  /datasets:
    source: s3://my-bucket/datasets
    mode: MOUNT  # Stream from S3

  # Copy GCS bucket
  /models:
    source: gs://my-bucket/models
    mode: COPY  # Pre-fetch to disk

  # Cached mount (fast writes)
  /outputs:
    name: my-outputs
    store: s3
    mode: MOUNT_CACHED
```

### Storage modes
| Mode | Description | Best For |
|---|---|---|
| `MOUNT` | Stream from cloud | Large datasets, read-heavy |
| `COPY` | Pre-fetch to disk | Small files, random access |
| `MOUNT_CACHED` | Cache with async upload | Checkpoints, outputs |
## Sky Serve (Model Serving)
### Basic service
`service.yaml`:

```yaml
service:
  readiness_probe: /health
  replica_policy:
    min_replicas: 1
    max_replicas: 10
    target_qps_per_replica: 2.0

resources:
  accelerators: A100:1

run: |
  python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-chat-hf \
    --port 8000
```

```bash
# Deploy
sky serve up -n my-service service.yaml

# Check status
sky serve status

# Get endpoint
sky serve status my-service
```
### Autoscaling policies
```yaml
service:
  replica_policy:
    min_replicas: 1
    max_replicas: 10
    target_qps_per_replica: 2.0
    upscale_delay_seconds: 60
    downscale_delay_seconds: 300
  load_balancing_policy: round_robin
```
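The replica count implied by this policy is roughly incoming QPS divided by `target_qps_per_replica`, clamped to the configured bounds. A back-of-envelope sketch, not SkyPilot's actual autoscaler code:

```python
import math

def target_replicas(qps, per_replica=2.0, lo=1, hi=10):
    """Replicas needed to hold per-replica load at the target QPS."""
    return max(lo, min(hi, math.ceil(qps / per_replica)))

print(target_replicas(7))   # 4: ceil(7 / 2.0)
print(target_replicas(0))   # 1: floored at min_replicas
print(target_replicas(50))  # 10: capped at max_replicas
```

The upscale/downscale delays debounce these decisions so short QPS spikes do not thrash replicas up and down.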
## Cost optimization
### Automatic cloud selection
```yaml
# SkyPilot finds the cheapest option
resources:
  accelerators: A100:8
  # No cloud specified - auto-select cheapest
```

```bash
# Show optimizer decision
sky launch task.yaml --dryrun
```
### Cloud preferences
```yaml
resources:
  accelerators: A100:8
  any_of:
    - cloud: gcp
      region: us-central1
    - cloud: aws
      region: us-east-1
    - cloud: azure
```
## Environment variables
```yaml
envs:
  HF_TOKEN: $HF_TOKEN  # Inherited from local env
  WANDB_API_KEY: $WANDB_API_KEY

# Or use secrets
secrets:
  - HF_TOKEN
  - WANDB_API_KEY
```
## Common workflows
### Workflow 1: Fine-tuning with checkpoints
```yaml
name: llm-finetune

file_mounts:
  /checkpoints:
    name: finetune-checkpoints
    store: s3
    mode: MOUNT_CACHED

resources:
  accelerators: A100:8
  use_spot: true

setup: |
  pip install transformers accelerate

run: |
  python train.py \
    --checkpoint-dir /checkpoints \
    --resume
```
### Workflow 2: Hyperparameter sweep
```yaml
name: hp-sweep-${RUN_ID}

envs:
  RUN_ID: 0
  LEARNING_RATE: 1e-4
  BATCH_SIZE: 32

resources:
  accelerators: A100:1
  use_spot: true

run: |
  python train.py \
    --lr $LEARNING_RATE \
    --batch-size $BATCH_SIZE \
    --run-id $RUN_ID
```

```bash
# Launch multiple jobs
for i in {1..10}; do
  sky jobs launch sweep.yaml \
    --env RUN_ID=$i \
    --env LEARNING_RATE=$(python -c "import random; print(10**random.uniform(-5,-3))")
done
```
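The inline `python -c` one-liner samples learning rates log-uniformly over [1e-5, 1e-3]: uniform in exponent space, `10**x` in value space. Pulled out as a helper for clarity (the explicit seeding is an addition for reproducible sweeps):

```python
import random

def sample_lr(rng, low_exp=-5, high_exp=-3):
    """Log-uniform sample: uniform in the exponent, 10**x in the value."""
    return 10 ** rng.uniform(low_exp, high_exp)

rng = random.Random(0)  # fixed seed so a re-run draws the same sweep
lrs = [sample_lr(rng) for _ in range(10)]
assert all(1e-5 <= lr <= 1e-3 for lr in lrs)
```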
## Debugging
```bash
# SSH to cluster
ssh mycluster

# View logs
sky logs mycluster

# Check job queue
sky queue mycluster

# View managed job logs
sky jobs logs my-job
```
### Common issues
| Issue | Solution |
|---|---|
| Quota exceeded | Request quota increase, try different region |
| Spot preemption | Use managed jobs (`sky jobs launch`) for auto-recovery |
| Slow file sync | Use `MOUNT_CACHED` for outputs |
| GPU not available | Use GPU fallbacks (`any_of`) to widen the search |
## References

- Advanced Usage - Multi-cloud, optimization, production patterns
- Troubleshooting - Common issues and solutions
## Resources
- Documentation: https://docs.skypilot.co
- GitHub: https://github.com/skypilot-org/skypilot
- Slack: https://slack.skypilot.co
- Examples: https://github.com/skypilot-org/skypilot/tree/master/examples