SkyPilot Skill
SkyPilot is a unified framework to run AI workloads on any cloud, Slurm or Kubernetes. It provides a single interface to launch clusters, run jobs, and serve models across 25+ clouds (AWS, GCP, Azure, Coreweave, Nebius, Lambda, Together AI, RunPod, and more), Kubernetes clusters, and Slurm clusters.
When to Use SkyPilot
Use SkyPilot when you need to:
- Manage compute resources on any cloud, Slurm, or Kubernetes cluster
- Launch CPU/GPU/TPU (GB300, GB200, B200, H200, H100, etc.) on any cloud, Kubernetes or Slurm
- Run training, fine-tuning, or batch inference jobs
- Serve models with autoscaling and multi-cloud replicas (SkyServe)
- Run long-running jobs with automatic lifecycle management and recovery (managed jobs)
- Find the cheapest or most available GPU across clouds
Don't use SkyPilot for:
- Local-only workloads (use Docker/conda directly)
Capabilities: When to Use What
SkyPilot has three core abstractions. Use the right one for each stage of your workflow:
1. SkyPilot Clusters (
/
) — Interactive development and debugging
- Use during initial development, debugging, and experimentation
- Launch a cluster, SSH in or connect VSCode/Cursor (
code --remote ssh-remote+CLUSTER
), iterate quickly
- Cluster stays up until you stop/down it or autostop triggers
- Best for: prototyping, debugging, short experiments
2. Managed Jobs (
) — Long-running training and batch jobs
- Use when submitting long-running jobs that should run unattended
- Manages the full lifecycle: provisioning, execution, recovery, and teardown
- Automatically recovers from spot preemptions, quota limits, and transient failures
- Works across clouds, Kubernetes, and Slurm (handles preemptions and quota)
- Best for: training runs, fine-tuning, hyperparameter sweeps, batch inference
3. SkyServe (
) — Production model serving
- Use when serving models at scale with autoscaling
- Start with + open port to test your serving setup, then use to scale
- Provides load balancing, autoscaling, and multi-cloud replicas
- Best for: model serving endpoints, API services
Before You Start (Agent Bootstrap)
Bootstrap to confirm SkyPilot is installed, connected to an API server, and has cloud credentials. Once confirmed, skip straight to the user's task.
Step 1: Check installation and API server connectivity
| Output contains | Meaning | Next action |
|---|
| Server version and status | Server is running and connected | Bootstrap done. Skip to user's task. |
No SkyPilot API server is connected
| No server connected | Go to "Start or connect a server" below. |
Could not connect to SkyPilot API server
| Remote server unreachable or auth expired | Tell the user and suggest sky api login --relogin -e <endpoint>
to reconnect. |
| SkyPilot not installed | Go to "Install SkyPilot" below. |
Install SkyPilot (only if
command not found):
bash
pip install "skypilot[aws,gcp,kubernetes]" # Pick clouds the user needs
Ask the user which clouds they need if unclear, then re-run
.
Start or connect a server (only if "not running"):
Ask the user:
Do you have an existing SkyPilot API server to connect to, or should I start one locally?
- Connect to existing server:
sky api login -e <API_SERVER_URL>
— get the URL from the user.
- Start locally:
After either path, re-run
to confirm the server is reachable.
Step 2: Check cloud credentials (only for fresh setups — skip if the server was already running)
This shows which clouds are enabled or disabled. If the user's target cloud is not enabled, guide them through credential setup (see Troubleshooting).
Essential Commands
Use
with status/query commands to get structured JSON output instead of tables.
Clusters — interactive development and debugging:
| Command | Description |
|---|
sky launch -c NAME task.yaml
| Launch a cluster or run a task |
| Run task on existing cluster (skips provisioning); syncs workdir each time |
sky exec NAME task.yaml -d
| Same, but detach immediately (don't stream logs) |
| Show all clusters |
| Stream job logs from a cluster |
sky logs NAME --no-follow
| Print existing logs and exit immediately |
| Print last 50 lines of logs and exit |
| Exit with code 0=succeeded, 100=failed, 101=not finished, 102=not found, 103=cancelled |
| List jobs on a cluster with status (structured JSON) |
| / | Stop/restart to save costs (preserves disk) |
| Tear down a cluster completely |
| List available GPU types across clouds |
Managed Jobs — long-running unattended workloads:
| Command | Description |
|---|
sky jobs launch task.yaml
| Launch a managed job (auto lifecycle + recovery) |
| Show all managed jobs and their status |
| Stream logs from a managed job |
| Cancel a managed job |
SkyServe — model serving with autoscaling:
| Command | Description |
|---|
sky serve up serve.yaml -n NAME
| Start a model serving service |
| Show service status and endpoint URL |
sky serve update NAME new.yaml
| Update a running service (rolling) |
| Tear down a service |
For complete CLI reference, see CLI Reference.
Quick Start
bash
# Launch a GPU cluster
sky launch -c mycluster --gpus H100 -- nvidia-smi
# Run a task from YAML
sky launch -c mycluster task.yaml
# SSH into cluster
ssh mycluster
# Connect VSCode or Cursor to the cluster for interactive development
code --remote ssh-remote+mycluster /home/user/sky_workdir
# or: cursor --remote ssh-remote+mycluster /home/user/sky_workdir
# Tear down
sky down mycluster
Task YAML Structure
The task YAML is SkyPilot's primary interface. All fields are optional.
yaml
# task.yaml
name: my-training-job
# Local directory to sync to remote ~/sky_workdir
workdir: .
# Number of nodes (for distributed training)
num_nodes: 1
resources:
# GPU/TPU accelerators (SkyPilot auto-selects the cheapest cloud/region)
accelerators: H200:8
# Optional: pin to a specific cloud/region/infra
# infra: aws # or aws/us-east-1, k8s, ssh/my-pool
# If infra is left out, SkyPilot automatically fails over across all
# enabled clouds/regions to find the cheapest available option.
# Use spot instances for cost savings
use_spot: false
# Disk size in GB
disk_size: 256
# Open ports for serving
ports: 8080
# Environment variables (accessible in file_mounts, setup, and run)
envs:
MODEL_NAME: my-model
BATCH_SIZE: 32
# Setup: runs once on cluster creation, cached on reuse
setup: |
pip install torch transformers
# Run: the main command
run: |
python train.py --model $MODEL_NAME --batch-size $BATCH_SIZE
For complete YAML schema including file mounts, environment variables set by SkyPilot, and advanced fields, see YAML Specification.
GPU and Cloud Selection
IMPORTANT: Let SkyPilot choose the cloud and region. Do NOT manually pick a cloud/region/instance by parsing
output. SkyPilot's optimizer automatically selects the cheapest available option across all enabled clouds. Only specify
when the user explicitly requests a specific cloud or region.
Default behavior (recommended): Just specify the GPU type. SkyPilot finds the cheapest cloud/region automatically:
yaml
resources:
accelerators: H200:8 # SkyPilot picks the cheapest cloud/region with H200:8
If the user doesn't specify a GPU type, ask them what GPU they need (or what model/workload they're running so you can recommend one). Do NOT run
and pick for them — present options and let the user decide, or use
to let SkyPilot maximize availability:
yaml
# Let SkyPilot choose from multiple acceptable GPU types (cheapest wins)
resources:
any_of:
- accelerators: H100:8
- accelerators: A100-80GB:8
- accelerators: A100:8
Use
only when the user has a strict preference:
yaml
# Try H100 first on AWS, fall back to GCP, then A100
resources:
ordered:
- infra: aws/us-east-1
accelerators: H100:8
- infra: gcp/us-central1
accelerators: H100:8
- infra: aws/us-west-2
accelerators: A100-80GB:8
Only set
when the user explicitly says something like "use AWS" or "run on GCP us-central1":
yaml
resources:
infra: aws # User asked for AWS specifically
accelerators: H100:8
Cluster Lifecycle
bash
# Launch and run a task
sky launch -c mycluster task.yaml
# Launch with autostop at launch time (preferred: saves cost, no follow-up command needed)
sky launch -c mycluster task.yaml -i 30 # stop after 30 min idle
sky launch -c mycluster task.yaml -i 30 --down # tear down after 30 min idle
# Override or pass environment variables via CLI
sky launch -c mycluster task.yaml --env MODEL_NAME=llama3 --env BATCH_SIZE=64
# Re-run a different task on the same cluster (fast, skips provisioning)
sky exec mycluster another_task.yaml
# Run an inline command
sky exec mycluster -- python train.py --epochs 10
# Set autostop after launch (use if you forgot to set -i at launch time)
sky autostop mycluster -i 30 # stop after 30 min idle, preserving disk (can restart with sky start)
sky autostop mycluster -i 30 --down # tear down after 30 min idle (disk is deleted, cannot restart)
# Stop to save costs, restart later
sky stop mycluster
sky start mycluster
# Tear down completely
sky down mycluster
Workdir Sync Behavior
is synced to
on the remote via
before every
.
rsync is additive — deleted local files are NOT removed from the remote. This can cause experiments to run against stale build artifacts or old configs.
To ensure a clean slate, SSH and wipe before
:
bash
ssh mycluster "rm -rf ~/sky_workdir"
sky exec mycluster task.yaml
Or clean inside
if only specific artifacts need removal:
yaml
run: |
find ~/sky_workdir/build -name '*.o' -delete 2>/dev/null || true
cd ~/sky_workdir && make
Managed Jobs
Use
for long-running jobs that should run unattended. SkyPilot manages the full lifecycle — provisioning, execution, recovery from preemptions/quota/failures, and teardown:
yaml
# managed-job.yaml
name: training-job
resources:
accelerators: A100:8
run: |
python train.py --resume-from-checkpoint
bash
# Launch as managed job
sky jobs launch managed-job.yaml
# Check status
sky jobs queue -o json
# Stream logs
sky jobs logs <job_id>
# Cancel
sky jobs cancel <job_id>
Checkpoint pattern: Your training script should save checkpoints to persistent storage (cloud bucket or volume) and resume from the latest checkpoint on restart. SkyPilot handles the cluster recovery; your script handles the state recovery.
SkyServe: Model Serving
yaml
# serve.yaml
resources:
accelerators: A100:1
ports: 8080
run: |
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--port 8080
service:
readiness_probe: /v1/models
replica_policy:
min_replicas: 1
max_replicas: 3
target_qps_per_replica: 5
bash
# Start service
sky serve up serve.yaml -n my-llm
# Check status / get endpoint
sky serve status my-llm
sky serve status my-llm --endpoint
# Update (rolling)
sky serve update my-llm new-serve.yaml
# Tear down
sky serve down my-llm
Common Workflows
Fine-Tuning Workflow
- Write task YAML with (install deps) and (training command)
- Use or to sync code
sky launch -c train task.yaml
to launch
- to monitor
sky exec train -- python eval.py
to evaluate on same cluster
- when done
Hyperparameter Sweep
- Create parameterized YAML with
- Launch multiple managed jobs:
bash
for lr in 1e-4 1e-5 1e-6; do
sky jobs launch sweep.yaml --env LR=$lr --name sweep-lr-$lr
done
- Monitor with
Model Serving Deployment
- Write serve YAML with section
sky serve up serve.yaml -n my-service
- Get endpoint:
sky serve status my-service --endpoint
- Update model:
sky serve update my-service updated.yaml
Parallel Experiment Submission
Use
to submit jobs to multiple VMs without blocking, then collect results:
bash
# Submit all experiments (detached, returns after job is queued)
for i in 1 2 3 4; do
sky exec exp-vm-0$i task.yaml --env LR=1e-$i -d
done
# Get the latest job ID from a cluster
job_id=$(sky queue exp-vm-01 -o json \
| python3 -c "import sys, json; jobs = json.load(sys.stdin).get('exp-vm-01', []); print(max(j['job_id'] for j in jobs) if jobs else '')")
# Wait for a specific job and fetch last 50 lines
sky logs exp-vm-01 $job_id --status && sky logs exp-vm-01 $job_id --tail 50
# Check all jobs across a cluster at once
sky queue exp-vm-01 -o json
Agent Feedback Loop
When using SkyPilot programmatically, follow this loop:
- Validate:
sky launch --dryrun task.yaml
(check resource availability/cost)
- Launch:
sky launch -c mycluster task.yaml
- Monitor: and
sky queue mycluster -o json
- Wait for completion:
sky logs mycluster <JOB_ID>
(streams logs so you can observe progress and react to stalls; blocks until job finishes; get JOB_ID from sky queue mycluster -o json
). For long-running jobs where you don't need intermediate output, use sky logs mycluster <JOB_ID> --status
instead (blocks silently, exits 0 on success).
- Inspect output:
sky logs mycluster <JOB_ID> --no-follow
or sky logs mycluster <JOB_ID> --tail 100
- Debug: (interactive)
- Iterate:
sky exec mycluster updated_task.yaml
(run on existing cluster)
- Cleanup:
Never poll with + — use
to stream logs and block until done. Use
if you only need the exit code, or
to fetch recent output after completion.
Common Agent Mistakes
| Mistake | Why it's wrong | Do this instead |
|---|
| Manually picking cloud/region from output | SkyPilot optimizer does this automatically and better | Just set and let SkyPilot choose |
| Using for long-running unattended jobs | No recovery if preempted or interrupted | Use for unattended work |
| Forgetting or autostop after work is done | Wastes money on idle clusters | Always clean up, or use at launch |
| Hardcoding without user asking | Limits availability and increases cost | Only set when user explicitly requests a cloud |
| Not using for configurable values | Hard to reuse or override from CLI | Use in YAML + for parameterization |
| Running without | Creates randomly-named cluster, hard to reference | Always name clusters with |
| Parsing table output from status commands | Table formatting is for humans, fragile to parse | Use for structured output |
| Using deprecated // fields | Deprecated in favor of | Use instead |
| Polling job status with + | Wastes tokens, introduces timing bugs, fragile | Use sky logs CLUSTER JOB_ID --status
to block until done |
| Assuming workdir sync removes remote files | rsync is additive; old remote files persist across calls | SSH and manually clean , or clean in script |
| Not using when only last output matters | Streaming full logs wastes tokens for long jobs | Use sky logs CLUSTER JOB_ID --tail 50
for last N lines |
Common Issues Quick Reference
| Issue | Solution |
|---|
| GPU not available | Use for fallback, or try different regions/clouds |
| Setup takes too long | SkyPilot caches setup; use to skip it on reruns |
| Task fails silently | Check or to debug |
| Cluster stuck in INIT | and relaunch |
| Preemption/quota | Use for automatic recovery and lifecycle management |
| Port not accessible | Ensure is set in resources and security groups allow traffic |
| File sync slow | Use cloud bucket mounts instead of for large datasets |
| Credentials error | Run and inspect which clouds are disabled |
References
For detailed reference documentation:
- CLI Reference — All commands and flags
- YAML Specification — Complete task YAML schema, file mounts, environment variables
- Python SDK — Programmatic API and SDK usage
- Advanced Patterns — Multi-cloud, distributed training, production patterns
- Troubleshooting — Error diagnosis and solutions
- Examples — Copy-paste task YAML examples