Skypilot Multi Cloud Orchestration

Name: Skypilot Multi Cloud Orchestration
Author: orchestra-research

orchestra-research/ai-research-skills

396 installs
11.2k repo stars
Updated June 16, 2026

skypilot-multi-cloud-orchestration is a Claude Code skill that helps developers run GPU training jobs and fail them over across GCP, AWS, Azure, and Kubernetes using SkyPilot YAML resource patterns.

About

skypilot-multi-cloud-orchestration is an AI research skill for SkyPilot multi-cloud GPU job orchestration. The guide defines YAML resources with accelerators like A100:8 and any_of cloud preference lists spanning GCP us-central1, AWS us-west-2, and Azure westus2, plus wildcard regions such as aws us-* for spot capacity. Kubernetes entries can precede public cloud fallbacks, and instance_type constraints like p4d.24xlarge pin specific hardware SKUs. Developers reach for skypilot-multi-cloud-orchestration when GPU training jobs must survive quota limits or regional outages by automatically failing over across clouds instead of maintaining separate launch scripts per provider.

Cloud fallback chains with any_of for GCP, AWS, Azure, and Kubernetes
Wildcard regions (e.g. us-*) and instance-type or CPU/memory/accelerator constraints
Production managed jobs with spot recovery FAILOVER and max_restarts_on_errors
Disk tier, network tier, and controller memory scaling for hundreds of jobs
Static credential guidance for long-lived SkyPilot controllers

Skypilot Multi Cloud Orchestration by the numbers

396 all-time installs (skills.sh)
+35 installs in the week ending Jul 18, 2026 (Skillselion tracking)
Ranked #393 of 1,041 Cloud & Infrastructure skills by installs in the Skillselion catalog
Security screen: HIGH risk (skills.sh audit)
Data as of Jul 28, 2026 (Skillselion catalog sync)

npx skills add https://github.com/orchestra-research/ai-research-skills --skill skypilot-multi-cloud-orchestration

How do you fail over GPU jobs across clouds with SkyPilot?

Run GPU training jobs and fail them over across GCP, AWS, Azure, and Kubernetes with SkyPilot YAML patterns from your agent.

Who is it for?

ML platform engineers orchestrating GPU training across GCP, AWS, Azure, and Kubernetes who need SkyPilot any_of failover YAML patterns.

Skip if: you train on a single cloud (for example, a Lambda Labs DDP setup), or your team does not have SkyPilot installed and only runs local torchrun scripts.

When should I use this skill?

User asks about SkyPilot multi-cloud YAML, GPU job failover, or any_of cloud resource configuration.

What you get

A SkyPilot task YAML, a multi-cloud any_of resource spec, and a launched cross-cloud GPU job with automatic failover.

SkyPilot task YAML
Multi-cloud job launch command
Failover-enabled GPU cluster allocation

By the numbers

Example accelerator request A100:8 across three public clouds
Documents p4d.24xlarge instance_type pinning for AWS

Files

SKILL.md (Markdown)

SkyPilot Multi-Cloud Orchestration

Comprehensive guide to running ML workloads across clouds with automatic cost optimization using SkyPilot.

When to use SkyPilot

Use SkyPilot when:

Running ML workloads across multiple clouds (AWS, GCP, Azure, etc.)
Need cost optimization with automatic cloud/region selection
Running long jobs on spot instances with auto-recovery
Managing distributed multi-node training
Want unified interface for 20+ cloud providers
Need to avoid vendor lock-in

Key features:

Multi-cloud: AWS, GCP, Azure, Kubernetes, Lambda, RunPod, 20+ providers
Cost optimization: Automatic cheapest cloud/region selection
Spot instances: 3-6x cost savings with automatic recovery
Distributed training: Multi-node jobs with gang scheduling
Managed jobs: Auto-recovery, checkpointing, fault tolerance
Sky Serve: Model serving with autoscaling

Use alternatives instead:

Modal: For simpler serverless GPU with Python-native API
RunPod: For single-cloud persistent pods
Kubernetes: For existing K8s infrastructure
Ray: For pure Ray-based orchestration

Quick start

Installation

pip install "skypilot[aws,gcp,azure,kubernetes]"

# Verify cloud credentials
sky check

Hello World

Create hello.yaml:

resources:
  accelerators: T4:1

run: |
  nvidia-smi
  echo "Hello from SkyPilot!"

Launch:

sky launch -c hello hello.yaml

# SSH to cluster
ssh hello

# Terminate
sky down hello

Core concepts

Task YAML structure

# Task name (optional)
name: my-task

# Resource requirements
resources:
  cloud: aws              # Optional: auto-select if omitted
  region: us-west-2       # Optional: auto-select if omitted
  accelerators: A100:4    # GPU type and count
  cpus: 8+                # Minimum CPUs
  memory: 32+             # Minimum memory (GB)
  use_spot: true          # Use spot instances
  disk_size: 256          # Disk size (GB)

# Number of nodes for distributed training
num_nodes: 2

# Working directory (synced to ~/sky_workdir)
workdir: .

# Setup commands (run once)
setup: |
  pip install -r requirements.txt

# Run commands
run: |
  python train.py

Key commands

Command	Purpose
`sky launch`	Launch cluster and run task
`sky exec`	Run task on existing cluster
`sky status`	Show cluster status
`sky stop`	Stop cluster (preserve state)
`sky down`	Terminate cluster
`sky logs`	View task logs
`sky queue`	Show job queue
`sky jobs launch`	Launch managed job
`sky serve up`	Deploy serving endpoint

GPU configuration

Available accelerators

# NVIDIA GPUs
accelerators: T4:1
accelerators: L4:1
accelerators: A10G:1
accelerators: L40S:1
accelerators: A100:4
accelerators: A100-80GB:8
accelerators: H100:8

# Cloud-specific
accelerators: V100:4         # AWS/GCP
accelerators: TPU-v4-8       # GCP TPUs

GPU fallbacks

resources:
  accelerators:
    H100: 8
    A100-80GB: 8
    A100: 8
  any_of:
    - cloud: gcp
    - cloud: aws
    - cloud: azure

Spot instances

resources:
  accelerators: A100:8
  use_spot: true
  job_recovery:
    strategy: FAILOVER  # Auto-recover on preemption (managed jobs; formerly spot_recovery)

Cluster management

Launch and execute

# Launch new cluster
sky launch -c mycluster task.yaml

# Run on existing cluster (skip setup)
sky exec mycluster another_task.yaml

# Interactive SSH
ssh mycluster

# Stream logs
sky logs mycluster

Autostop

resources:
  accelerators: A100:4
  autostop:
    idle_minutes: 30
    down: true  # Terminate instead of stop

# Set autostop via CLI
sky autostop mycluster -i 30 --down

Cluster status

# All clusters
sky status

# Detailed view
sky status -a

Distributed training

Multi-node setup

resources:
  accelerators: A100:8

num_nodes: 4  # 4 nodes × 8 GPUs = 32 GPUs total

setup: |
  pip install torch torchvision

run: |
  torchrun \
    --nnodes=$SKYPILOT_NUM_NODES \
    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    --node_rank=$SKYPILOT_NODE_RANK \
    --master_addr=$(echo "$SKYPILOT_NODE_IPS" | head -n1) \
    --master_port=12355 \
    train.py

Environment variables

Variable	Description
`SKYPILOT_NODE_RANK`	Node index (0 to num_nodes-1)
`SKYPILOT_NODE_IPS`	Newline-separated IP addresses
`SKYPILOT_NUM_NODES`	Total number of nodes
`SKYPILOT_NUM_GPUS_PER_NODE`	GPUs per node
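
The variables in the table can also be read from inside a training script. The following is a minimal Python sketch, not part of SkyPilot: the helper name `rendezvous_from_env` and the example IP addresses are made up for illustration, and the world size assumes one process per GPU.

```python
import os


def rendezvous_from_env(env=None):
    """Derive distributed-launch settings from SkyPilot's node variables."""
    env = os.environ if env is None else env
    # SKYPILOT_NODE_IPS is newline-separated; the first entry is the head node.
    node_ips = [ip.strip() for ip in env["SKYPILOT_NODE_IPS"].splitlines() if ip.strip()]
    num_nodes = int(env["SKYPILOT_NUM_NODES"])
    gpus_per_node = int(env["SKYPILOT_NUM_GPUS_PER_NODE"])
    node_rank = int(env["SKYPILOT_NODE_RANK"])
    return {
        "master_addr": node_ips[0],
        "node_rank": node_rank,
        "is_head": node_rank == 0,
        # Assumes one training process per GPU.
        "world_size": num_nodes * gpus_per_node,
    }


# Made-up values for a 4-node job with 8 GPUs per node, seen from node rank 2.
example_env = {
    "SKYPILOT_NODE_IPS": "10.0.0.1\n10.0.0.2\n10.0.0.3\n10.0.0.4",
    "SKYPILOT_NUM_NODES": "4",
    "SKYPILOT_NUM_GPUS_PER_NODE": "8",
    "SKYPILOT_NODE_RANK": "2",
}
settings = rendezvous_from_env(example_env)
print(settings)
```

Passing a dictionary instead of reading `os.environ` directly keeps the helper easy to test outside a cluster.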

Head-node-only execution

run: |
  if [ "${SKYPILOT_NODE_RANK}" == "0" ]; then
    python orchestrate.py
  fi

Managed jobs

Spot recovery

# Launch managed job with spot recovery
sky jobs launch -n my-job train.yaml

Checkpointing

name: training-job

file_mounts:
  /checkpoints:
    name: my-checkpoints
    store: s3
    mode: MOUNT

resources:
  accelerators: A100:8
  use_spot: true

run: |
  python train.py \
    --checkpoint-dir /checkpoints \
    --resume-from-latest
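
The task above relies on train.py knowing how to resume. As a rough sketch of what a `--resume-from-latest` flag might do, the Python below finds the newest checkpoint in the mounted directory. The `step_<N>.pt` file naming and the `latest_checkpoint` helper are assumptions for illustration; the guide does not specify a checkpoint layout.

```python
import re
import tempfile
from pathlib import Path

# Hypothetical naming scheme: step_100.pt, step_900.pt, step_1000.pt, ...
STEP_PATTERN = re.compile(r"^step_(\d+)\.pt$")


def latest_checkpoint(checkpoint_dir):
    """Return (step, path) for the highest-numbered checkpoint, or None."""
    best = None
    for path in Path(checkpoint_dir).glob("step_*.pt"):
        match = STEP_PATTERN.match(path.name)
        if match is None:
            continue
        step = int(match.group(1))
        # Compare steps numerically; sorting names as strings would rank 900 above 1000.
        if best is None or step > best[0]:
            best = (step, path)
    return best


# Demonstration on a temporary directory standing in for /checkpoints.
with tempfile.TemporaryDirectory() as tmp:
    for step in (100, 900, 1000):
        (Path(tmp) / f"step_{step}.pt").touch()
    found = latest_checkpoint(tmp)
    resume_step = found[0]
    resume_name = found[1].name

print(resume_step, resume_name)
```

When the function returns None, the script would start training from scratch, which is what happens on the first launch before any checkpoint exists.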

Job management

# List jobs
sky jobs queue

# View logs
sky jobs logs my-job

# Cancel job
sky jobs cancel my-job

File mounts and storage

Local file sync

workdir: ./my-project  # Synced to ~/sky_workdir

file_mounts:
  /data/config.yaml: ./config.yaml
  ~/.vimrc: ~/.vimrc

Cloud storage

file_mounts:
  # Mount S3 bucket
  /datasets:
    source: s3://my-bucket/datasets
    mode: MOUNT  # Stream from S3

  # Copy GCS bucket
  /models:
    source: gs://my-bucket/models
    mode: COPY  # Pre-fetch to disk

  # Cached mount (fast writes)
  /outputs:
    name: my-outputs
    store: s3
    mode: MOUNT_CACHED

Storage modes

Mode	Description	Best For
`MOUNT`	Stream from cloud	Large datasets, read-heavy
`COPY`	Pre-fetch to disk	Small files, random access
`MOUNT_CACHED`	Cache with async upload	Checkpoints, outputs

Sky Serve (Model Serving)

Basic service

# service.yaml
service:
  readiness_probe: /health
  replica_policy:
    min_replicas: 1
    max_replicas: 10
    target_qps_per_replica: 2.0

resources:
  ports: 8000            # Port the replicas listen on (matches --port below)
  accelerators: A100:1

run: |
  python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-chat-hf \
    --port 8000

# Deploy
sky serve up -n my-service service.yaml

# Check status
sky serve status

# Get endpoint
sky serve status my-service --endpoint

Autoscaling policies

service:
  replica_policy:
    min_replicas: 1
    max_replicas: 10
    target_qps_per_replica: 2.0
    upscale_delay_seconds: 60
    downscale_delay_seconds: 300
  load_balancing_policy: round_robin
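
To see how `target_qps_per_replica` interacts with the replica bounds, the Python sketch below works through the arithmetic. It is a simplified model and an assumption about the general approach, not SkyServe's actual autoscaler, which also applies the upscale and downscale delays before acting.

```python
import math


def target_replicas(current_qps, target_qps_per_replica, min_replicas, max_replicas):
    """Replicas needed to keep per-replica load at or below the target, within bounds."""
    desired = math.ceil(current_qps / target_qps_per_replica)
    return max(min_replicas, min(max_replicas, desired))


# With the policy above (min 1, max 10, target 2.0 QPS per replica):
print(target_replicas(7.0, 2.0, 1, 10))   # 7 / 2 = 3.5, rounded up to 4
print(target_replicas(0.0, 2.0, 1, 10))   # idle service stays at min_replicas
print(target_replicas(50.0, 2.0, 1, 10))  # demand for 25 replicas is capped at max_replicas
```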

Cost optimization

Automatic cloud selection

# SkyPilot finds cheapest option
resources:
  accelerators: A100:8
  # No cloud specified - auto-select cheapest

# Show optimizer decision
sky launch task.yaml --dryrun

Cloud preferences

resources:
  accelerators: A100:8
  any_of:
    - cloud: gcp
      region: us-central1
    - cloud: aws
      region: us-east-1
    - cloud: azure
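
The selection idea behind a candidate list can be sketched in a few lines of Python. This is a toy model under stated assumptions, not SkyPilot's optimizer: the hourly prices and the `pick_candidate` helper are hypothetical, and availability is faked with a set of clouds that are out of capacity.

```python
def pick_candidate(candidates, is_available):
    """Return the cheapest candidate that can currently be provisioned, else None."""
    for candidate in sorted(candidates, key=lambda c: c["hourly_usd"]):
        if is_available(candidate):
            return candidate
    return None


# Hypothetical prices for an A100:8 node; real prices vary by cloud and region.
candidates = [
    {"cloud": "gcp", "region": "us-central1", "hourly_usd": 30.0},
    {"cloud": "aws", "region": "us-east-1", "hourly_usd": 33.0},
    {"cloud": "azure", "region": None, "hourly_usd": 28.0},
]

# Pretend Azure has no capacity right now, so the next-cheapest candidate wins.
out_of_capacity = {"azure"}
chosen = pick_candidate(candidates, lambda c: c["cloud"] not in out_of_capacity)
print(chosen)
```

If every candidate is unavailable the function returns None, which corresponds to a launch that fails until capacity frees up or more candidates are added.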

Environment variables

envs:
  HF_TOKEN: $HF_TOKEN  # Inherited from local env
  WANDB_API_KEY: $WANDB_API_KEY

# Or use secrets
secrets:
  - HF_TOKEN
  - WANDB_API_KEY

Common workflows

Workflow 1: Fine-tuning with checkpoints

name: llm-finetune

file_mounts:
  /checkpoints:
    name: finetune-checkpoints
    store: s3
    mode: MOUNT_CACHED

resources:
  accelerators: A100:8
  use_spot: true

setup: |
  pip install transformers accelerate

run: |
  python train.py \
    --checkpoint-dir /checkpoints \
    --resume

Workflow 2: Hyperparameter sweep

name: hp-sweep-${RUN_ID}

envs:
  RUN_ID: 0
  LEARNING_RATE: 1e-4
  BATCH_SIZE: 32

resources:
  accelerators: A100:1
  use_spot: true

run: |
  python train.py \
    --lr $LEARNING_RATE \
    --batch-size $BATCH_SIZE \
    --run-id $RUN_ID

# Launch multiple jobs
for i in {1..10}; do
  sky jobs launch sweep.yaml \
    --env RUN_ID=$i \
    --env LEARNING_RATE=$(python -c "import random; print(10**random.uniform(-5,-3))")
done
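
The same sweep can be generated from Python when the sampled values need to be reproducible or logged. The sketch below only builds the command strings, using the `sky jobs launch` invocation and `--env` flags shown above; the `sweep_commands` helper and the fixed seed are assumptions for illustration.

```python
import random
import shlex


def sweep_commands(num_runs, yaml_path="sweep.yaml", seed=0):
    """Build one `sky jobs launch` command per run with a log-uniform learning rate."""
    rng = random.Random(seed)  # Fixed seed so the sweep can be regenerated exactly.
    commands = []
    for run_id in range(1, num_runs + 1):
        learning_rate = 10 ** rng.uniform(-5, -3)  # Log-uniform in [1e-5, 1e-3].
        argv = [
            "sky", "jobs", "launch", yaml_path,
            "--env", f"RUN_ID={run_id}",
            "--env", f"LEARNING_RATE={learning_rate:.3e}",
        ]
        commands.append(shlex.join(argv))
    return commands


for command in sweep_commands(10):
    print(command)
```

Printing the commands first makes it possible to review the sampled values before submitting anything; each line can then be run in a shell or passed to `subprocess.run`.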

Debugging

# SSH to cluster
ssh mycluster

# View logs
sky logs mycluster

# Check job queue
sky queue mycluster

# View managed job logs
sky jobs logs my-job

Common issues

Issue	Solution
Quota exceeded	Request quota increase, try different region
Spot preemption	Use `sky jobs launch` for auto-recovery
Slow file sync	Use `MOUNT_CACHED` mode for outputs
GPU not available	Use `any_of` for fallback clouds

References

[Advanced Usage](references/advanced-usage.md) - Multi-cloud, optimization, production patterns
[Troubleshooting](references/troubleshooting.md) - Common issues and solutions

Resources

Documentation: https://docs.skypilot.co
GitHub: https://github.com/skypilot-org/skypilot
Slack: https://slack.skypilot.co
Examples: https://github.com/skypilot-org/skypilot/tree/master/examples

SkyPilot Advanced Usage Guide

Multi-Cloud Strategies

Cloud selection patterns

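# Candidate clouds and regions: with any_of, the optimizer picks the cheapest available one
# (use ordered: instead of any_of to enforce a strict priority order)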
resources:
  accelerators: A100:8
  any_of:
    - cloud: gcp
      region: us-central1
    - cloud: aws
      region: us-west-2
    - cloud: azure
      region: westus2

Wildcard regions

resources:
  cloud: aws
  region: us-*  # Any US region
  accelerators: A100:8

Kubernetes + Cloud fallback

resources:
  accelerators: A100:8
  any_of:
    - cloud: kubernetes
    - cloud: aws
    - cloud: gcp

Advanced Resource Configuration

Instance type constraints

resources:
  instance_type: p4d.24xlarge  # Specific instance
  # OR
  cpus: 32+
  memory: 128+
  accelerators: A100:8

Disk configuration

resources:
  disk_size: 500  # GB
  disk_tier: best  # low, medium, high, ultra, best

Network tier

resources:
  network_tier: best  # High-performance networking

Production Managed Jobs

Job configuration

name: production-training

resources:
  accelerators: H100:8
  use_spot: true
  job_recovery:
    strategy: FAILOVER
    # Retry configuration: restart up to 3 times on user-code errors
    max_restarts_on_errors: 3

Controller scaling

For large-scale deployments (hundreds of jobs):

# Increase controller memory in ~/.sky/config.yaml
jobs:
  controller:
    resources:
      memory: 32+

Static credentials

Use non-expiring credentials for controllers:

# AWS: Use IAM role or long-lived access keys
# GCP: Use service account JSON key
# Azure: Use service principal

Advanced File Mounts

Git repository workdir

workdir:
  url: https://github.com/user/repo.git
  ref: main
  # For private repos, set GIT_TOKEN env var

Multiple storage backends

file_mounts:
  /data/s3:
    source: s3://my-bucket/data
    mode: MOUNT

  /data/gcs:
    source: gs://my-bucket/data
    mode: MOUNT

  /outputs:
    name: training-outputs
    store: s3
    mode: MOUNT_CACHED

Rsync exclude patterns

workdir: .

# Use .skyignore or .gitignore for excludes

Create .skyignore:

__pycache__/
*.pyc
.git/
.env
node_modules/

Distributed Training Patterns

PyTorch DDP

num_nodes: 4

resources:
  accelerators: A100:8

run: |
  torchrun \
    --nnodes=$SKYPILOT_NUM_NODES \
    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    --node_rank=$SKYPILOT_NODE_RANK \
    --master_addr=$(echo "$SKYPILOT_NODE_IPS" | head -n1) \
    --master_port=12355 \
    train.py

DeepSpeed

num_nodes: 4

resources:
  accelerators: A100:8

setup: |
  pip install deepspeed

run: |
  # Create hostfile
  echo "$SKYPILOT_NODE_IPS" | awk -v slots="$SKYPILOT_NUM_GPUS_PER_NODE" '{print $1 " slots=" slots}' > /tmp/hostfile

  deepspeed --hostfile=/tmp/hostfile \
    --num_nodes=$SKYPILOT_NUM_NODES \
    --num_gpus=$SKYPILOT_NUM_GPUS_PER_NODE \
    train.py --deepspeed ds_config.json

Ray Train

num_nodes: 4

resources:
  accelerators: A100:8

run: |
  # Head node starts Ray head
  if [ "${SKYPILOT_NODE_RANK}" == "0" ]; then
    ray start --head --port=6379
    # Wait for workers
    sleep 30
    python train_ray.py
  else
    ray start --address=$(echo "$SKYPILOT_NODE_IPS" | head -n1):6379
  fi

Sky Serve Advanced

Multi-replica serving

service:
  readiness_probe:
    path: /health
    initial_delay_seconds: 60
    period_seconds: 10

  replica_policy:
    min_replicas: 2
    max_replicas: 20
    target_qps_per_replica: 5.0
    upscale_delay_seconds: 60
    downscale_delay_seconds: 300

  load_balancing_policy: round_robin  # or least_connections

Blue-green deployment

# Deploy new version
sky serve up -n my-service-v2 service_v2.yaml

# Test new version
curl "$(sky serve status my-service-v2 --endpoint)/health"

# Switch traffic (update DNS/load balancer)
# Then terminate old version
sky serve down my-service-v1

Service with multiple accelerator options

service:
  replica_policy:
    min_replicas: 1
    max_replicas: 5

resources:
  accelerators:
    L40S: 1
    A100: 1
    A10G: 1
  any_of:
    - cloud: aws
    - cloud: gcp

Cost Optimization

Spot instance strategies

resources:
  accelerators: A100:8
  use_spot: true
  job_recovery:
    strategy: FAILOVER  # or EAGER_NEXT_REGION

# Always checkpoint for spot jobs
file_mounts:
  /checkpoints:
    name: spot-checkpoints
    store: s3
    mode: MOUNT_CACHED

Reserved instance hints

resources:
  accelerators: A100:8
  # SkyPilot considers reserved instances in cost calculation

Budget constraints

# Dry run to see cost estimate
sky launch task.yaml --dryrun

# Set max cluster cost (future feature)
# sky launch task.yaml --max-cost-per-hour 50

Kubernetes Integration

Using existing clusters

# Configure kubeconfig
export KUBECONFIG=~/.kube/config

# Verify
sky check kubernetes

Pod configuration

resources:
  cloud: kubernetes
  accelerators: A100:1

config:
  kubernetes:
    pod_config:
      spec:
        runtimeClassName: nvidia
        tolerations:
          - key: "nvidia.com/gpu"
            operator: "Exists"
            effect: "NoSchedule"

Multi-cluster

resources:
  any_of:
    - cloud: kubernetes
      infra: cluster1
    - cloud: kubernetes
      infra: cluster2
    - cloud: aws

API Server Deployment

Team setup

# Start API server
sky api serve --host 0.0.0.0 --port 8000

# Connect clients
sky api login --endpoint https://your-server:8000

Authentication

# Create service account
sky api create-service-account my-service

# Use token in CI/CD
export SKYPILOT_API_TOKEN=...
sky launch task.yaml

Advanced CLI Patterns

Parallel cluster operations

# Launch multiple clusters in parallel
for i in {1..10}; do
  sky launch -c cluster-$i task.yaml --detach &
done
wait

Batch job submission

# Submit many jobs
for config in configs/*.yaml; do
  name=$(basename "$config" .yaml)
  sky jobs launch -n "$name" "$config"
done

# Monitor all jobs
sky jobs queue

Conditional execution

run: |
  # Only run on head node
  if [ "${SKYPILOT_NODE_RANK}" == "0" ]; then
    python main.py
  else
    python worker.py
  fi

Environment Management

Environment variables

envs:
  WANDB_PROJECT: my-project
  HF_TOKEN: $HF_TOKEN  # Inherit from local
  CUDA_VISIBLE_DEVICES: "0,1,2,3"

# Secrets (hidden in logs)
secrets:
  - WANDB_API_KEY
  - HF_TOKEN

Config overrides

config:
  # Override global config
  jobs:
    controller:
      resources:
        memory: 32

Monitoring and Observability

Log streaming

# Stream logs
sky logs mycluster

# Follow specific job
sky logs mycluster 1

# Managed job logs
sky jobs logs my-job --follow

Integration with W&B/MLflow

envs:
  WANDB_API_KEY: $WANDB_API_KEY
  WANDB_PROJECT: my-project

run: |
  wandb login $WANDB_API_KEY
  python train.py --wandb

Debugging

SSH access

# SSH to head node
ssh mycluster

# SSH to worker node
ssh mycluster-worker1

# Port forwarding
ssh -L 8080:localhost:8080 mycluster

Interactive debugging

# Launch interactive cluster
sky launch -c debug --gpus A100:1

# SSH and debug
ssh debug

Job inspection

# View job queue
sky queue mycluster

# Cancel specific job
sky cancel mycluster 1

# View job details
sky logs mycluster 1

SkyPilot Troubleshooting Guide

Installation Issues

Cloud credentials not found

Error: sky check shows clouds as disabled

Solutions:

# AWS
aws configure
# Verify: aws sts get-caller-identity

# GCP
gcloud auth application-default login
# Verify: gcloud auth list

# Azure
az login
az account set -s <subscription-id>

# Kubernetes
export KUBECONFIG=~/.kube/config
kubectl get nodes

# Re-check after configuration
sky check

Permission errors

Error: PermissionError or AccessDenied

Solutions:

# AWS: Ensure IAM permissions include EC2, S3, IAM
# Required policies: AmazonEC2FullAccess, AmazonS3FullAccess, IAMFullAccess

# GCP: Ensure roles include Compute Admin, Storage Admin
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="user:email@example.com" \
  --role="roles/compute.admin"

# Azure: Ensure Contributor role on subscription
az role assignment create \
  --assignee email@example.com \
  --role Contributor \
  --scope /subscriptions/SUBSCRIPTION_ID

Cluster Launch Issues

Quota exceeded

Error: Quota exceeded for resource

Solutions:

# Try different region
resources:
  accelerators: A100:8
  any_of:
    - cloud: gcp
      region: us-west1
    - cloud: gcp
      region: europe-west4
    - cloud: aws
      region: us-east-1

# Or request quota increase from cloud provider

# List GPU offerings and prices on GCP (quota itself is shown in the cloud provider's console)
sky show-gpus --cloud gcp

GPU not available

Error: No resources available in region

Solutions:

# Use fallback accelerators
resources:
  accelerators:
    H100: 8
    A100-80GB: 8
    A100: 8
  any_of:
    - cloud: gcp
    - cloud: aws
    - cloud: azure

# Check GPU availability
sky show-gpus A100
sky show-gpus --cloud aws

Instance type not found

Error: Instance type 'xyz' not found

Solutions:

# Let SkyPilot choose instance automatically
resources:
  accelerators: A100:8
  cpus: 96+
  memory: 512+
  # Don't specify instance_type unless necessary

Cluster stuck in INIT

Error: Cluster stays in INIT state

Solutions:

# Check cluster logs
sky logs mycluster --status

# SSH and check manually
ssh mycluster
journalctl -u sky-supervisor

# Terminate and retry
sky down mycluster
sky launch -c mycluster task.yaml

Setup Command Issues

Setup script fails

Error: Setup commands fail during provisioning

Solutions:

# Add error handling and retries
setup: |
  set -e  # Exit on error

  # Retry pip installs
  for i in {1..3}; do
    pip install torch transformers && break
    echo "Retry $i..."
    sleep 10
  done

  # Verify installation
  python -c "import torch; print(torch.__version__)"
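
The same retry idea can be used inside Python setup or data-download scripts. This is a minimal sketch with a hypothetical `retry` helper. Unlike the shell loop above, it raises an error once every attempt has failed, so the failure is not silently ignored.

```python
import time


def retry(action, attempts=3, delay_seconds=10.0, sleep=time.sleep):
    """Call action() until it succeeds, waiting between failed attempts."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as error:  # Broad on purpose: any failure triggers a retry.
            last_error = error
            print(f"Retry {attempt}...")
            if attempt < attempts:
                sleep(delay_seconds)
    raise RuntimeError(f"failed after {attempts} attempts") from last_error


# Demonstration with a stand-in action that fails twice before succeeding.
state = {"calls": 0}


def flaky_install():
    state["calls"] += 1
    if state["calls"] < 3:
        raise OSError("transient network error")
    return "installed"


# The no-op sleep keeps the demonstration instant; omit it to wait for real.
result = retry(flaky_install, attempts=3, sleep=lambda seconds: None)
print(result)
```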

Conda environment issues

Error: Conda not found or environment issues

Solutions:

setup: |
  # Initialize conda for bash
  source ~/.bashrc

  # Or load conda's shell hook from the install path, then activate
  source ~/miniconda3/etc/profile.d/conda.sh
  conda create -n myenv python=3.10 -y
  conda activate myenv

CUDA version mismatch

Error: CUDA driver version is insufficient

Solutions:

setup: |
  # Install specific CUDA version
  pip install torch==2.1.0+cu121 -f https://download.pytorch.org/whl/torch_stable.html

  # Verify CUDA
  python -c "import torch; print(torch.cuda.is_available())"

Distributed Training Issues

Nodes can't communicate

Error: Connection refused between nodes

Solutions:

run: |
  # Debug: Print all node IPs
  echo "All nodes: $SKYPILOT_NODE_IPS"
  echo "My rank: $SKYPILOT_NODE_RANK"

  # Wait for all nodes to be ready
  sleep 30

  # Use correct master address
  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  echo "Master: $MASTER_ADDR"

torchrun fails

Error: torch.distributed errors

Solutions:

run: |
  # Ensure correct environment variables
  export NCCL_DEBUG=INFO
  export NCCL_IB_DISABLE=1  # Try if InfiniBand issues

  torchrun \
    --nnodes=$SKYPILOT_NUM_NODES \
    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    --node_rank=$SKYPILOT_NODE_RANK \
    --master_addr=$(echo "$SKYPILOT_NODE_IPS" | head -n1) \
    --master_port=12355 \
    train.py

DeepSpeed hostfile errors

Error: Invalid hostfile or connection errors

Solutions:

run: |
  # Create proper hostfile
  echo "$SKYPILOT_NODE_IPS" | while read ip; do
    echo "$ip slots=$SKYPILOT_NUM_GPUS_PER_NODE"
  done > /tmp/hostfile

  cat /tmp/hostfile  # Debug

  deepspeed --hostfile=/tmp/hostfile train.py
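
If the hostfile is generated from a Python launcher instead of the shell, the same format can be produced as follows. This is a small sketch with a hypothetical `build_hostfile` helper; it takes the raw newline-separated value of `SKYPILOT_NODE_IPS` and rejects empty input so that a missing variable fails early.

```python
def build_hostfile(node_ips, slots_per_node):
    """Return DeepSpeed hostfile text: one '<ip> slots=<n>' line per node."""
    ips = [ip.strip() for ip in node_ips.splitlines() if ip.strip()]
    if not ips:
        raise ValueError("no node IPs found; is SKYPILOT_NODE_IPS set?")
    return "".join(f"{ip} slots={slots_per_node}\n" for ip in ips)


# Made-up IPs for a two-node job with 8 GPUs per node.
hostfile_text = build_hostfile("10.0.0.1\n10.0.0.2\n", 8)
print(hostfile_text, end="")
```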

File Mount Issues

Mount fails

Error: Failed to mount storage

Solutions:

# Verify bucket exists and credentials are valid
file_mounts:
  /data:
    source: s3://my-bucket/data
    mode: MOUNT

# Check bucket access
# aws s3 ls s3://my-bucket/

Slow file access

Problem: Reading from mount is very slow

Solutions:

# Use COPY mode for small datasets
file_mounts:
  /data:
    source: s3://bucket/data
    mode: COPY  # Pre-fetch to local disk

# Use MOUNT_CACHED for outputs
file_mounts:
  /outputs:
    name: outputs
    store: s3
    mode: MOUNT_CACHED  # Cached writes

Storage not persisting

Error: Data lost after cluster restart

Solutions:

# Use named storage (persists across clusters)
file_mounts:
  /persistent:
    name: my-persistent-storage
    store: s3
    mode: MOUNT

# Data in ~/sky_workdir is NOT persisted
# Always use file_mounts for persistent data

Managed Job Issues

Job keeps failing

Error: Job fails and doesn't recover

Solutions:

# Enable recovery and retries (both are set under resources.job_recovery)
resources:
  use_spot: true
  job_recovery:
    strategy: FAILOVER
    # Also restart on user-code errors, up to 5 times
    max_restarts_on_errors: 5

# Implement checkpointing
run: |
  python train.py \
    --checkpoint-dir /checkpoints \
    --resume-from-latest

Job stuck in pending

Error: Job stays in PENDING state

Solutions:

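# Check job status; the jobs controller appears as a cluster in sky status
sky jobs queue
sky status

# View controller logs for a job
sky jobs logs --controller <job-id>

# If the controller cluster is stuck, tear it down once no jobs are running
# (sky down <controller-cluster-name>); the next sky jobs launch recreates it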

Checkpoint not resuming

Error: Training restarts from beginning

Solutions:

file_mounts:
  /checkpoints:
    name: training-checkpoints
    store: s3
    mode: MOUNT_CACHED

run: |
  # Check for existing checkpoint
  if [ -d "/checkpoints/latest" ]; then
    RESUME_FLAG="--resume /checkpoints/latest"
  else
    RESUME_FLAG=""
  fi

  python train.py $RESUME_FLAG --checkpoint-dir /checkpoints

Sky Serve Issues

Service not accessible

Error: Cannot reach service endpoint

Solutions:

# Check service status
sky serve status my-service

# View replica logs
sky serve logs my-service

# Print the service endpoint, then request the readiness path (e.g. /health) against it
sky serve status my-service --endpoint

Replicas keep crashing

Error: Replicas fail health checks

Solutions:

service:
  readiness_probe:
    path: /health
    initial_delay_seconds: 120  # Increase for slow model loading
    period_seconds: 30
    timeout_seconds: 10

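run: |
  # Ensure the served app exposes /health (minimal sketch; assumes fastapi and uvicorn are installed)
  cat > app.py <<'EOF'
  from fastapi import FastAPI
  app = FastAPI()

  @app.get('/health')
  def health():
      return {'status': 'ok'}
  EOF
  uvicorn app:app --host 0.0.0.0 --port 8000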

Autoscaling not working

Problem: Service doesn't scale up/down

Solutions:

service:
  replica_policy:
    min_replicas: 1
    max_replicas: 10
    target_qps_per_replica: 2.0
    upscale_delay_seconds: 30   # Faster scale up
    downscale_delay_seconds: 60  # Faster scale down

# Monitor metrics
# sky serve status my-service

SSH and Access Issues

Cannot SSH to cluster

Error: Connection refused or timeout

Solutions:

# Verify cluster is running
sky status

# Try with verbose output
ssh -v mycluster

# Check SSH key
ls -la ~/.ssh/sky-key*

# Regenerate SSH key if needed
sky launch -c test --dryrun  # Regenerates key

Port forwarding fails

Error: Cannot forward ports

Solutions:

# Correct syntax
ssh -L 8080:localhost:8080 mycluster

# For Jupyter
ssh -L 8888:localhost:8888 mycluster

# Multiple ports
ssh -L 8080:localhost:8080 -L 6006:localhost:6006 mycluster

Cost and Billing Issues

Unexpected charges

Problem: Higher than expected costs

Solutions:

# Always terminate unused clusters
sky down --all

# Set autostop
sky autostop mycluster -i 30 --down

# Use spot instances
resources:
  use_spot: true

Spot instance preempted

Error: Instance terminated unexpectedly

Solutions:

# Use managed jobs for automatic recovery
# sky jobs launch instead of sky launch

resources:
  use_spot: true
  job_recovery:
    strategy: FAILOVER  # Auto-failover to another region/cloud

# Always checkpoint frequently when using spot

Debugging Commands

View cluster state

# Cluster status
sky status
sky status -a  # Show all details

# Cluster resources
sky show-gpus

# Cloud credentials
sky check

View logs

# Task logs
sky logs mycluster
sky logs mycluster 1  # Specific job

# Managed job logs
sky jobs logs my-job
sky jobs logs my-job --follow

# Service logs
sky serve logs my-service

Inspect cluster

# SSH to cluster
ssh mycluster

# Check GPU status
nvidia-smi

# Check processes
ps aux | grep python

# Check disk space
df -h

Common Error Messages

Error	Cause	Solution
`No launchable resources`	No available instances	Try different region/cloud
`Quota exceeded`	Cloud quota limit	Request increase or use different cloud
`Setup failed`	Script error	Check logs, add error handling
`Connection refused`	Network/firewall	Check security groups, wait for init
`CUDA OOM`	Out of GPU memory	Use larger GPU or reduce batch size
`Spot preempted`	Spot instance reclaimed	Use managed jobs for auto-recovery
`Mount failed`	Storage access issue	Check credentials and bucket exists
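
The table above can be turned into a small triage helper for scanning log output. The Python sketch below uses only the error strings and suggestions from the table; the `suggest_fixes` helper is hypothetical, and real SkyPilot messages may be worded differently, so treat the substring match as a starting point.

```python
# Error substring -> suggested action, copied from the table above.
KNOWN_ERRORS = {
    "No launchable resources": "Try different region/cloud",
    "Quota exceeded": "Request increase or use different cloud",
    "Setup failed": "Check logs, add error handling",
    "Connection refused": "Check security groups, wait for init",
    "CUDA OOM": "Use larger GPU or reduce batch size",
    "Spot preempted": "Use managed jobs for auto-recovery",
    "Mount failed": "Check credentials and bucket exists",
}


def suggest_fixes(log_text):
    """Return suggestions for every known error string found in log_text."""
    lowered = log_text.lower()
    return [fix for pattern, fix in KNOWN_ERRORS.items() if pattern.lower() in lowered]


suggestions = suggest_fixes("Error: Quota exceeded for resource")
print(suggestions)
```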

Getting Help

1. Documentation: https://docs.skypilot.co
2. GitHub Issues: https://github.com/skypilot-org/skypilot/issues
3. Slack: https://slack.skypilot.co
4. Examples: https://github.com/skypilot-org/skypilot/tree/master/examples

Reporting Issues

Include:

SkyPilot version: sky --version
Python version: python --version
Cloud provider and region
Full error traceback
Task YAML (sanitized)
Output of sky check

How it compares

Use skypilot-multi-cloud-orchestration for cross-cloud GPU failover YAML; use lambda-labs-gpu-cloud when training stays on Lambda Labs instances with manual torchrun DDP.

FAQ

How does SkyPilot choose between cloud providers?

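skypilot-multi-cloud-orchestration uses resources.any_of YAML lists to name candidate clouds, such as GCP us-central1, AWS us-west-2, and Azure westus2. SkyPilot launches on the cheapest candidate with capacity and falls through to the next when accelerators like A100:8 are unavailable; use ordered instead of any_of to enforce a strict priority order.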

Can SkyPilot fall back from Kubernetes to public cloud?

skypilot-multi-cloud-orchestration documents any_of blocks listing kubernetes first, then aws and gcp, so GPU jobs retry on public clouds when cluster capacity is exhausted.

Is Skypilot Multi Cloud Orchestration safe to install?

skills.sh reports 1 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.

Tags: Cloud & Infrastructure, automation, llm