
SkyPilot Multi-Cloud Orchestration
Run GPU training jobs and fail over across GCP, AWS, Azure, and Kubernetes using SkyPilot YAML patterns from your agent.
Overview
skypilot-multi-cloud-orchestration is an agent skill most often used in Operate (also Build/integrations) that encodes SkyPilot multi-cloud GPU job YAML, spot failover, and controller scaling.
Install
npx skills add https://github.com/orchestra-research/ai-research-skills --skill skypilot-multi-cloud-orchestration
What is this skill?
- Cloud fallback chains with any_of for GCP, AWS, Azure, and Kubernetes
- Wildcard regions (e.g. us-*) and instance-type or CPU/memory/accelerator constraints
- Production managed jobs with spot recovery FAILOVER and max_restarts_on_errors
- Disk tier, network tier, and controller memory scaling for hundreds of jobs
- Static credential guidance for long-lived SkyPilot controllers
- Example configs reference up to 8x A100 or H100 accelerators
- max_restarts_on_errors: 3 in sample production job YAML
Adoption & trust: 1 install on skills.sh; 9.4k GitHub stars; 1/3 security scanners passed (skills.sh audits).
What problem does it solve?
You have a training script but one cloud region is out of capacity or spot instances keep preempting without a documented failover path.
Who is it for?
Indie researchers shipping fine-tunes or batch inference who already use SkyPilot and need copy-paste multi-cloud and managed-job hardening.
Skip if: You are a beginner running a single local notebook GPU, or your team is standardized on one cloud with no spot or Kubernetes needs.
When should I use this skill?
User needs SkyPilot multi-cloud selection, wildcard regions, Kubernetes fallback, managed jobs, spot recovery, or controller scaling patterns.
What do I get? / Deliverables
You deploy SkyPilot job specs with ordered cloud fallbacks, spot recovery, and controller sizing suited to production-scale GPU queues.
- Multi-cloud SkyPilot resource and job YAML snippets
- Production managed-job template with spot and restart policy
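The fallback idea behind these deliverables can be sketched in a few lines of Python. This is an illustration only, not SkyPilot's provisioning or optimizer logic: `Candidate`, `first_available`, and the `has_capacity` callback are hypothetical names, and the capacity check is a stub rather than a real cloud query.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """One entry of a fallback list (hypothetical helper type)."""
    cloud: str
    region: Optional[str] = None


def first_available(
    candidates: Sequence[Candidate],
    has_capacity: Callable[[Candidate], bool],
) -> Optional[Candidate]:
    """Return the first candidate that reports capacity, or None."""
    for candidate in candidates:
        if has_capacity(candidate):
            return candidate
    return None


# Candidates mirror the clouds and regions named in the skill's sample YAML.
CHAIN = [
    Candidate("gcp", "us-central1"),
    Candidate("aws", "us-west-2"),
    Candidate("azure", "westus2"),
]

if __name__ == "__main__":
    # Stubbed capacity check: pretend GCP has no free accelerators.
    chosen = first_available(CHAIN, lambda c: c.cloud != "gcp")
    print(chosen)
```

With the stub above, the GCP entry is skipped and the AWS entry is selected; if no candidate reports capacity, the function returns `None` and the caller decides whether to wait or give up.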
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Multi-cloud job orchestration is where trained models actually land in production infra, so the canonical shelf is Operate/infra even though the configs are authored during Build. The Infra subphase covers cluster choice, spot recovery, controllers, and the credential patterns SkyPilot documents for managed jobs.
Where it fits
Author a SkyPilot YAML with GPU training resources and an any_of cloud list before the first sky launch.
Turn on spot_recovery FAILOVER and max_restarts_on_errors after preempted A100 jobs blocked an overnight fine-tune.
Bump controller memory when the job queue crosses hundreds of concurrent SkyPilot managed jobs.
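For the spot-recovery scenario above, a restart budget such as `max_restarts_on_errors: 3` can be pictured as a bounded retry loop. The following is a minimal sketch under that assumption, not SkyPilot's implementation: `run_with_restarts` and `make_flaky_job` are hypothetical helpers, and how SkyPilot itself counts restarts or separates preemptions from user-code errors is not modeled here.

```python
from typing import Callable


def run_with_restarts(job: Callable[[], None], max_restarts_on_errors: int) -> int:
    """Run `job`, restarting after failures; return the number of restarts used.

    Re-raises the last error once the restart budget is exhausted.
    """
    restarts = 0
    while True:
        try:
            job()
            return restarts
        except Exception:
            if restarts >= max_restarts_on_errors:
                raise
            restarts += 1


def make_flaky_job(failures: int) -> Callable[[], None]:
    """Build a stub job that raises `failures` times, then succeeds."""
    remaining = {"count": failures}

    def job() -> None:
        if remaining["count"] > 0:
            remaining["count"] -= 1
            raise RuntimeError("simulated job failure")

    return job


if __name__ == "__main__":
    # A job that fails twice still finishes within a budget of three restarts.
    print(run_with_restarts(make_flaky_job(2), max_restarts_on_errors=3))
```

A budget of 3 allows up to four attempts in total (the first run plus three restarts); a job that fails a fourth time surfaces its error to the caller.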
How it compares
Complements generic cloud CLI skills with SkyPilot-specific orchestration YAML rather than hand-rolled boto3/gcloud scripts per provider.
Common Questions / FAQ
Who is skypilot-multi-cloud-orchestration for?
Solo AI builders and small research teams orchestrating GPU workloads who want agent-assisted SkyPilot configs across several clouds.
When should I use skypilot-multi-cloud-orchestration?
Use in Operate when hardening training infra with spot FAILOVER and controller scaling, and in Build when first wiring SkyPilot resources and any_of cloud chains for H100 or A100 jobs.
Is skypilot-multi-cloud-orchestration safe to install?
It implies cloud credentials and spend; review the Security Audits panel on this Prism page and scope IAM keys before agents launch costly GPU fleets.
SKILL.md
# SkyPilot Advanced Usage Guide

## Multi-Cloud Strategies

### Cloud selection patterns

```yaml
# Prefer specific clouds in order
resources:
  accelerators: A100:8
  any_of:
    - cloud: gcp
      region: us-central1
    - cloud: aws
      region: us-west-2
    - cloud: azure
      region: westus2
```

### Wildcard regions

```yaml
resources:
  cloud: aws
  region: us-*  # Any US region
  accelerators: A100:8
```

### Kubernetes + Cloud fallback

```yaml
resources:
  accelerators: A100:8
  any_of:
    - cloud: kubernetes
    - cloud: aws
    - cloud: gcp
```

## Advanced Resource Configuration

### Instance type constraints

```yaml
resources:
  instance_type: p4d.24xlarge  # Specific instance
  # OR
  cpus: 32+
  memory: 128+
  accelerators: A100:8
```

### Disk configuration

```yaml
resources:
  disk_size: 500  # GB
  disk_tier: best  # low, medium, high, ultra, best
```

### Network tier

```yaml
resources:
  network_tier: best  # High-performance networking
```

## Production Managed Jobs

### Job configuration

```yaml
name: production-training

resources:
  accelerators: H100:8
  use_spot: true
  spot_recovery: FAILOVER

# Retry configuration
max_restarts_on_errors: 3
```

### Controller scaling

For large-scale deployments (hundreds of jobs):

```bash
# Increase controller memory
sky jobs launch --controller-resources memory=32
```

### Static credentials

Use non-expiring credentials for controllers:

```bash
# AWS: Use IAM role or long-lived access keys
# GCP: Use service account JSON key
# Azure: Use service principal
```

## Advanced File Mounts

### Git repository workdir

```yaml
workdir:
  url: https://github.com/user/repo.git
  ref: main
# For private repos, set GIT_TOKEN env var
```

### Multiple storage backends

```yaml
file_mounts:
  /data/s3:
    source: s3://my-bucket/data
    mode: MOUNT
  /data/gcs:
    source: gs://my-bucket/data
    mode: MOUNT
  /outputs:
    name: training-outputs
    store: s3
    mode: MOUNT_CACHED
```

### Rsync exclude patterns

```yaml
workdir: .
# Use .skyignore or .gitignore for excludes
```

Create `.skyignore`:

```
__pycache__/
*.pyc
.git/
.env
node_modules/
```

## Distributed Training Patterns

### PyTorch DDP

```yaml
num_nodes: 4

resources:
  accelerators: A100:8

run: |
  torchrun \
    --nnodes=$SKYPILOT_NUM_NODES \
    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    --node_rank=$SKYPILOT_NODE_RANK \
    --master_addr=$(echo "$SKYPILOT_NODE_IPS" | head -n1) \
    --master_port=12355 \
    train.py
```

### DeepSpeed

```yaml
num_nodes: 4

resources:
  accelerators: A100:8

setup: |
  pip install deepspeed

run: |
  # Create hostfile
  echo "$SKYPILOT_NODE_IPS" | awk '{print $1 " slots=8"}' > /tmp/hostfile

  deepspeed --hostfile=/tmp/hostfile \
    --num_nodes=$SKYPILOT_NUM_NODES \
    --num_gpus=$SKYPILOT_NUM_GPUS_PER_NODE \
    train.py --deepspeed ds_config.json
```

### Ray Train

```yaml
num_nodes: 4

resources:
  accelerators: A100:8

run: |
  # Head node starts Ray head
  if [ "${SKYPILOT_NODE_RANK}" == "0" ]; then
    ray start --head --port=6379
    # Wait for workers
    sleep 30
    python train_ray.py
  else
    ray start --address=$(echo "$SKYPILOT_NODE_IPS" | head -n1):6379
  fi
```

## Sky Serve Advanced

### Multi-replica serving

```yaml
service:
  readiness_probe:
    path: /health
    initial_delay_seconds: 60
    period_seconds: 10
  replica_policy:
    min_replicas: 2
    max_replicas: 20
    target_qps_per_replica: 5.0
    upscale_delay_seconds: 60
    downscale_delay_seconds: 300
  load_balancing_policy: round_robin  # or least_connections
```

### Blue-green deployment

```bash
# Deploy new version
sky serve up -n my-service-v2 service_v2.yaml

# Test new version
curl https://my-service-v2.skypilot.cloud/health

# Switch traffic (update DNS/load balancer)
# Then terminate old version
sky serve down my-service-v1
```

### Service with multiple accelerator options

```yaml
service:
  replica_policy:
    min_replicas: 1
    max_replic