
Lambda Labs GPU Cloud
Spin up Lambda Labs GPU instances and wire PyTorch distributed data parallel training across multiple nodes with torchrun.
Overview
Lambda Labs GPU Cloud is an agent skill for the Build phase that documents multi-node PyTorch DDP setup and torchrun launch commands on Lambda Labs GPU instances.
Install
npx skills add https://github.com/orchestra-research/ai-research-skills --skill lambda-labs-gpu-cloud
What is this skill?
- PyTorch DDP helper with RANK, WORLD_SIZE, and LOCAL_RANK from the launcher environment
- Checkpoint saves restricted to rank 0 to avoid corrupt multi-writer artifacts
- torchrun recipes for 2+ nodes with MASTER_ADDR, MASTER_PORT, nnodes, and node_rank
- NCCL backend initialization pattern for GPU clusters on Lambda Labs
- Documents a 2-node torchrun layout with 8 processes per node (nproc_per_node=8) in the launch example
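As a rough illustration of the 2-node, 8-process layout in the last bullet, the sketch below works out the world size and the global rank of each process. It assumes torchrun's static launch layout, where a process's global rank equals node_rank * nproc_per_node + local_rank; `rank_layout` is a hypothetical helper and not part of the skill.

```python
def rank_layout(nnodes: int, nproc_per_node: int):
    """Return (world_size, table); table maps (node_rank, local_rank) to the global rank.

    Assumption: static layout with the same number of processes on every node.
    """
    world_size = nnodes * nproc_per_node
    table = {
        (node_rank, local_rank): node_rank * nproc_per_node + local_rank
        for node_rank in range(nnodes)
        for local_rank in range(nproc_per_node)
    }
    return world_size, table


world_size, table = rank_layout(nnodes=2, nproc_per_node=8)
print(world_size)      # 16
print(table[(1, 0)])   # 8: the first process on node 1 follows the 8 ranks on node 0
```

Only one process in the whole job has global rank 0, which is why restricting checkpoint writes to rank 0 yields a single writer.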
Adoption & trust: 1 install on skills.sh; 9.4k GitHub stars; 2/3 security scanners passed (skills.sh audits).
What problem does it solve?
You rented multiple Lambda Labs GPU nodes but do not have a verified distributed training bootstrap with correct environment variables and rank-0 checkpoints.
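One way to catch incorrect environment variables early is to validate them before initializing the process group. The sketch below is an assumption about how such a check could look, not code from the skill: `check_launcher_env` is a hypothetical helper, and the variable list reflects what torchrun normally exports to each worker.

```python
import os

# Variables torchrun normally exports to every worker process.
REQUIRED_VARS = ("RANK", "WORLD_SIZE", "LOCAL_RANK", "MASTER_ADDR", "MASTER_PORT")


def check_launcher_env(env=None):
    """Validate launcher variables and return (rank, world_size, local_rank)."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if name not in env]
    if missing:
        raise RuntimeError("missing launcher variables: " + ", ".join(missing))

    rank = int(env["RANK"])
    world_size = int(env["WORLD_SIZE"])
    local_rank = int(env["LOCAL_RANK"])
    if not 0 <= rank < world_size:
        raise ValueError(f"RANK={rank} is outside [0, {world_size})")
    if local_rank < 0:
        raise ValueError(f"LOCAL_RANK={local_rank} must be non-negative")
    return rank, world_size, local_rank
```

Calling `check_launcher_env()` at the top of the training script fails fast with a readable message if the script was started with plain `python` instead of torchrun.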
Who is it for?
Indie ML builders prototyping or fine-tuning on 2+ Lambda Labs GPU instances who already know PyTorch basics.
Skip if: You only need single-GPU notebooks, or you want a managed orchestrator (Slurm, Kubernetes) instead of manual torchrun across VMs.
When should I use this skill?
You are launching or debugging multi-node PyTorch training on Lambda Labs GPU instances.
What do I get? / Deliverables
You get runnable DDP training code and matching multi-node torchrun invocations so gradients sync across nodes and checkpoints land in one place.
- DDP training module with setup_distributed and rank-0 checkpoint saves
- Per-node torchrun launch commands with MASTER_ADDR and MASTER_PORT
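To show how the per-node commands differ only in `--node_rank`, here is a small sketch that generates one torchrun command string per node. `torchrun_commands` is a hypothetical helper, and the address `10.0.0.1` is a placeholder for node 0's private IP; the flags, port, and script name are the ones from the skill's launch example.

```python
import shlex


def torchrun_commands(master_addr, nnodes=2, nproc_per_node=8,
                      master_port=29500, script="train_multi_node.py"):
    """Build one torchrun command string per node; only --node_rank varies."""
    commands = []
    for node_rank in range(nnodes):
        argv = [
            "torchrun",
            f"--nnodes={nnodes}",
            f"--nproc_per_node={nproc_per_node}",
            f"--node_rank={node_rank}",
            f"--master_addr={master_addr}",
            f"--master_port={master_port}",
            script,
        ]
        commands.append(shlex.join(argv))
    return commands


# Placeholder address; substitute the private IP of node 0.
for command in torchrun_commands("10.0.0.1"):
    print(command)
```

Generating the commands from one set of parameters keeps `--nnodes`, the master address, and the port identical on every node, which is the usual source of rendezvous hangs when they are typed by hand.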
Journey fit
Canonical shelf is Build because the skill teaches how to implement and launch multi-node training jobs while you are developing ML backends, not how to respond to day-two incidents. Backend fits because distributed training setup, NCCL process groups, checkpoints, and node launcher scripts all live in your training codebase.
How it compares
Infrastructure runbook for bare multi-node torchrun—not a serverless training SaaS or a hyperparameter search framework.
Common Questions / FAQ
Who is lambda-labs-gpu-cloud for?
Solo builders and small teams training PyTorch models on Lambda Labs who need distributed data parallel across several GPU instances without rewriting boilerplate from scratch.
When should I use lambda-labs-gpu-cloud?
During Build when standing up multi-node fine-tuning or pretraining; during Operate when re-launching the same cluster topology after scaling instance counts; and during Ship prep when you need a reproducible launcher script before locking training configs.
Is lambda-labs-gpu-cloud safe to install?
Review the Security Audits panel on this Prism page and treat any skill that suggests shell and network access as requiring your own account hygiene and secret handling on Lambda Labs.
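As one example of basic secret handling, the sketch below reads the Lambda Labs API key from the environment instead of hard-coding it in a script. The variable name `LAMBDA_API_KEY` and the helper `load_api_key` are assumptions made for illustration; neither is defined by the skill.

```python
import os


def load_api_key(env=None, name="LAMBDA_API_KEY"):
    """Return the API key from the environment, or fail with a clear message.

    `name` is a hypothetical variable name; use whatever your setup defines.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it instead of hard-coding the key")
    return key
```

Keeping the key out of source files means launcher scripts can be committed or shared without leaking account credentials.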
SKILL.md
# Lambda Labs Advanced Usage Guide

## Multi-Node Distributed Training

### PyTorch DDP across nodes

```python
# train_multi_node.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_distributed():
    # Environment variables set by launcher
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    local_rank = int(os.environ["LOCAL_RANK"])

    dist.init_process_group(
        backend="nccl",
        rank=rank,
        world_size=world_size
    )

    torch.cuda.set_device(local_rank)
    return rank, world_size, local_rank

def main():
    rank, world_size, local_rank = setup_distributed()

    # MyModel, num_epochs, dataloader, and train_one_epoch are placeholders
    # that your training code must define.
    model = MyModel().cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # Training loop with synchronized gradients
    for epoch in range(num_epochs):
        train_one_epoch(model, dataloader)

        # Save checkpoint on rank 0 only
        if rank == 0:
            torch.save(model.module.state_dict(), f"checkpoint_{epoch}.pt")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

### Launch on multiple instances

```bash
# On Node 0 (master)
export MASTER_ADDR=<NODE0_PRIVATE_IP>
export MASTER_PORT=29500

torchrun \
    --nnodes=2 \
    --nproc_per_node=8 \
    --node_rank=0 \
    --master_addr=$MASTER_ADDR \
    --master_port=$MASTER_PORT \
    train_multi_node.py

# On Node 1
export MASTER_ADDR=<NODE0_PRIVATE_IP>
export MASTER_PORT=29500

torchrun \
    --nnodes=2 \
    --nproc_per_node=8 \
    --node_rank=1 \
    --master_addr=$MASTER_ADDR \
    --master_port=$MASTER_PORT \
    train_multi_node.py
```

### FSDP for large models

```python
import functools

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers.models.llama.modeling_llama import LlamaDecoderLayer

# Wrap policy for transformer models
auto_wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={LlamaDecoderLayer}
)

# `model` and `local_rank` come from your own setup code
# (for example, setup_distributed() above).
model = FSDP(
    model,
    auto_wrap_policy=auto_wrap_policy,
    mixed_precision=MixedPrecision(
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.bfloat16,
        buffer_dtype=torch.bfloat16,
    ),
    device_id=local_rank,
)
```

### DeepSpeed ZeRO

ds_config.json:

```json
{
    "train_batch_size": 64,
    "gradient_accumulation_steps": 4,
    "fp16": {"enabled": true},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu"},
        "offload_param": {"device": "cpu"}
    }
}
```

```bash
# Launch with DeepSpeed
deepspeed --num_nodes=2 \
    --num_gpus=8 \
    --hostfile=hostfile.txt \
    train.py --deepspeed ds_config.json
```

### Hostfile for multi-node

```bash
# hostfile.txt
node0_ip slots=8
node1_ip slots=8
```

## API Automation

### Auto-launch training jobs

```python
import os
import time
import lambda_cloud_client
from lambda_cloud_client.models import LaunchInstanceRequest

class LambdaJobManager:
    def __init__(self, api_key: str):
        self.config = lambda_cloud_client.Configuration(
            host="https://cloud.lambdalabs.com/api/v1",
            access_token=api_key
        )

    def find_available_gpu(self, gpu_types: list[str], regions: list[str] = None):
        """Find first available GPU type across regions."""
        with lambda_cloud_client.ApiClient(self.config) as client:
            api = lambda_cloud_client.DefaultApi(client)
            types = api.instance_types()

            for gpu_type in gpu_types:
                if gpu_type in types.data:
                    info = types.data[gpu_type]
                    for region in info.regions_with_capacity_available:
                        if regions is None or region.name in regions:
                            return gpu_type, region.name

        return None, None

    def launch_and_wait(self, instance_type: str, region: str,