
Ray Train
Stand up a multi-node Ray cluster and run distributed PyTorch training with TorchTrainer when you are scaling ML jobs beyond a single machine.
Install
npx skills add https://github.com/orchestra-research/ai-research-skills --skill ray-train
What is this skill?
- Documents head vs worker roles and the shared object store (Arrow/Plasma)
- Manual `ray start --head` and worker join commands with `ray.init(address='auto')`
- TorchTrainer example with num_workers, GPU, and SPREAD placement across nodes
- `ray status` and Python APIs for inspecting cluster CPUs, GPUs, and memory
- Local multi-node workflow before moving to cloud Ray deployments
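As a rough illustration of the resource-inspection bullet above: `ray.cluster_resources()` returns a flat dict that, next to totals such as `CPU` and `GPU`, carries one `node:<ip>` key per node. The sketch below lists those addresses from such a dict. `node_ips` is a hypothetical helper, not a Ray API; it takes a plain dict so it runs without a cluster, and it assumes that keys of the form `node:__...` are internal markers that should be skipped.

```python
from typing import List, Mapping


def node_ips(resources: Mapping[str, float]) -> List[str]:
    """Return the node addresses found in a cluster resource mapping."""
    prefix = "node:"
    ips = []
    for key in resources:
        if not key.startswith(prefix):
            continue  # totals such as "CPU", "GPU", "memory"
        address = key[len(prefix):]
        # Assumption: internal marker keys start with a double underscore.
        if address.startswith("__"):
            continue
        ips.append(address)
    return sorted(ips)


# Sample dict shaped like ray.cluster_resources() output (values are illustrative).
sample = {
    "CPU": 128.0,
    "GPU": 32.0,
    "node:192.168.1.101": 1.0,
    "node:192.168.1.100": 1.0,
    "node:__internal_head__": 1.0,
}
print(node_ips(sample))  # ['192.168.1.100', '192.168.1.101']
```

On a live cluster the same helper could be called as `node_ips(ray.cluster_resources())` after `ray.init(address='auto')`.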
Adoption & trust: 1 install on skills.sh; 9.4k GitHub stars; 2/3 security scanners passed (skills.sh audits).
Journey fit
Primary fit
Distributed training setup is part of building the ML/backend stack before you ship model artifacts to production. Ray head/worker topology, ScalingConfig, and cluster init are backend infrastructure patterns, not frontend or launch work.
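One way to connect cluster init to `ScalingConfig` is to size the worker group from the resources the cluster reports. The sketch below assumes one training worker per GPU (or one per CPU on a CPU-only cluster) and an even CPU split across workers. `plan_scaling` is a hypothetical helper, not part of Ray, and its input only mimics the shape of `ray.cluster_resources()` output.

```python
from typing import Dict, Mapping


def plan_scaling(resources: Mapping[str, float]) -> Dict[str, object]:
    """Derive ScalingConfig-style arguments from a cluster resource mapping."""
    total_cpus = int(resources.get("CPU", 0))
    total_gpus = int(resources.get("GPU", 0))
    use_gpu = total_gpus > 0

    # Assumption: one worker per GPU, or one per CPU when no GPUs exist.
    num_workers = total_gpus if use_gpu else total_cpus
    if num_workers == 0:
        raise ValueError("cluster reports no CPU or GPU resources")

    # Split CPUs evenly; never request fewer than one CPU per worker.
    per_worker = {"CPU": max(1, total_cpus // num_workers)}
    if use_gpu:
        per_worker["GPU"] = 1

    return {
        "num_workers": num_workers,
        "use_gpu": use_gpu,
        "resources_per_worker": per_worker,
    }


# Illustrative totals: 128 CPUs and 32 GPUs across the cluster.
plan = plan_scaling({"CPU": 128.0, "GPU": 32.0, "memory": 549755813888})
print(plan)
```

The returned keys correspond to the `num_workers`, `use_gpu`, and `resources_per_worker` arguments of `ScalingConfig`, so the dict could be unpacked into it; the placement strategy is left to the caller.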
Common Questions / FAQ
Is Ray Train safe to install?
skills.sh reports 2 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.
SKILL.md
# Ray Train Multi-Node Setup

## Ray Cluster Architecture

Ray Train runs on a **Ray cluster** with one head node and multiple worker nodes.

**Components**:
- **Head node**: Coordinates workers, runs scheduling
- **Worker nodes**: Execute training tasks
- **Object store**: Per-node shared-memory store (Plasma, originally from Apache Arrow); objects are copied between nodes on demand

## Local Multi-Node Setup

### Manual Cluster Setup

**Head node**:
```bash
# Start Ray head
ray start --head --port=6379 --dashboard-host=0.0.0.0

# Output:
# Started Ray on this node with:
# - Head node IP: 192.168.1.100
# - Dashboard: http://192.168.1.100:8265
```

**Worker nodes**:
```bash
# Connect to head node
ray start --address=192.168.1.100:6379

# Output:
# Started Ray on this node.
# Connected to Ray cluster.
```

**Training script**:
```python
import ray
from ray.train.torch import TorchTrainer
from ray.train import ScalingConfig


def train_func():
    # Placeholder: the per-worker training loop (model, data, optimizer) goes here
    pass


# Connect to cluster
ray.init(address='auto')  # Auto-detects cluster

# Train across all nodes
trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(
        num_workers=16,  # Total workers across all nodes
        use_gpu=True,
        placement_strategy="SPREAD"  # Spread across nodes
    )
)

result = trainer.fit()
```

### Check Cluster Status

```bash
# View cluster status
ray status

# Output:
# ======== Cluster Status ========
# Nodes: 4
# Total CPUs: 128
# Total GPUs: 32
# Total memory: 512 GB
```

**Python API**:
```python
import ray

ray.init(address='auto')

# Get cluster resources
print(ray.cluster_resources())
# {'CPU': 128.0, 'GPU': 32.0, 'memory': 549755813888, 'node:192.168.1.100': 1.0, ...}

# Get available resources
print(ray.available_resources())
```

## Cloud Deployments

### AWS EC2 Cluster

**Cluster config** (`cluster.yaml`):
```yaml
cluster_name: ray-train-cluster
max_workers: 3  # 3 worker nodes

provider:
  type: aws
  region: us-west-2
  availability_zone: us-west-2a

auth:
  ssh_user: ubuntu

head_node_type: head_node

available_node_types:
  head_node:
    node_config:
      InstanceType: p3.2xlarge  # V100 GPU
      ImageId: ami-0a2363a9cff180a64  # Deep Learning AMI
    resources: {"CPU": 8, "GPU": 1}
    min_workers: 0
    max_workers: 0

  worker_node:
    node_config:
      InstanceType: p3.8xlarge  # 4× V100
      ImageId: ami-0a2363a9cff180a64
    resources: {"CPU": 32, "GPU": 4}
    min_workers: 3
    max_workers: 3

setup_commands:
  - pip install -U "ray[train]" torch transformers

head_setup_commands:
  - pip install -U "ray[default]"
```

**Launch cluster**:
```bash
# Start cluster
ray up cluster.yaml

# SSH to head node
ray attach cluster.yaml

# Run training
python train.py

# Teardown
ray down cluster.yaml
```

**Auto-submit job**:
```bash
# Submit job from local machine
ray job submit \
    --address http://<head-node-ip>:8265 \
    --working-dir . \
    -- python train.py
```

### GCP Cluster

**Cluster config** (`gcp-cluster.yaml`):
```yaml
cluster_name: ray-train-gcp

provider:
  type: gcp
  region: us-central1
  availability_zone: us-central1-a
  project_id: my-project-id

auth:
  ssh_user: ubuntu

head_node_type: head_node

available_node_types:
  head_node:
    node_config:
      machineType: n1-standard-8
      disks:
        - boot: true
          autoDelete: true
          type: PERSISTENT
          initializeParams:
            diskSizeGb: 50
            sourceImage: projects/deeplearning-platform-release/global/images/family/pytorch-latest-gpu
      guestAccelerators:
        - acceleratorType: nvidia-tesla-v100
          acceleratorCount: 1
    resources: {"CPU": 8, "GPU": 1}

  worker_node:
    node_config:
      machineType: n1-highmem-16
      disks:
        - boot: true
          autoDelete: true
          type: PERSISTENT
          initializeParams:
            diskSizeGb: 100
            sourceImage: projects/deeplearning-platform-release/global/images/family/pytorch-latest-gpu
      guestAccelerators:
        - acceleratorType: nvidia-tesla-v100
          acceleratorCount: 4
    resou