Verl Rl Training

Name: Verl Rl Training
Author: orchestra-research

orchestra-research/ai-research-skills

433 installs
11.2k repo stars
Updated June 16, 2026

verl-rl-training is a coding-agent skill that documents VERL Ray PPO trainer setup, GPU resource pools, and rollout backends for developers configuring distributed RL fine-tuning on LLMs.

About

verl-rl-training is an orchestra-research/ai-research-skills API reference for VERL distributed reinforcement learning on large language models. The skill covers RayPPOTrainer, the central training-loop controller with its init_workers and fit calls, and ResourcePoolManager, which allocates GPUs across worker groups via Ray PlacementGroups. The example resource_pool_spec maps actor_rollout_ref to 4 GPUs and critic to 2 GPUs. Developers reach for verl-rl-training when configuring PPO-based RL fine-tuning pipelines that need Ray-coordinated rollout, actor, and critic workers rather than single-GPU supervised training. The skill fits ML backend engineers wiring VERL configs during LLM alignment experiments.

Documents RayPPOTrainer lifecycle: init_workers() and fit() for the PPO loop
ResourcePoolManager maps GPU placement groups across actor_rollout_ref and critic pools
RayWorkerGroup dispatches methods to distributed ActorRolloutRefWorker actors
RolloutReplica backends: vLLM, SGLang, TensorRT-LLM, and HuggingFace via config
Hybrid engine mode switching on the actor–rollout–reference worker path

Verl Rl Training by the numbers

433 all-time installs (skills.sh)
+30 installs in the week ending Jul 26, 2026 (Skillselion tracking)
Ranked #470 of 2,066 Data Science & ML skills by installs in the Skillselion catalog
Security screen: MEDIUM risk (skills.sh audit)
Data as of Jul 28, 2026 (Skillselion catalog sync)

npx skills add https://github.com/orchestra-research/ai-research-skills --skill verl-rl-training

Security audit: 2 / 3 scanners passed

How do you configure VERL Ray PPO training for LLM RL fine-tuning?

Look up VERL’s Ray PPO trainer, GPU pools, and rollout backends while you configure distributed RL fine-tuning for an LLM.

Who is it for?

ML engineers setting up VERL distributed PPO fine-tuning with Ray GPU pools and multi-worker rollout backends.

Skip if: you only need standard supervised fine-tuning without RL, or your team does not run Ray-based distributed training infrastructure.

When should I use this skill?

Use it when an LLM alignment project needs VERL RayPPOTrainer setup, GPU pool allocation, or rollout worker configuration.

What you get

RayPPOTrainer configs, GPU resource pool specs, and initialized Ray worker groups for PPO training runs.

RayPPOTrainer configuration
GPU resource pool spec
Initialized training workers

By the numbers

Example resource_pool_spec: actor_rollout_ref 4 GPUs, critic 2 GPUs

Files

SKILL.md (Markdown)

verl: Volcano Engine Reinforcement Learning for LLMs

verl is a flexible, efficient, production-ready RL training library for large language models, developed by ByteDance's Seed team. It implements the HybridFlow framework (EuroSys 2025) and powers models such as Doubao-1.5-pro, which reaches O1-level performance on math benchmarks.

When to Use verl

Choose verl when you need:

Production-ready RL training at scale (tested up to 671B parameters)
Flexibility to swap backends (training: FSDP ↔ Megatron-LM; rollout: vLLM ↔ SGLang)
Support for multiple RL algorithms (PPO, GRPO, RLOO, REINFORCE++, DAPO)
Multi-turn rollout with tool calling for agentic workflows
Vision-language model RL training

Consider alternatives when:

You need Megatron-native training → use slime or miles
You want PyTorch-native abstractions with Monarch → use torchforge
You only need simple SFT/DPO → use TRL or Axolotl

Key Features

Training backends: FSDP, FSDP2, Megatron-LM
Rollout engines: vLLM, SGLang, HuggingFace Transformers
Algorithms: PPO, GRPO, DAPO, RLOO, ReMax, REINFORCE++, SPIN, SPPO
Models: Qwen-3, Llama-3.1, DeepSeek, Gemma-2 (0.5B to 671B)
Advanced: LoRA RL, sequence parallelism, expert parallelism, multi-turn tools

Installation

# Option 1: pip install
pip install "verl[vllm]"  # or "verl[sglang]" for SGLang backend; quotes keep the shell from globbing the brackets

# Option 2: Docker (recommended for production)
docker pull verlai/verl:vllm011.latest

# Option 3: From source
git clone https://github.com/volcengine/verl.git
cd verl && pip install -e ".[vllm,math]"

Quick Start: GRPO Training

python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=~/data/gsm8k/train.parquet \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-7B \
    actor_rollout_ref.rollout.n=8 \
    actor_rollout_ref.actor.use_kl_loss=True \
    trainer.n_gpus_per_node=8

Core Architecture

verl uses a HybridFlow programming model separating control flow from computation:

┌─────────────────────────────────────────────────────────┐
│ Single-Process Controller (Ray)                         │
│ - Orchestrates: rollout → reward → train → sync        │
└─────────────────────┬───────────────────────────────────┘
                      │
┌─────────────────────▼───────────────────────────────────┐
│ Multi-Process Workers                                   │
│ ├── ActorRolloutRefWorker (policy + generation)        │
│ ├── CriticWorker (value estimation, PPO only)          │
│ └── RewardManager (model-based or rule-based rewards)  │
└─────────────────────────────────────────────────────────┘
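
The separation of control flow from computation in the diagram above can be illustrated with a toy loop. This is only a sketch: the four functions are hypothetical stand-ins written for this example, not verl classes or APIs, and the "weights" are reduced to a version counter so the rollout → reward → train → sync order is easy to follow.

```python
# Toy stand-ins for the worker roles; none of these are verl classes.
def rollout(policy_version, prompts):
    # "Generate" one response per prompt, tagged with the weights used.
    return [f"{p} -> answer(v{policy_version})" for p in prompts]

def reward(responses):
    # Rule-based reward: 1.0 if the response contains the word "answer".
    return [1.0 if "answer" in r else 0.0 for r in responses]

def train(policy_version, responses, rewards):
    # Pretend one optimizer step happened and return the new weight version.
    return policy_version + 1

def controller(prompts, steps):
    trainer_version = 0   # weights held by the training engine
    rollout_version = 0   # weights held by the generation engine
    log = []
    for _ in range(steps):
        responses = rollout(rollout_version, prompts)
        rewards = reward(responses)
        trainer_version = train(trainer_version, responses, rewards)
        rollout_version = trainer_version  # sync: push new weights to rollout
        log.append((rollout_version, sum(rewards) / len(rewards)))
    return log

history = controller(["2+2?", "3*3?"], steps=3)
print(history)
```

The controller is the only place that knows the order of the phases; each worker function only does its own computation, which is the property the diagram describes.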

---

Workflow 1: Math Reasoning with GRPO

Use this workflow for training reasoning models on math tasks like GSM8K or MATH.

Prerequisites Checklist

[ ] GPU cluster with 8+ GPUs (H100 recommended)
[ ] Dataset in parquet format with prompt and reward_model columns
[ ] Base model from HuggingFace Hub

Step 1: Prepare Dataset

import pandas as pd

data = [
    {
        "prompt": [{"role": "user", "content": "What is 15 + 27?"}],
        "reward_model": {"ground_truth": "42"}
    },
    # ... more examples
]
df = pd.DataFrame(data)
df.to_parquet("train.parquet")
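
Before writing the parquet file, it can help to check that every record has the two columns listed in the prerequisites. The helper below is a hypothetical sketch in plain Python; the exact schema verl enforces is not stated here, so the checks only mirror the record shape shown above.

```python
def validate_record(record):
    """Return a list of problems found in one training record."""
    problems = []
    prompt = record.get("prompt")
    if not isinstance(prompt, list) or not prompt:
        problems.append("prompt must be a non-empty list of chat messages")
    else:
        for msg in prompt:
            # Each chat message is expected to carry both keys.
            if not {"role", "content"} <= set(msg):
                problems.append("each message needs 'role' and 'content'")
                break
    reward_model = record.get("reward_model")
    if not isinstance(reward_model, dict) or "ground_truth" not in reward_model:
        problems.append("reward_model must contain 'ground_truth'")
    return problems

good = {
    "prompt": [{"role": "user", "content": "What is 15 + 27?"}],
    "reward_model": {"ground_truth": "42"},
}
bad = {"prompt": "What is 15 + 27?"}  # plain string, no reward_model

print(validate_record(good))
print(validate_record(bad))
```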

Step 2: Define Reward Function

# reward_function.py
import re

def compute_reward(responses, ground_truths):
    rewards = []
    for response, gt in zip(responses, ground_truths):
        # Extract answer from response
        match = re.search(r'\\boxed{([^}]+)}', response)
        if match and match.group(1).strip() == gt.strip():
            rewards.append(1.0)
        else:
            rewards.append(0.0)
    return rewards
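
A quick way to sanity-check the reward function is to run it on a few hand-written responses. The sketch below repeats the function so it runs on its own (with the braces escaped in the regex for readability) and uses made-up sample responses; it also shows one limitation of the pattern, which stops at the first closing brace.

```python
import re

def compute_reward(responses, ground_truths):
    rewards = []
    for response, gt in zip(responses, ground_truths):
        # Capture the text inside the first \boxed{...}; nested braces are not handled.
        match = re.search(r'\\boxed\{([^}]+)\}', response)
        if match and match.group(1).strip() == gt.strip():
            rewards.append(1.0)
        else:
            rewards.append(0.0)
    return rewards

responses = ["The sum is \\boxed{42}.", "I think it is 41.", "So \\boxed{ 7 }"]
ground_truths = ["42", "42", "7"]
print(compute_reward(responses, ground_truths))

# Nested braces: the capture ends at the first '}', so this scores 0.0.
print(compute_reward(["\\boxed{\\frac{1}{2}}"], ["\\frac{1}{2}"]))
```

If the dataset contains answers such as fractions, the extraction step would need a brace-matching parser rather than this single regex.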

Step 3: Create Training Config

# config/grpo_math.yaml
algorithm:
  adv_estimator: grpo
  gamma: 1.0
  lam: 1.0

data:
  train_files: /path/to/train.parquet
  val_files: /path/to/val.parquet
  train_batch_size: 256
  max_prompt_length: 512
  max_response_length: 2048

actor_rollout_ref:
  model:
    path: Qwen/Qwen2.5-7B-Instruct
  actor:
    use_kl_loss: true
    kl_loss_coef: 0.001
    ppo_mini_batch_size: 64
  rollout:
    name: vllm
    n: 8  # samples per prompt
    temperature: 0.7
    top_p: 0.95

trainer:
  total_epochs: 3
  n_gpus_per_node: 8
  save_freq: 100
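
The batch settings in this config interact, and a little arithmetic makes the scale concrete. The sketch below assumes that ppo_mini_batch_size is counted in prompts (so the number of actor updates per rollout batch is train_batch_size divided by ppo_mini_batch_size); that interpretation is an assumption and should be checked against the verl version in use.

```python
def batch_plan(train_batch_size, n, ppo_mini_batch_size,
               max_prompt_length, max_response_length):
    """Derive per-step quantities from the config values (sketch only)."""
    responses_per_step = train_batch_size * n
    # Assumption: mini-batches are counted in prompts, not responses.
    updates_per_step = train_batch_size // ppo_mini_batch_size
    max_tokens_per_sequence = max_prompt_length + max_response_length
    return {
        "responses_per_step": responses_per_step,
        "updates_per_step": updates_per_step,
        "max_tokens_per_sequence": max_tokens_per_sequence,
        "worst_case_tokens_per_step": responses_per_step * max_tokens_per_sequence,
    }

plan = batch_plan(train_batch_size=256, n=8, ppo_mini_batch_size=64,
                  max_prompt_length=512, max_response_length=2048)
print(plan)
```

With the values above, each step generates 2048 responses of at most 2560 tokens each, which is a useful upper bound when sizing rollout memory.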

Step 4: Launch Training

python3 -m verl.trainer.main_ppo \
    --config-path config \
    --config-name grpo_math \
    trainer.experiment_name=grpo_math_qwen7b

Step 5: Monitor and Validate

[ ] Check WandB/TensorBoard for loss curves
[ ] Verify reward is increasing over steps
[ ] Run evaluation on held-out test set

---

Workflow 2: PPO with Critic Model

Use this workflow when you need value-based advantage estimation (GAE).

Key Differences from GRPO

Requires separate critic model
Uses Generalized Advantage Estimation (GAE)
Better for tasks with dense rewards

Configuration

algorithm:
  adv_estimator: gae  # Use GAE instead of GRPO
  gamma: 0.99
  lam: 0.95

critic:
  model:
    path: Qwen/Qwen2.5-7B-Instruct  # Can be same or different from actor
  ppo_mini_batch_size: 64

actor_rollout_ref:
  actor:
    use_kl_loss: true
    kl_loss_coef: 0.02
    clip_ratio: 0.2  # PPO clipping
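
To show what gamma and lam control, here is a minimal pure-Python sketch of Generalized Advantage Estimation. It follows the standard GAE recursion and is not taken from verl's implementation; the reward and value lists are made-up inputs, with one extra value entry used as the bootstrap for the final state.

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.

    values must have len(rewards) + 1 entries; the last one is the
    bootstrap value of the state after the final step.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error at time t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# Sparse reward at the last step, critic predicting zero everywhere.
advantages = gae(rewards=[0.0, 0.0, 1.0], values=[0.0, 0.0, 0.0, 0.0])
print(advantages)
```

Each step back from the reward shrinks the advantage by gamma * lam (0.9405 here), which is how these two settings trade bias against variance.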

Launch with Critic

python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=gae \
    critic.model.path=Qwen/Qwen2.5-7B-Instruct \
    trainer.n_gpus_per_node=8

---

Workflow 3: Large-Scale Training with Megatron

Use this workflow for models >70B parameters or when you need expert parallelism.

Prerequisites

[ ] Install Megatron-LM bridge: pip install mbridge
[ ] Convert model to Megatron format
[ ] Multi-node cluster with NVLink/InfiniBand

Configuration for 70B+ Models

actor_rollout_ref:
  model:
    path: /path/to/megatron/checkpoint
    backend: megatron
  actor:
    strategy: megatron
    tensor_model_parallel_size: 8
    pipeline_model_parallel_size: 2
  rollout:
    name: vllm
    tensor_parallel_size: 8

Launch Multi-Node

# On head node
ray start --head --port=6379

# On worker nodes
ray start --address='head_ip:6379'

# Launch training
python3 -m verl.trainer.main_ppo \
    trainer.nnodes=4 \
    trainer.n_gpus_per_node=8
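
The parallel sizes in the Megatron configuration have to divide the cluster evenly. The helper below is a hypothetical sketch of that arithmetic, assuming only tensor and pipeline parallelism are in play (no expert or context parallelism); it is not a verl function.

```python
def data_parallel_size(nnodes, n_gpus_per_node, tp, pp):
    """Number of model replicas that fit on the cluster (sketch only)."""
    world_size = nnodes * n_gpus_per_node
    gpus_per_replica = tp * pp
    if world_size % gpus_per_replica != 0:
        raise ValueError(
            f"{world_size} GPUs cannot be split into replicas of {gpus_per_replica}"
        )
    return world_size // gpus_per_replica

# Values from the config and launch command above: 4 nodes x 8 GPUs, TP=8, PP=2.
dp = data_parallel_size(nnodes=4, n_gpus_per_node=8, tp=8, pp=2)
print(dp)
```

With 32 GPUs and 16 GPUs per replica, two data-parallel replicas fit; a mismatch such as 3 nodes would raise instead of silently leaving GPUs idle.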

---

Configuration Reference

Algorithm Selection

Algorithm	`adv_estimator`	Use Case
GRPO	`grpo`	Critic-free, math/reasoning
PPO/GAE	`gae`	Dense rewards, value estimation
REINFORCE++	`reinforce_plus_plus`	Variance reduction
RLOO	`rloo`	Leave-one-out baseline
ReMax	`remax`	Maximum reward baseline
OPO	`opo`	On-policy RL with optimal reward baseline
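
As an illustration of the "leave-one-out baseline" entry in the table, the sketch below computes RLOO-style advantages for one prompt's group of sampled responses. It is a plain-Python rendering of the general idea with made-up rewards, not verl's implementation.

```python
def rloo_advantages(rewards):
    """Advantage of each sample against the mean reward of the other samples."""
    n = len(rewards)
    if n < 2:
        raise ValueError("leave-one-out needs at least two samples per prompt")
    total = sum(rewards)
    # Baseline for sample i is the average of all rewards except rewards[i].
    return [r - (total - r) / (n - 1) for r in rewards]

group = [1.0, 0.0, 0.0, 0.0]  # four responses to the same prompt, one correct
print(rloo_advantages(group))
```

The single correct response gets an advantage of 1.0, and each incorrect one gets about -0.33, because its baseline includes the correct sample.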

Key Parameters

# Rollout parameters
actor_rollout_ref.rollout.n: 8              # Samples per prompt
actor_rollout_ref.rollout.temperature: 0.7  # Sampling temperature
actor_rollout_ref.rollout.top_p: 0.95       # Nucleus sampling

# Training parameters
actor_rollout_ref.actor.lr: 1e-6            # Learning rate
actor_rollout_ref.actor.ppo_mini_batch_size: 64
actor_rollout_ref.actor.clip_ratio: 0.2     # PPO clip range

# KL control
actor_rollout_ref.actor.use_kl_loss: true
actor_rollout_ref.actor.kl_loss_coef: 0.001
algorithm.kl_ctrl.target_kl: 0.1            # For adaptive KL control
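
For intuition about what a KL loss term measures, the sketch below computes three common per-sample KL estimators from the log-probabilities of a single token under the policy and the reference model. The k1/k2/k3 names follow a widely used convention for sample-based estimators; whether they correspond exactly to verl's kl_loss_type values is an assumption.

```python
import math

def kl_estimates(logp_policy, logp_ref):
    """Per-sample estimators of KL(policy || ref) for a token sampled from the policy."""
    log_ratio = logp_ref - logp_policy      # log r, with r = p_ref / p_policy
    ratio = math.exp(log_ratio)
    k1 = -log_ratio                         # unbiased, can be negative
    k2 = 0.5 * log_ratio ** 2               # biased, always non-negative
    k3 = (ratio - 1.0) - log_ratio          # unbiased, always non-negative
    return k1, k2, k3

k1, k2, k3 = kl_estimates(logp_policy=-1.0, logp_ref=-2.0)
print(k1, k2, k3)
```

Here k1 is 1.0, k2 is 0.5, and k3 is exp(-1), roughly 0.368; the k3 form is often preferred because it stays non-negative per sample while remaining unbiased.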

---

Common Issues and Solutions

Issue: OOM During Rollout

Symptoms: CUDA out of memory during generation phase

Solutions:

# Reduce batch size
actor_rollout_ref.rollout.log_prob_micro_batch_size: 4

# Enable gradient checkpointing
actor_rollout_ref.model.enable_gradient_checkpointing: true

# Use FSDP2 with CPU offloading
actor_rollout_ref.actor.strategy: fsdp2
actor_rollout_ref.actor.fsdp_config.offload_policy: true

Issue: Training Instability

Symptoms: Loss spikes, reward collapse

Solutions:

# Reduce learning rate
actor_rollout_ref.actor.lr: 5e-7

# Increase KL penalty
actor_rollout_ref.actor.kl_loss_coef: 0.01

# Enable gradient clipping
actor_rollout_ref.actor.max_grad_norm: 1.0

Issue: Slow Weight Sync

Symptoms: Long pauses between rollout and training

Solutions:

# Use FSDP2 for faster resharding
actor_rollout_ref.actor.strategy=fsdp2

# Enable async weight transfer
trainer.async_weight_update=true

Issue: vLLM Version Mismatch

Symptoms: Import errors or generation failures

Solution: Use compatible versions:

pip install "vllm>=0.8.5,<=0.12.0"  # quote the spec so the shell does not treat > and < as redirects
# Avoid vLLM 0.7.x (known bugs)

---

Advanced Topics

Multi-Turn Tool Calling

See references/multi-turn.md for agentic workflows with tool use.

Vision-Language Models

actor_rollout_ref:
  model:
    path: Qwen/Qwen2.5-VL-7B-Instruct
  rollout:
    name: vllm
    enable_vision: true

LoRA Training

actor_rollout_ref:
  actor:
    lora:
      enabled: true
      r: 16
      alpha: 32
      target_modules: ["q_proj", "v_proj"]

---

Resources

Documentation: https://verl.readthedocs.io/
Paper: https://arxiv.org/abs/2409.19256
GitHub: https://github.com/volcengine/verl
Recipes: https://github.com/verl-project/verl-recipe (DAPO, GSPO, etc.)
Community: Slack at verl-project

verl API Reference

Core Classes

RayPPOTrainer

The central controller for the training loop. Manages resource allocation and coordinates worker groups.

from verl import RayPPOTrainer

trainer = RayPPOTrainer(
    config=config,
    resource_pool_manager=resource_manager,
    ray_worker_group_cls=RayWorkerGroup,
)
trainer.init_workers()
trainer.fit()

ResourcePoolManager

Manages GPU allocation across different worker groups using Ray PlacementGroups.

from verl.trainer.ppo.resource_pool import ResourcePoolManager

manager = ResourcePoolManager(
    resource_pool_spec={
        "actor_rollout_ref": {"gpu": 4},
        "critic": {"gpu": 2},
    }
)
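
Before handing a spec to the manager, it can be worth confirming that the requested GPUs fit the cluster. The helper functions below are hypothetical and only tally the dictionary shape used in the example above; they do not call ResourcePoolManager or Ray.

```python
def total_gpus(resource_pool_spec):
    """Sum the GPU counts requested across all pools."""
    return sum(pool["gpu"] for pool in resource_pool_spec.values())

def fits_cluster(resource_pool_spec, available_gpus):
    """True when the spec asks for no more GPUs than the cluster has."""
    return total_gpus(resource_pool_spec) <= available_gpus

spec = {
    "actor_rollout_ref": {"gpu": 4},
    "critic": {"gpu": 2},
}
print(total_gpus(spec), fits_cluster(spec, available_gpus=8))
```

The example spec needs 6 GPUs in total, so it fits on one 8-GPU node but not on a 4-GPU node.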

RayWorkerGroup

Abstraction for distributed method execution. Spawns Ray actors and dispatches method calls.

from verl.trainer.ppo.ray_worker_group import RayWorkerGroup

worker_group = RayWorkerGroup(
    num_workers=8,
    worker_cls=ActorRolloutRefWorker,
    resource_pool=pool,
)

ActorRolloutRefWorker

Worker class implementing policy training, generation, and reference model computations. Manages hybrid engine mode switching.

# Typically configured via YAML, not instantiated directly
# See configuration section below

RolloutReplica

Interface for inference backends with implementations for vLLM, SGLang, TensorRT-LLM, and HuggingFace.

from verl.workers.rollout import RolloutReplica

Backend selection via config:

rollout:
  name: vllm  # or: sglang, hf, tensorrt-llm

Configuration Schema

PPO Configuration (`verl/trainer/config/ppo_trainer.yaml`)

# Data configuration
data:
  train_files: /path/to/train.parquet
  val_files: /path/to/val.parquet
  train_batch_size: 256        # Global batch size of prompts
  max_prompt_length: 512
  max_response_length: 2048

# Algorithm configuration
algorithm:
  adv_estimator: gae           # gae, grpo, rloo, reinforce_plus_plus
  gamma: 0.99                  # Discount factor
  lam: 0.95                    # GAE lambda
  use_kl_in_reward: false      # Add KL term to reward

# Actor configuration
actor_rollout_ref:
  model:
    path: Qwen/Qwen2.5-7B-Instruct
    backend: fsdp              # fsdp, fsdp2, megatron
  actor:
    ppo_mini_batch_size: 64    # Mini-batch for actor updates
    ppo_epochs: 1              # Number of actor update epochs
    clip_ratio: 0.2            # PPO clip range
    use_kl_loss: true          # Use KL loss in actor
    kl_loss_coef: 0.001        # KL loss coefficient
    kl_loss_type: low_var      # KL divergence calculation method
    loss_agg_mode: token-mean  # token-mean or sequence-mean
    gradient_checkpointing: true
    max_grad_norm: 1.0         # Gradient clipping
    lr: 1e-6                   # Learning rate
  rollout:
    name: vllm                 # vllm, sglang, hf
    n: 8                       # Samples per prompt
    temperature: 0.7
    top_p: 0.95
    log_prob_micro_batch_size: 8

# Critic configuration (PPO only)
critic:
  model:
    path: Qwen/Qwen2.5-7B-Instruct
  ppo_mini_batch_size: 64
  ppo_epochs: 1                # Defaults to actor epochs

# Trainer configuration
trainer:
  total_epochs: 3
  n_gpus_per_node: 8
  nnodes: 1
  save_freq: 100
  experiment_name: my_experiment
  async_weight_update: false
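
The loss_agg_mode setting changes how much weight long responses get. The sketch below contrasts the two modes named in the config comment on made-up per-token losses; it is an illustration of the two averaging schemes, not verl's code.

```python
def token_mean(losses):
    """Average over every token in the batch; long sequences weigh more."""
    flat = [x for seq in losses for x in seq]
    return sum(flat) / len(flat)

def sequence_mean(losses):
    """Average each sequence first, then average across sequences."""
    per_sequence = [sum(seq) / len(seq) for seq in losses]
    return sum(per_sequence) / len(per_sequence)

# One 4-token response with loss 1.0 per token, one 1-token response with loss 3.0.
losses = [[1.0, 1.0, 1.0, 1.0], [3.0]]
print(token_mean(losses), sequence_mean(losses))
```

token_mean gives 1.4 because the long response dominates, while sequence_mean gives 2.0 because both responses count equally.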

GRPO Configuration (`docs/algo/grpo.md`)

algorithm:
  adv_estimator: grpo          # Enable GRPO
  gamma: 1.0
  lam: 1.0

actor_rollout_ref:
  rollout:
    n: 8                       # Must be > 1 for GRPO
  actor:
    use_kl_loss: true          # Required for GRPO
    kl_loss_coef: 0.001
    kl_loss_type: low_var      # or: k1, k2, k3
    loss_agg_mode: token-mean
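
The requirement that rollout.n be greater than 1 follows from how group-relative advantages are formed. The sketch below normalizes each reward against its own group's mean and standard deviation; using the population standard deviation and this epsilon value are assumptions of the example, not details confirmed from verl.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]

print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
print(group_relative_advantages([1.0]))  # a single sample carries no signal
```

With one sample per prompt the reward always equals the group mean, so the advantage is 0.0 and the policy gets no learning signal.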

Multi-Turn Configuration (`verl/trainer/config/rollout/rollout.yaml`)

actor_rollout_ref:
  rollout:
    name: sglang               # Required for multi-turn
    multi_turn:
      enable: true
      tool_config_path: /path/to/tools.yaml
      interaction_config_path: /path/to/interaction.yaml

Reward Functions

Built-in Reward Types

# Model-based reward
reward_model:
  path: OpenRLHF/Llama-3-8b-rm-700k

# Custom function-based reward
custom_reward_function:
  path: /path/to/reward.py
  name: compute_score          # Function name, default: compute_score

Custom Reward Function Signature

# reward.py
def compute_score(responses: list[str], ground_truths: list[str], **kwargs) -> list[float]:
    """
    Compute rewards for a batch of responses.

    Args:
        responses: Generated completions
        ground_truths: Expected answers from data
        **kwargs: Additional metadata

    Returns:
        List of reward scores (floats)
    """
    rewards = []
    for response, gt in zip(responses, ground_truths):
        # Placeholder exact-match check; replace with your own reward logic
        score = 1.0 if response.strip() == gt.strip() else 0.0
        rewards.append(score)
    return rewards

Backend-Specific Configuration

FSDP Configuration

actor_rollout_ref:
  actor:
    strategy: fsdp
    fsdp_config:
      mixed_precision: bf16
      sharding_strategy: FULL_SHARD
      offload_policy: false

FSDP2 Configuration

actor_rollout_ref:
  actor:
    strategy: fsdp2
    fsdp_config:
      offload_policy: true     # CPU offloading
      reshard_after_forward: true

Megatron Configuration

actor_rollout_ref:
  model:
    backend: megatron
  actor:
    strategy: megatron
    tensor_model_parallel_size: 8
    pipeline_model_parallel_size: 2
    megatron:
      use_mbridge: true        # Required for format conversion

vLLM Rollout Configuration

actor_rollout_ref:
  rollout:
    name: vllm
    tensor_parallel_size: 2
    gpu_memory_utilization: 0.9
    max_num_seqs: 256
    enforce_eager: false

SGLang Rollout Configuration

actor_rollout_ref:
  rollout:
    name: sglang
    tp_size: 2
    mem_fraction_static: 0.8
    context_length: 8192

Algorithm Reference

Algorithm	`adv_estimator`	Requires Critic	Best For
PPO	`gae`	Yes	Dense rewards, value estimation
GRPO	`grpo`	No	Sparse rewards, math/reasoning
RLOO	`rloo`	No	Leave-one-out baseline
REINFORCE++	`reinforce_plus_plus`	No	Variance reduction
DAPO	`dapo`	No	Decoupled clipping and dynamic sampling

Vision-Language Model Support

actor_rollout_ref:
  model:
    path: Qwen/Qwen2.5-VL-7B-Instruct
  rollout:
    name: vllm
    enable_vision: true
    max_model_len: 32768

LoRA Configuration

actor_rollout_ref:
  actor:
    lora:
      enabled: true
      r: 16
      alpha: 32
      target_modules: ["q_proj", "v_proj", "k_proj", "o_proj"]
      dropout: 0.05

Resources

Documentation: https://verl.readthedocs.io/
GitHub: https://github.com/volcengine/verl
Paper: https://arxiv.org/abs/2409.19256 (HybridFlow)

verl Troubleshooting Guide

Common Issues and Solutions

OOM (Out of Memory) Issues

Issue: OOM During Rollout

Symptoms: CUDA out of memory during generation phase

Solutions:

1. Reduce log prob batch size:

actor_rollout_ref:
  rollout:
    log_prob_micro_batch_size: 4  # Reduce from 8

2. Enable gradient checkpointing:

actor_rollout_ref:
  actor:
    gradient_checkpointing: true

3. Use FSDP2 with CPU offloading:

actor_rollout_ref:
  actor:
    strategy: fsdp2
    fsdp_config:
      offload_policy: true

4. Reduce vLLM memory utilization:

actor_rollout_ref:
  rollout:
    gpu_memory_utilization: 0.7  # Reduce from 0.9

Issue: OOM During Training

Symptoms: CUDA OOM in backward pass

Solutions:

1. Reduce batch sizes:

actor_rollout_ref:
  actor:
    ppo_mini_batch_size: 32  # Reduce from 64

2. Use gradient accumulation:

actor_rollout_ref:
  actor:
    gradient_accumulation_steps: 4

3. Enable mixed precision:

actor_rollout_ref:
  actor:
    fsdp_config:
      mixed_precision: bf16

Training Stability Issues

Issue: Training Instability / Loss Spikes

Symptoms: Loss spikes, reward collapse, divergence

Solutions:

1. Reduce learning rate:

actor_rollout_ref:
  actor:
    lr: 5e-7  # Reduce from 1e-6

2. Increase KL penalty:

actor_rollout_ref:
  actor:
    kl_loss_coef: 0.01  # Increase from 0.001

3. Enable gradient clipping:

actor_rollout_ref:
  actor:
    max_grad_norm: 1.0

4. Use smaller PPO clip range:

actor_rollout_ref:
  actor:
    clip_ratio: 0.1  # Reduce from 0.2
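
To see why a smaller clip range dampens updates, here is a minimal sketch of the standard PPO clipped surrogate loss for a single token. It follows the textbook formulation with made-up inputs and is not taken from verl's source.

```python
import math

def clipped_surrogate(logp_new, logp_old, advantage, clip_ratio=0.2):
    """PPO clipped policy loss for one token (negated objective)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - clip_ratio, min(ratio, 1.0 + clip_ratio))
    # Take the more pessimistic of the clipped and unclipped objectives.
    return -min(ratio * advantage, clipped * advantage)

# The new policy makes this token 1.5x more likely, with a positive advantage.
logp_old = math.log(0.2)
logp_new = math.log(0.3)
loss_wide = clipped_surrogate(logp_new, logp_old, advantage=1.0, clip_ratio=0.2)
loss_narrow = clipped_surrogate(logp_new, logp_old, advantage=1.0, clip_ratio=0.1)
print(loss_wide, loss_narrow)
```

With clip_ratio 0.2 the objective is capped at 1.2, and with 0.1 it is capped at 1.1, so the narrower range removes the incentive to move the policy further in one update.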

Issue: Policy Collapse (Entropy Drops to Zero)

Symptoms: Model outputs become deterministic, entropy approaches zero

Solutions:

1. Increase temperature during rollout:

actor_rollout_ref:
  rollout:
    temperature: 0.9  # Increase from 0.7

2. Add entropy bonus:

algorithm:
  entropy_coef: 0.01

3. Reduce KL penalty:

actor_rollout_ref:
  actor:
    kl_loss_coef: 0.0001  # Reduce
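
The effect of the rollout temperature on entropy can be checked with a small calculation. The sketch below applies temperature scaling to made-up logits and compares the entropy of the resulting distributions at the two temperatures mentioned above.

```python
import math

def softmax(logits, temperature):
    """Temperature-scaled softmax with the usual max-subtraction for stability."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

logits = [2.0, 1.0, 0.0]  # hypothetical next-token logits
h_low = entropy(softmax(logits, temperature=0.7))
h_high = entropy(softmax(logits, temperature=0.9))
print(h_low, h_high)
```

The higher temperature flattens the distribution and raises its entropy, which is why it is listed as a first remedy when sampled outputs become near-deterministic.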

Weight Synchronization Issues

Issue: Slow Weight Sync

Symptoms: Long pauses between rollout and training phases

Solutions:

1. Use FSDP2 for faster resharding:

actor_rollout_ref:
  actor:
    strategy: fsdp2

2. Enable async weight transfer:

trainer:
  async_weight_update: true

3. Reduce sync frequency:

trainer:
  weight_sync_interval: 2  # Sync every 2 steps

Issue: Weight Sync Timeout

Symptoms: Ray actor timeouts during weight synchronization

Solutions:

1. Increase Ray timeout:

import ray
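ray.init(num_gpus=8)
# ray.init() does not accept a timeout argument; set the timeout where the
# result is awaited, e.g. ray.get(object_ref, timeout=3600)  # 1 hour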

2. Use colocated mode (if memory allows):

trainer:
  colocate_actor_ref: true

vLLM Version Issues

Issue: vLLM Import Errors or Generation Failures

Symptoms: Import errors, generation hangs, incorrect outputs

Solutions:

1. Use compatible vLLM version:

pip install "vllm>=0.8.2,<=0.12.0"  # quote the spec so the shell does not treat > and < as redirects
# Avoid vLLM 0.7.x (known bugs)

2. For vLLM 0.8.x issues:

actor_rollout_ref:
  rollout:
    enforce_eager: true  # Disable CUDA graphs

3. Check CUDA version compatibility:

# vLLM 0.11+ requires CUDA 12.1+
nvidia-smi  # Check CUDA version

Ray Issues

Issue: Ray Cluster Connection Failures

Symptoms: Cannot connect to Ray cluster

Solutions:

1. Check Ray head node:

ray status

2. Restart Ray cluster:

ray stop
ray start --head --port=6379 --num-gpus=8

3. Verify network connectivity:

ping head_node_ip

Issue: Ray Actor OOM

Symptoms: Ray actors killed due to OOM

Solutions:

1. Increase Ray object store memory:

ray start --head --object-store-memory=10000000000  # 10GB

2. Enable spilling to disk:

export RAY_object_spilling_config='{"type":"filesystem","params":{"directory_path":"/tmp/ray_spill"}}'

Multi-Node Issues

Issue: NCCL Timeout

Symptoms: NCCL operations timeout on multi-node

Solutions:

1. Set NCCL environment variables:

export NCCL_DEBUG=INFO
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_DISABLE=0  # Enable InfiniBand if available

2. Increase NCCL timeout:

export NCCL_TIMEOUT=1800  # 30 minutes

3. Check network interface:

ifconfig  # Verify correct interface

Issue: DeepSpeed GPU Index Out of Range

Symptoms: "GPU index out of range" error with DeepSpeed

Solutions:

export RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1

Data Issues

Issue: Empty Batches

Symptoms: Training receives empty batches

Solutions:

1. Verify data format:

import pandas as pd
df = pd.read_parquet("train.parquet")
print(df.columns)  # Should include 'prompt', 'reward_model'

2. Check data loading:

data:
  train_files: /absolute/path/to/train.parquet  # Use absolute path

Issue: Tokenization Errors

Symptoms: Tokenizer errors, sequence length mismatches

Solutions:

1. Set padding token:

tokenizer.pad_token = tokenizer.eos_token

2. Verify max length configuration:

data:
  max_prompt_length: 512
  max_response_length: 2048
# Total should not exceed model's max length
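
The length constraint in the comment above is a one-line check. The function below is a hypothetical helper, and the model limits passed to it are example values rather than properties of any particular model.

```python
def fits_context(max_prompt_length, max_response_length, model_max_length):
    """True when prompt plus response budget fits the model's context window."""
    return max_prompt_length + max_response_length <= model_max_length

# 512 + 2048 = 2560 tokens requested.
print(fits_context(512, 2048, model_max_length=4096))
print(fits_context(512, 2048, model_max_length=2048))
```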

Megatron-Specific Issues

Issue: Megatron Checkpoint Loading Fails

Symptoms: Cannot load Megatron checkpoints

Solutions:

1. Enable mbridge conversion:

actor_rollout_ref:
  actor:
    megatron:
      use_mbridge: true

2. Convert HuggingFace to Megatron format:

python tools/convert_hf_to_megatron.py \
    --hf_model_path /path/to/hf/model \
    --save_path /path/to/megatron/checkpoint

Issue: Megatron on AMD GPUs

Current Limitation: Megatron-LM backend is not supported on AMD GPUs. Use FSDP backend instead:

actor_rollout_ref:
  model:
    backend: fsdp

Debugging Tips

Enable Verbose Logging

trainer:
  logging_level: DEBUG

export VERL_DEBUG=1
export RAY_DEDUP_LOGS=0

Check GPU Utilization

watch -n 1 nvidia-smi

Profile Training

# Add profiling to training loop
import torch.profiler

with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA],
    record_shapes=True,
) as prof:
    trainer.fit()
prof.export_chrome_trace("trace.json")

Resources

GitHub Issues: https://github.com/volcengine/verl/issues
Documentation: https://verl.readthedocs.io/
Community Slack: verl-project

How it compares

Use verl-rl-training for Ray PPO RL fine-tuning; use pytorch-lightning for conventional supervised Lightning experiment loops.

FAQ

What is the central VERL training class in verl-rl-training?

The verl-rl-training skill documents RayPPOTrainer as the central controller managing resource allocation, worker group coordination, init_workers setup, and the fit training loop for PPO-based LLM fine-tuning.

How does verl-rl-training allocate GPUs across workers?

The verl-rl-training skill uses ResourcePoolManager with resource_pool_spec entries such as actor_rollout_ref at 4 GPUs and critic at 2 GPUs, backed by Ray PlacementGroups for worker group scheduling.

Is Verl Rl Training safe to install?

skills.sh reports 2 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.

