
Verl Rl Training
Look up VERL’s Ray PPO trainer, GPU pools, and rollout backends while you configure distributed RL fine-tuning for an LLM.
Install
npx skills add https://github.com/orchestra-research/ai-research-skills --skill verl-rl-training
What is this skill?
- Documents RayPPOTrainer lifecycle: init_workers() and fit() for the PPO loop
- ResourcePoolManager maps GPU placement groups across actor_rollout_ref and critic pools
- RayWorkerGroup dispatches methods to distributed ActorRolloutRefWorker actors
- RolloutReplica backends: vLLM, SGLang, TensorRT-LLM, and HuggingFace via config
- Hybrid engine mode switching on the actor–rollout–reference worker path
Adoption & trust: 1 install on skills.sh; 9.4k GitHub stars; 2/3 security scanners passed (skills.sh audits).
Journey fit
RL training infrastructure is implemented and wired during the Build phase, when you are assembling model-training systems, not when you are only ideating or shipping a product without ML. Distributed trainers, worker groups, and inference rollouts are backend/ML systems work, so the canonical shelf is build → backend.
Common Questions / FAQ
Is Verl Rl Training safe to install?
skills.sh reports 2 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.
SKILL.md
# verl API Reference

## Core Classes

### RayPPOTrainer

The central controller for the training loop. Manages resource allocation and coordinates worker groups.

```python
from verl import RayPPOTrainer

trainer = RayPPOTrainer(
    config=config,
    resource_pool_manager=resource_manager,
    ray_worker_group_cls=RayWorkerGroup,
)
trainer.init_workers()
trainer.fit()
```

### ResourcePoolManager

Manages GPU allocation across different worker groups using Ray PlacementGroups.

```python
from verl.trainer.ppo.resource_pool import ResourcePoolManager

manager = ResourcePoolManager(
    resource_pool_spec={
        "actor_rollout_ref": {"gpu": 4},
        "critic": {"gpu": 2},
    }
)
```

### RayWorkerGroup

Abstraction for distributed method execution. Spawns Ray actors and dispatches method calls.

```python
from verl.trainer.ppo.ray_worker_group import RayWorkerGroup

worker_group = RayWorkerGroup(
    num_workers=8,
    worker_cls=ActorRolloutRefWorker,
    resource_pool=pool,
)
```

### ActorRolloutRefWorker

Worker class implementing policy training, generation, and reference model computations. Manages hybrid engine mode switching.

```python
# Typically configured via YAML, not instantiated directly
# See configuration section below
```

### RolloutReplica

Interface for inference backends with implementations for vLLM, SGLang, TensorRT-LLM, and HuggingFace.
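Choosing an inference backend by name can be pictured as a small registry lookup. The sketch below is illustrative only and is not verl's implementation: `BACKENDS`, `register_backend`, and `select_backend` are hypothetical names, and each factory returns a placeholder label instead of a real engine. Only the backend names themselves come from this reference.

```python
from typing import Callable, Dict

# Hypothetical registry mapping a backend name to a factory function.
BACKENDS: Dict[str, Callable[[], str]] = {}


def register_backend(name: str) -> Callable:
    """Return a decorator that stores a factory under `name`."""
    def decorator(factory: Callable[[], str]) -> Callable[[], str]:
        BACKENDS[name] = factory
        return factory
    return decorator


# Placeholder factories: each returns a label instead of a real engine.
for _name in ("vllm", "sglang", "hf", "tensorrt-llm"):
    register_backend(_name)(lambda _name=_name: f"{_name} replica (placeholder)")


def select_backend(rollout_config: dict) -> str:
    """Look up the factory named by rollout_config["name"] and call it."""
    name = rollout_config.get("name")
    if name not in BACKENDS:
        known = ", ".join(sorted(BACKENDS))
        raise ValueError(f"unknown rollout backend {name!r}; expected one of: {known}")
    return BACKENDS[name]()


print(select_backend({"name": "vllm"}))  # vllm replica (placeholder)
```

Failing fast on an unknown name surfaces a typo in the `rollout.name` setting when the config is loaded, before any workers are launched.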
```python
from verl.workers.rollout import RolloutReplica

# Backend selection via config (YAML):
#   rollout:
#     name: vllm  # or: sglang, hf, tensorrt-llm
```

## Configuration Schema

### PPO Configuration (`verl/trainer/config/ppo_trainer.yaml`)

```yaml
# Data configuration
data:
  train_files: /path/to/train.parquet
  val_files: /path/to/val.parquet
  train_batch_size: 256  # Global batch size of prompts
  max_prompt_length: 512
  max_response_length: 2048

# Algorithm configuration
algorithm:
  adv_estimator: gae  # gae, grpo, rloo, reinforce_plus_plus
  gamma: 0.99  # Discount factor
  lam: 0.95  # GAE lambda
  use_kl_in_reward: false  # Add KL term to reward

# Actor configuration
actor_rollout_ref:
  model:
    path: Qwen/Qwen2.5-7B-Instruct
    backend: fsdp  # fsdp, fsdp2, megatron
  actor:
    ppo_mini_batch_size: 64  # Mini-batch for actor updates
    ppo_epochs: 1  # Number of actor update epochs
    clip_ratio: 0.2  # PPO clip range
    use_kl_loss: true  # Use KL loss in actor
    kl_loss_coef: 0.001  # KL loss coefficient
    kl_loss_type: low_var  # KL divergence calculation method
    loss_agg_mode: token-mean  # token-mean or sequence-mean
    gradient_checkpointing: true
    max_grad_norm: 1.0  # Gradient clipping
    lr: 1e-6  # Learning rate
  rollout:
    name: vllm  # vllm, sglang, hf
    n: 8  # Samples per prompt
    temperature: 0.7
    top_p: 0.95
    log_prob_micro_batch_size: 8

# Critic configuration (PPO only)
critic:
  model:
    path: Qwen/Qwen2.5-7B-Instruct
  ppo_mini_batch_size: 64
  ppo_epochs: 1  # Defaults to actor epochs

# Trainer configuration
trainer:
  total_epochs: 3
  n_gpus_per_node: 8
  nnodes: 1
  save_freq: 100
  experiment_name: my_experiment
  async_weight_update: false
```

### GRPO Configuration (`docs/algo/grpo.md`)

```yaml
algorithm:
  adv_estimator: grpo  # Enable GRPO
  gamma: 1.0
  lam: 1.0

actor_rollout_ref:
  rollout:
    n: 8  # Must be > 1 for GRPO
  actor:
    use_kl_loss: true  # Required for GRPO
    kl_loss_coef: 0.001
    kl_loss_type: low_var  # or: k1, k2, k3
    loss_agg_mode: token-mean
```

### Multi-Turn Configuration (`verl/trainer/config/rollout/rollout.yaml`)

```yaml
actor_rollout_ref:
  rollout:
    name: sglang  # Required for multi-turn
    multi_turn:
      enable: true
      tool_config_path: /path/to/tools.yaml
      interaction_config_path: /path/to/intera