
Openrlhf Training
Pick and configure an OpenRLHF RL algorithm (PPO, GRPO, REINFORCE++, RLOO) with the right flags for your RLHF training run.
Install
npx skills add https://github.com/orchestra-research/ai-research-skills --skill openrlhf-training
What is this skill?
- Compares 6 advantage estimators: gae (PPO), reinforce, reinforce_baseline, group_norm (GRPO), dr_grpo, rloo
- Documents PPO clipped objective, critic requirement, and typical clip_eps and learning-rate knobs
- Maps stability, memory, and speed tradeoffs per algorithm family
- Supplies copy-paste bash flags for --advantage_estimator and related hyperparameters
- Clarifies when critic-based GAE vs critic-free REINFORCE++/GRPO variants fit
Adoption & trust: 1 install on skills.sh; 9.4k GitHub stars; 0/3 security scanners passed (skills.sh audits).
Journey fit
Primary fit
Build is where you implement and run model training pipelines; this skill is algorithm selection and hyperparameter wiring for OpenRLHF jobs. Backend subphase fits training orchestration, distributed RL loops, and CLI-driven jobs rather than UI or marketplace packaging.
Common Questions / FAQ
Is Openrlhf Training safe to install?
skills.sh reports 0 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.
SKILL.md
# Algorithm Comparison

Complete guide to RL algorithms in OpenRLHF: PPO, REINFORCE++, GRPO, RLOO, and their variants.

## Overview

OpenRLHF supports six RL algorithms, selectable via `--advantage_estimator`:

- **gae** - PPO with Generalized Advantage Estimation
- **reinforce** - REINFORCE++ (PPO optimizations without critic)
- **reinforce_baseline** - REINFORCE++ with baseline
- **group_norm** - GRPO (Group Relative Policy Optimization)
- **dr_grpo** - Dr. GRPO (GRPO without std normalization)
- **rloo** - RLOO (REINFORCE Leave-One-Out)

## Algorithm Details

### PPO (Proximal Policy Optimization)

**Formula**:

```
loss = -min(ratio * advantages, clip(ratio, 1-ε, 1+ε) * advantages)
ratio = π_new(a|s) / π_old(a|s)
```

**Characteristics**:

- **Stability**: High (clipped objective prevents large updates)
- **Memory**: High (stores actor + critic experiences)
- **Speed**: Medium (critic training overhead)
- **Requires**: Critic network for value estimation

**Implementation**:

```python
surr1 = ratio * advantages
surr2 = ratio.clamp(1 - clip_eps_low, 1 + clip_eps_high) * advantages
loss = -torch.min(surr1, surr2)
```

**When to use**:

- General-purpose RLHF
- Complex reward functions
- Need stable training

**Hyperparameters**:

```bash
--advantage_estimator gae    # Enable PPO
--clip_eps_low 0.2           # Clipping lower bound
--clip_eps_high 0.2          # Clipping upper bound
--actor_learning_rate 1e-6
--critic_learning_rate 9e-6
--init_kl_coef 0.01
```

### REINFORCE++

**Formula**:

```
loss = -ratio * advantages (with PPO-clip)
advantages = cumulative_returns - baseline
```

**Characteristics**:

- **Stability**: Higher than GRPO
- **Memory**: Lower (no critic network)
- **Speed**: Faster than PPO
- **Requires**: No critic network

**Key innovation**: Integrates PPO optimizations (advantage normalization, PPO-clip loss) into REINFORCE while eliminating critic network overhead.

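The combination described above (critic-free advantages, batch normalization, and the PPO-clip loss) can be sketched in a few lines. This is a minimal illustration, not OpenRLHF's actual code: the function names `normalize_advantages` and `clipped_policy_loss` are hypothetical, the batch-mean baseline is an assumption, the sample values are made up, and plain Python lists stand in for the tensors a real training loop would use.

```python
import math


def normalize_advantages(advantages, eps=1e-8):
    """Shift and scale advantages to zero mean and unit std across the batch."""
    mean = sum(advantages) / len(advantages)
    var = sum((a - mean) ** 2 for a in advantages) / len(advantages)
    std = math.sqrt(var)
    return [(a - mean) / (std + eps) for a in advantages]


def clipped_policy_loss(logp_new, logp_old, advantages,
                        clip_eps_low=0.2, clip_eps_high=0.2):
    """Mean PPO-clip loss; mirrors the surr1/surr2 snippet in the PPO section."""
    losses = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # π_new(a|s) / π_old(a|s)
        clipped = min(max(ratio, 1.0 - clip_eps_low), 1.0 + clip_eps_high)
        losses.append(-min(ratio * adv, clipped * adv))
    return sum(losses) / len(losses)


# Illustrative values only: critic-free advantages = returns - baseline.
returns = [1.0, 0.0, 0.5, 1.0]
baseline = sum(returns) / len(returns)  # assumption: batch-mean baseline
advantages = normalize_advantages([r - baseline for r in returns])

logp_old = [-1.2, -0.8, -1.0, -0.5]
logp_new = [-1.0, -0.9, -1.0, -0.4]
loss = clipped_policy_loss(logp_new, logp_old, advantages)
print(f"loss = {loss:.4f}")
```

No value network appears anywhere in the sketch: the only learned quantity is the policy's log-probability, which is where the memory saving over PPO with a critic comes from.
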
**When to use**:

- Want PPO stability without critic
- Limited memory budget
- Fast training priority

**Hyperparameters**:

```bash
--advantage_estimator reinforce
--critic_pretrain None       # No critic needed
--init_kl_coef 0.01
--actor_learning_rate 1e-6
```

### REINFORCE++-baseline

**Formula**:

```
rewards = rewards - mean(rewards_same_prompt)
```

**Characteristics**:

- **Stability**: Very high
- **Memory**: Lower (no critic)
- **Speed**: Faster than PPO
- **Requires**: Multiple samples per prompt

**Key innovation**: Uses the mean reward of multiple samples from the same prompt as a baseline to reshape rewards.

**When to use**:

- RLVR (Reinforcement Learning with Verifiable Rewards) settings
- Reward patterns vary (0/1/-0.5)
- Multiple samples per prompt available

**Hyperparameters**:

```bash
--advantage_estimator reinforce_baseline
--n_samples_per_prompt 4     # Must be > 1
--init_kl_coef 0.01
```

### GRPO (Group Relative Policy Optimization)

**Formula**:

```
rewards = (rewards - mean(rewards)) / (std(rewards) + 1e-9)
loss = -ratio * normalized_advantages
KL loss (optional): k1, k2, or k3 estimator
```

**Characteristics**:

- **Stability**: Lower than REINFORCE++
- **Memory**: Lower (no critic)
- **Speed**: Fast
- **Requires**: Group reward normalization

**Key innovation**: Group-based advantage normalization with optional KL loss.

**When to use**:

- Exploring policy optimization variants
- Need reward normalization
- Memory-constrained

**Hyperparameters**:

```bash
--advantage_estimator group_norm
--use_kl_loss                # Enable KL loss
--kl_estimator k3            # k3 for loss, k2 ≈ k1
--init_kl_coef 0.01
--no_advantage_std_norm      # Optional: disable std norm
```

**KL estimator variance**:

- **k3**: Larger variance under categorical distribution
- **k1, k2**: Similar variance, k2 ≈ k1 for loss

### Dr. GRPO

**Formula**:

```
rewards = rewards - mean(rewards)  # No std normalization
```

**Characteristics**:

- **Stability**: Similar to GRPO
- **Memory**: Lower (no critic)
- **Speed**: Fast
- **Requires**: Group mean norma