Grpo Rl Training

Name: Grpo Rl Training
Author: orchestra-research

orchestra-research/ai-research-skills

433 installs
11.2k repo stars
Updated June 16, 2026
orchestra-research/ai-research-skills

Copy battle-tested GRPO reward functions into your RL fine-tuning loop so group-relative policy optimization scores completions on correctness, format, length, and style.

About

grpo-rl-training is a procedural library of Group Relative Policy Optimization (GRPO) reward functions for solo builders and small teams fine-tuning language models with verifiable or structured outputs. Instead of inventing reward logic from scratch, you adapt pre-defined correctness rewards (exact and fuzzy match), format penalties, length controls, and style signals that mirror battle-tested training setups. The skill fits when you already have a GRPO trainer and need consistent, weighted objectives for math, Q&A, summarization, or formatted agent responses. It matters because mis-specified rewards silently waste GPU time and produce models that look fluent but fail grading or schema checks. Treat it as reference code to wire into your training script—not a full training orchestration skill.

Library of reward functions across correctness, format, length, style, and combined multi-objective scoring
Includes exact-match and fuzzy-match correctness rewards with documented weight guidance (e.g. 2.0 for verifiable tasks)
Designed for common GRPO training scenarios with copy-paste adaptation hooks
Python implementations with extract_answer integration points for structured outputs

Grpo Rl Training by the numbers

433 all-time installs (skills.sh)
+30 installs in the week ending Jul 26, 2026 (Skillselion tracking)
Ranked #470 of 2,066 Data Science & ML skills by installs in the Skillselion catalog
Security screen: CRITICAL risk (skills.sh audit)
Data as of Jul 28, 2026 (Skillselion catalog sync)

npx skills add https://github.com/orchestra-research/ai-research-skills --skill grpo-rl-training

Add your badge

Show developers this skill is listed on Skillselion. Paste this into your README.

[![Listed on Skillselion](https://skillselion.com/badge/skills/orchestra-research/ai-research-skills/grpo-rl-training.svg)](https://skillselion.com/skills/orchestra-research/ai-research-skills/grpo-rl-training)

Installs	433
repo stars	★ 11.2k
Security audit	1 / 3 scanners passed
Last updated	June 16, 2026
Repository	orchestra-research/ai-research-skills ↗

What it does

Copy battle-tested GRPO reward functions into your RL fine-tuning loop so group-relative policy optimization scores completions on correctness, format, length, and style.

Files

SKILL.mdMarkdownGitHub ↗

GRPO/RL Training with TRL

Expert-level guidance for implementing Group Relative Policy Optimization (GRPO) using the Transformer Reinforcement Learning (TRL) library. This skill provides battle-tested patterns, critical insights, and production-ready workflows for fine-tuning language models with custom reward functions.

When to Use This Skill

Use GRPO training when you need to:

Enforce specific output formats (e.g., XML tags, JSON, structured reasoning)
Teach verifiable tasks with objective correctness metrics (math, coding, fact-checking)
Improve reasoning capabilities by rewarding chain-of-thought patterns
Align models to domain-specific behaviors without labeled preference data
Optimize for multiple objectives simultaneously (format + correctness + style)

Do NOT use GRPO for:

Simple supervised fine-tuning tasks (use SFT instead)
Tasks without clear reward signals
When you already have high-quality preference pairs (use DPO/PPO instead)

---

Core Concepts

1. GRPO Algorithm Fundamentals

Key Mechanism:

Generates multiple completions for each prompt (group size: 4-16)
Compares completions within each group using reward functions
Updates policy to favor higher-rewarded responses relative to the group

Critical Difference from PPO:

No separate reward model needed
More sample-efficient (learns from within-group comparisons)
Simpler to implement and debug

Mathematical Intuition:

For each prompt p:
  1. Generate N completions: {c₁, c₂, ..., cₙ}
  2. Compute rewards: {r₁, r₂, ..., rₙ}
  3. Learn to increase probability of high-reward completions
     relative to low-reward ones in the same group

2. Reward Function Design Philosophy

Golden Rules: 1. Compose multiple reward functions - Each handles one aspect (format, correctness, style) 2. Scale rewards appropriately - Higher weight = stronger signal 3. Use incremental rewards - Partial credit for partial compliance 4. Test rewards independently - Debug each reward function in isolation

Reward Function Types:

Type	Use Case	Example Weight
Correctness	Verifiable tasks (math, code)	2.0 (highest)
Format	Strict structure enforcement	0.5-1.0
Length	Encourage verbosity/conciseness	0.1-0.5
Style	Penalize unwanted patterns	-0.5 to 0.5

---

Implementation Workflow

Step 1: Dataset Preparation

Critical Requirements:

Prompts in chat format (list of dicts with 'role' and 'content')
Include system prompts to set expectations
For verifiable tasks, include ground truth answers as additional columns

Example Structure:

from datasets import load_dataset, Dataset

SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
[Your step-by-step thinking]
</reasoning>
<answer>
[Final answer]
</answer>
"""

def prepare_dataset(raw_data):
    """
    Transform raw data into GRPO-compatible format.

    Returns: Dataset with columns:
    - 'prompt': List[Dict] with role/content (system + user messages)
    - 'answer': str (ground truth, optional but recommended)
    """
    return raw_data.map(lambda x: {
        'prompt': [
            {'role': 'system', 'content': SYSTEM_PROMPT},
            {'role': 'user', 'content': x['question']}
        ],
        'answer': extract_answer(x['raw_answer'])
    })

Pro Tips:

Use one-shot or few-shot examples in system prompt for complex formats
Keep prompts concise (max_prompt_length: 256-512 tokens)
Validate data quality before training (garbage in = garbage out)

Step 2: Reward Function Implementation

Template Structure:

def reward_function_name(
    prompts,        # List[List[Dict]]: Original prompts
    completions,    # List[List[Dict]]: Model generations
    answer=None,    # Optional: Ground truth from dataset
    **kwargs        # Additional dataset columns
) -> list[float]:
    """
    Evaluate completions and return rewards.

    Returns: List of floats (one per completion)
    """
    # Extract completion text
    responses = [comp[0]['content'] for comp in completions]

    # Compute rewards
    rewards = []
    for response in responses:
        score = compute_score(response)
        rewards.append(score)

    return rewards

Example 1: Correctness Reward (Math/Coding)

def correctness_reward(prompts, completions, answer, **kwargs):
    """Reward correct answers with high score."""
    responses = [comp[0]['content'] for comp in completions]
    extracted = [extract_final_answer(r) for r in responses]
    return [2.0 if ans == gt else 0.0
            for ans, gt in zip(extracted, answer)]

Example 2: Format Reward (Structured Output)

import re

def format_reward(completions, **kwargs):
    """Reward XML-like structured format."""
    pattern = r'<reasoning>.*?</reasoning>\s*<answer>.*?</answer>'
    responses = [comp[0]['content'] for comp in completions]
    return [1.0 if re.search(pattern, r, re.DOTALL) else 0.0
            for r in responses]

Example 3: Incremental Format Reward (Partial Credit)

def incremental_format_reward(completions, **kwargs):
    """Award partial credit for format compliance."""
    responses = [comp[0]['content'] for comp in completions]
    rewards = []

    for r in responses:
        score = 0.0
        if '<reasoning>' in r:
            score += 0.25
        if '</reasoning>' in r:
            score += 0.25
        if '<answer>' in r:
            score += 0.25
        if '</answer>' in r:
            score += 0.25
        # Penalize extra text after closing tag
        if r.count('</answer>') == 1:
            extra_text = r.split('</answer>')[-1].strip()
            score -= len(extra_text) * 0.001
        rewards.append(score)

    return rewards

Critical Insight: Combine 3-5 reward functions for robust training. Order matters less than diversity of signals.

Step 3: Training Configuration

Memory-Optimized Config (Small GPU)

from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="outputs/grpo-model",

    # Learning rate
    learning_rate=5e-6,          # Lower = more stable
    adam_beta1=0.9,
    adam_beta2=0.99,
    weight_decay=0.1,
    warmup_ratio=0.1,
    lr_scheduler_type='cosine',

    # Batch settings
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,  # Effective batch = 4

    # GRPO-specific
    num_generations=8,            # Group size: 8-16 recommended
    max_prompt_length=256,
    max_completion_length=512,

    # Training duration
    num_train_epochs=1,
    max_steps=None,               # Or set fixed steps (e.g., 500)

    # Optimization
    bf16=True,                    # Faster on A100/H100
    optim="adamw_8bit",          # Memory-efficient optimizer
    max_grad_norm=0.1,

    # Logging
    logging_steps=1,
    save_steps=100,
    report_to="wandb",            # Or "none" for no logging
)

High-Performance Config (Large GPU)

training_args = GRPOConfig(
    output_dir="outputs/grpo-model",
    learning_rate=1e-5,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    num_generations=16,           # Larger groups = better signal
    max_prompt_length=512,
    max_completion_length=1024,
    num_train_epochs=1,
    bf16=True,
    use_vllm=True,                # Fast generation with vLLM
    logging_steps=10,
)

Critical Hyperparameters:

Parameter	Impact	Tuning Advice
`num_generations`	Group size for comparison	Start with 8, increase to 16 if GPU allows
`learning_rate`	Convergence speed/stability	5e-6 (safe), 1e-5 (faster, riskier)
`max_completion_length`	Output verbosity	Match your task (512 for reasoning, 256 for short answers)
`gradient_accumulation_steps`	Effective batch size	Increase if GPU memory limited

Step 4: Model Setup and Training

Standard Setup (Transformers)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig
from trl import GRPOTrainer

# Load model
model_name = "Qwen/Qwen2.5-1.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # 2-3x faster
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Optional: LoRA for parameter-efficient training
peft_config = LoraConfig(
    r=16,                         # Rank (higher = more capacity)
    lora_alpha=32,               # Scaling factor (typically 2*r)
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    task_type="CAUSAL_LM",
    lora_dropout=0.05,
)

# Initialize trainer
trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[
        incremental_format_reward,
        format_reward,
        correctness_reward,
    ],
    args=training_args,
    train_dataset=dataset,
    peft_config=peft_config,      # Remove for full fine-tuning
)

# Train
trainer.train()

# Save
trainer.save_model("final_model")

Unsloth Setup (2-3x Faster)

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="google/gemma-3-1b-it",
    max_seq_length=1024,
    load_in_4bit=True,
    fast_inference=True,
    max_lora_rank=32,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=32,
    use_gradient_checkpointing="unsloth",
)

# Rest is identical to standard setup
trainer = GRPOTrainer(model=model, ...)
trainer.train()

---

Critical Training Insights

1. Loss Behavior (EXPECTED PATTERN)

Loss starts near 0 and INCREASES during training
This is CORRECT - loss measures KL divergence from initial policy
Model is learning (diverging from original behavior to optimize rewards)
Monitor reward metrics instead of loss for progress

2. Reward Tracking

Key metrics to watch:

reward: Average across all completions
reward_std: Diversity within groups (should remain > 0)
kl: KL divergence from reference (should grow moderately)

Healthy Training Pattern:

Step   Reward    Reward_Std   KL
100    0.5       0.3          0.02
200    0.8       0.25         0.05
300    1.2       0.2          0.08  ← Good progression
400    1.5       0.15         0.12

Warning Signs:

Reward std → 0 (model collapsing to single response)
KL exploding (> 0.5) (diverging too much, reduce LR)
Reward stuck (reward functions too harsh or model capacity issue)

3. Common Pitfalls and Solutions

Problem	Symptom	Solution
Mode collapse	All completions identical	Increase `num_generations`, add diversity penalty
No learning	Flat rewards	Check reward function logic, increase LR
OOM errors	GPU memory exceeded	Reduce `num_generations`, enable gradient checkpointing
Slow training	< 1 it/s	Enable `use_vllm=True`, use Unsloth, reduce seq length
Format ignored	Model doesn't follow structure	Increase format reward weight, add incremental rewards

---

Advanced Patterns

1. Multi-Stage Training

For complex tasks, train in stages:

# Stage 1: Format compliance (epochs=1)
trainer_stage1 = GRPOTrainer(
    model=model,
    reward_funcs=[incremental_format_reward, format_reward],
    ...
)
trainer_stage1.train()

# Stage 2: Correctness (epochs=1)
trainer_stage2 = GRPOTrainer(
    model=model,
    reward_funcs=[format_reward, correctness_reward],
    ...
)
trainer_stage2.train()

2. Adaptive Reward Scaling

class AdaptiveReward:
    def __init__(self, base_reward_func, initial_weight=1.0):
        self.func = base_reward_func
        self.weight = initial_weight

    def __call__(self, *args, **kwargs):
        rewards = self.func(*args, **kwargs)
        return [r * self.weight for r in rewards]

    def adjust_weight(self, success_rate):
        """Increase weight if model struggling, decrease if succeeding."""
        if success_rate < 0.3:
            self.weight *= 1.2
        elif success_rate > 0.8:
            self.weight *= 0.9

3. Custom Dataset Integration

def load_custom_knowledge_base(csv_path):
    """Example: School communication platform docs."""
    import pandas as pd
    df = pd.read_csv(csv_path)

    dataset = Dataset.from_pandas(df).map(lambda x: {
        'prompt': [
            {'role': 'system', 'content': CUSTOM_SYSTEM_PROMPT},
            {'role': 'user', 'content': x['question']}
        ],
        'answer': x['expert_answer']
    })
    return dataset

---

Deployment and Inference

Save and Merge LoRA

# Merge LoRA adapters into base model
if hasattr(trainer.model, 'merge_and_unload'):
    merged_model = trainer.model.merge_and_unload()
    merged_model.save_pretrained("production_model")
    tokenizer.save_pretrained("production_model")

Inference Example

from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="production_model",
    tokenizer=tokenizer
)

result = generator(
    [
        {'role': 'system', 'content': SYSTEM_PROMPT},
        {'role': 'user', 'content': "What is 15 + 27?"}
    ],
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9
)
print(result[0]['generated_text'])

---

Best Practices Checklist

Before Training:

[ ] Validate dataset format (prompts as List[Dict])
[ ] Test reward functions on sample data
[ ] Calculate expected max_prompt_length from data
[ ] Choose appropriate num_generations based on GPU memory
[ ] Set up logging (wandb recommended)

During Training:

[ ] Monitor reward progression (should increase)
[ ] Check reward_std (should stay > 0.1)
[ ] Watch for OOM errors (reduce batch size if needed)
[ ] Sample generations every 50-100 steps
[ ] Validate format compliance on holdout set

After Training:

[ ] Merge LoRA weights if using PEFT
[ ] Test on diverse prompts
[ ] Compare to baseline model
[ ] Document reward weights and hyperparameters
[ ] Save reproducibility config

---

Troubleshooting Guide

Debugging Workflow

1. Isolate reward functions - Test each independently 2. Check data distribution - Ensure diversity in prompts 3. Reduce complexity - Start with single reward, add gradually 4. Monitor generations - Print samples every N steps 5. Validate extraction logic - Ensure answer parsing works

Quick Fixes

# Debug reward function
def debug_reward(completions, **kwargs):
    responses = [comp[0]['content'] for comp in completions]
    for i, r in enumerate(responses[:2]):  # Print first 2
        print(f"Response {i}: {r[:200]}...")
    return [1.0] * len(responses)  # Dummy rewards

# Test without training
trainer = GRPOTrainer(..., reward_funcs=[debug_reward])
trainer.generate_completions(dataset[:1])  # Generate without updating

---

References and Resources

Official Documentation:

TRL GRPO Trainer: https://huggingface.co/docs/trl/grpo_trainer
DeepSeek R1 Paper: https://arxiv.org/abs/2501.12948
Unsloth Docs: https://docs.unsloth.ai/

Example Repositories:

Open R1 Implementation: https://github.com/huggingface/open-r1
TRL Examples: https://github.com/huggingface/trl/tree/main/examples

Recommended Reading:

Progressive Disclosure Pattern for agent instructions
Reward shaping in RL (Ng et al.)
LoRA paper (Hu et al., 2021)

---

Usage Instructions for Agents

When this skill is loaded:

1. Read this entire file before implementing GRPO training 2. Start with the simplest reward function (e.g., length-based) to validate setup 3. Use the templates in templates/ directory as starting points 4. Reference examples in examples/ for task-specific implementations 5. Follow the workflow sequentially (don't skip steps) 6. Debug incrementally - add one reward function at a time

Critical Reminders:

Always use multiple reward functions (3-5 is optimal)
Monitor reward metrics, not loss
Test reward functions before training
Start small (num_generations=4), scale up gradually
Save checkpoints frequently (every 100 steps)

This skill is designed for expert-level implementation. Beginners should start with supervised fine-tuning before attempting GRPO.

"""
GRPO Reward Functions Library
===============================

A collection of battle-tested reward functions for common GRPO training scenarios.
Copy and adapt these for your specific use case.

Categories:
- Correctness rewards (verifiable tasks)
- Format rewards (structured output)
- Length rewards (verbosity control)
- Style rewards (quality and tone)
- Combined rewards (multi-objective)
"""

import re
from typing import List, Any

# ==================== CORRECTNESS REWARDS ====================

def exact_match_reward(prompts, completions, answer, **kwargs) -> List[float]:
    """
    Binary reward for exact answer match.
    Use for: Math problems, factual Q&A, code output

    Weight: 2.0 (highest priority)
    """
    responses = [comp[0]['content'] for comp in completions]
    extracted = [extract_answer(r) for r in responses]
    return [2.0 if ans.strip() == gt.strip() else 0.0
            for ans, gt in zip(extracted, answer)]

def fuzzy_match_reward(prompts, completions, answer, **kwargs) -> List[float]:
    """
    Partial credit for similar answers.
    Use for: Open-ended answers, summaries

    Weight: 1.0
    """
    from difflib import SequenceMatcher

    responses = [comp[0]['content'] for comp in completions]
    extracted = [extract_answer(r) for r in responses]

    rewards = []
    for ans, gt in zip(extracted, answer):
        similarity = SequenceMatcher(None, ans.lower(), gt.lower()).ratio()
        rewards.append(similarity)

    return rewards

def numeric_correctness_reward(prompts, completions, answer, tolerance=0.01, **kwargs) -> List[float]:
    """
    Reward numeric answers within tolerance.
    Use for: Math, physics, engineering problems

    Weight: 2.0
    """
    responses = [comp[0]['content'] for comp in completions]
    extracted = [extract_answer(r) for r in responses]

    rewards = []
    for ans, gt in zip(extracted, answer):
        try:
            ans_num = float(ans.replace(',', ''))
            gt_num = float(gt.replace(',', ''))
            if abs(ans_num - gt_num) / max(abs(gt_num), 1e-8) <= tolerance:
                rewards.append(2.0)
            else:
                rewards.append(0.0)
        except:
            rewards.append(0.0)

    return rewards

def code_execution_reward(prompts, completions, test_cases, **kwargs) -> List[float]:
    """
    Execute code and verify against test cases.
    Use for: Code generation tasks

    Weight: 2.0
    """
    responses = [comp[0]['content'] for comp in completions]
    extracted_code = [extract_code_block(r) for r in responses]

    rewards = []
    for code in extracted_code:
        try:
            # Execute code (sandboxed!)
            passed = run_test_cases(code, test_cases)
            rewards.append(2.0 if passed else 0.0)
        except:
            rewards.append(0.0)

    return rewards

# ==================== FORMAT REWARDS ====================

def strict_xml_format_reward(completions, **kwargs) -> List[float]:
    """
    Strict XML format: exact newlines and spacing.
    Use for: When format must be EXACTLY specified

    Weight: 0.5
    """
    pattern = r'^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n$'
    responses = [comp[0]['content'] for comp in completions]
    matches = [re.match(pattern, r, re.DOTALL) for r in responses]
    return [0.5 if match else 0.0 for match in matches]

def soft_xml_format_reward(completions, **kwargs) -> List[float]:
    """
    Relaxed XML format: allows whitespace variations.
    Use for: When structure matters more than exact spacing

    Weight: 0.5
    """
    pattern = r'<reasoning>.*?</reasoning>\s*<answer>.*?</answer>'
    responses = [comp[0]['content'] for comp in completions]
    matches = [re.search(pattern, r, re.DOTALL) for r in responses]
    return [0.5 if match else 0.0 for match in matches]

def json_format_reward(completions, **kwargs) -> List[float]:
    """
    Reward valid JSON output.
    Use for: Structured data extraction, API responses

    Weight: 0.5
    """
    import json

    responses = [comp[0]['content'] for comp in completions]
    rewards = []

    for r in responses:
        try:
            json.loads(r)
            rewards.append(0.5)
        except:
            rewards.append(0.0)

    return rewards

def incremental_format_reward(completions, tags=['reasoning', 'answer'], **kwargs) -> List[float]:
    """
    Partial credit for each required tag.
    Use for: Training models to gradually learn format

    Weight: sum(0.125 * num_tags * 2) = up to 0.5 for 2 tags
    """
    responses = [comp[0]['content'] for comp in completions]
    rewards = []

    for r in responses:
        score = 0.0
        for tag in tags:
            if f'<{tag}>' in r:
                score += 0.125
            if f'</{tag}>' in r:
                score += 0.125

        # Penalize extra content after final closing tag
        if f'</{tags[-1]}>' in r:
            extra = r.split(f'</{tags[-1]}>')[-1].strip()
            score -= len(extra) * 0.001

        rewards.append(score)

    return rewards

# ==================== LENGTH REWARDS ====================

def ideal_length_reward(completions, ideal_tokens=100, **kwargs) -> List[float]:
    """
    Reward responses near ideal length.
    Use for: Controlling verbosity

    Weight: 0.3
    """
    responses = [comp[0]['content'] for comp in completions]
    rewards = []

    for r in responses:
        length = len(r.split())
        distance = abs(length - ideal_tokens)
        # Gaussian-like reward peaking at ideal length
        reward = 0.3 * max(0, 1 - distance / ideal_tokens)
        rewards.append(reward)

    return rewards

def min_length_reward(completions, min_tokens=50, **kwargs) -> List[float]:
    """
    Penalize responses that are too short.
    Use for: Ensuring detailed explanations

    Weight: 0.2
    """
    responses = [comp[0]['content'] for comp in completions]
    rewards = []

    for r in responses:
        length = len(r.split())
        reward = 0.2 if length >= min_tokens else -0.2
        rewards.append(reward)

    return rewards

def max_length_penalty(completions, max_tokens=500, **kwargs) -> List[float]:
    """
    Penalize excessively long responses.
    Use for: Preventing rambling

    Weight: -0.3 when violated
    """
    responses = [comp[0]['content'] for comp in completions]
    rewards = []

    for r in responses:
        length = len(r.split())
        reward = -0.3 if length > max_tokens else 0.0
        rewards.append(reward)

    return rewards

# ==================== STYLE REWARDS ====================

def reasoning_quality_reward(completions, **kwargs) -> List[float]:
    """
    Reward detailed reasoning with logical connectors.
    Use for: Improving chain-of-thought quality

    Weight: 0.3
    """
    logical_words = ['therefore', 'thus', 'because', 'since', 'consequently',
                     'first', 'second', 'next', 'finally', 'however']

    responses = [comp[0]['content'] for comp in completions]
    rewards = []

    for r in responses:
        reasoning = extract_xml_tag(r, 'reasoning').lower()
        # Count logical connectors
        count = sum(1 for word in logical_words if word in reasoning)
        # Normalize by length
        score = min(0.3, count * 0.05)
        rewards.append(score)

    return rewards

def citation_reward(completions, **kwargs) -> List[float]:
    """
    Reward responses with citations or references.
    Use for: Research tasks, fact-checking

    Weight: 0.2
    """
    citation_patterns = [
        r'\[\d+\]',           # [1], [2]
        r'\([A-Z][a-z]+,?\s+\d{4}\)',  # (Smith, 2020)
        r'according to',
        r'as stated in',
    ]

    responses = [comp[0]['content'] for comp in completions]
    rewards = []

    for r in responses:
        has_citation = any(re.search(pattern, r) for pattern in citation_patterns)
        rewards.append(0.2 if has_citation else 0.0)

    return rewards

def no_repetition_penalty(completions, **kwargs) -> List[float]:
    """
    Penalize repetitive text (same phrase repeated).
    Use for: Improving output diversity

    Weight: -0.3 when repetitive
    """
    responses = [comp[0]['content'] for comp in completions]
    rewards = []

    for r in responses:
        words = r.lower().split()
        # Check for repeated trigrams
        trigrams = [' '.join(words[i:i+3]) for i in range(len(words)-2)]
        unique_ratio = len(set(trigrams)) / max(len(trigrams), 1)

        reward = -0.3 if unique_ratio < 0.7 else 0.0
        rewards.append(reward)

    return rewards

# ==================== COMBINED REWARDS ====================

def math_problem_reward(prompts, completions, answer, **kwargs) -> List[float]:
    """
    Combined reward for math problems: format + correctness.
    Automatically balances multiple objectives.

    Weight: 2.5 total
    """
    format_rewards = soft_xml_format_reward(completions)
    correctness_rewards = exact_match_reward(prompts, completions, answer)

    return [f + c for f, c in zip(format_rewards, correctness_rewards)]

def code_generation_reward(prompts, completions, test_cases, **kwargs) -> List[float]:
    """
    Combined reward for code: format + execution + style.

    Weight: 2.7 total
    """
    code_format_rewards = code_block_format_reward(completions)
    execution_rewards = code_execution_reward(prompts, completions, test_cases)
    no_error_rewards = no_syntax_error_reward(completions)

    return [f + e + s for f, e, s in zip(code_format_rewards, execution_rewards, no_error_rewards)]

# ==================== HELPER FUNCTIONS ====================

def extract_answer(text: str) -> str:
    """Extract content from <answer> tags."""
    return extract_xml_tag(text, 'answer')

def extract_xml_tag(text: str, tag: str) -> str:
    """Generic XML tag extraction."""
    pattern = f'<{tag}>(.*?)</{tag}>'
    match = re.search(pattern, text, re.DOTALL)
    return match.group(1).strip() if match else ""

def extract_code_block(text: str) -> str:
    """Extract code from markdown code blocks."""
    pattern = r'```(?:python)?\n(.*?)\n```'
    match = re.search(pattern, text, re.DOTALL)
    return match.group(1) if match else ""

def run_test_cases(code: str, test_cases: List[tuple]) -> bool:
    """
    Execute code with test cases (MUST be sandboxed in production!).

    Args:
        code: Python code string
        test_cases: List of (input, expected_output) tuples

    Returns:
        True if all tests pass
    """
    # WARNING: This is a simplified example
    # In production, use proper sandboxing (e.g., docker, pypy sandbox)
    try:
        exec_globals = {}
        exec(code, exec_globals)

        for input_val, expected in test_cases:
            result = exec_globals['solution'](input_val)
            if result != expected:
                return False
        return True
    except:
        return False

# ==================== REWARD FUNCTION PRESETS ====================

# Preset for math/reasoning tasks
MATH_REASONING_REWARDS = [
    incremental_format_reward,
    soft_xml_format_reward,
    exact_match_reward,
    reasoning_quality_reward,
]

# Preset for code generation
CODE_GENERATION_REWARDS = [
    code_block_format_reward,
    code_execution_reward,
    no_syntax_error_reward,
]

# Preset for summarization
SUMMARIZATION_REWARDS = [
    ideal_length_reward,
    fuzzy_match_reward,
    no_repetition_penalty,
]

# Preset for Q&A
QA_REWARDS = [
    exact_match_reward,
    min_length_reward,
    citation_reward,
]

GRPO/RL Training Skill

Expert-level guidance for Group Relative Policy Optimization with TRL

📁 Skill Structure

grpo-rl-training/
├── SKILL.md                              # Main skill documentation (READ THIS FIRST)
├── README.md                             # This file
├── templates/
│   └── basic_grpo_training.py            # Production-ready training template
└── examples/
    └── reward_functions_library.py       # 20+ reward function examples

🚀 Quick Start

1. Read SKILL.md - Comprehensive guide with all concepts and patterns 2. Copy `templates/basic_grpo_training.py` - Start with working code 3. Browse `examples/reward_functions_library.py` - Pick reward functions for your task 4. Modify for your use case - Adapt dataset, rewards, and config

💡 What's Inside

SKILL.md (Main Documentation)

Core GRPO concepts and algorithm fundamentals
Complete implementation workflow (dataset → rewards → training → deployment)
10+ reward function examples with code
Hyperparameter tuning guide
Training insights (loss behavior, metrics, debugging)
Troubleshooting guide
Production best practices

Templates

basic_grpo_training.py: Minimal, production-ready training script
Uses Qwen 2.5 1.5B Instruct
3 reward functions (format + correctness)
LoRA for efficient training
Fully documented and ready to run

Examples

reward_functions_library.py: 20+ battle-tested reward functions
Correctness rewards (exact match, fuzzy match, numeric, code execution)
Format rewards (XML, JSON, strict/soft)
Length rewards (ideal length, min/max)
Style rewards (reasoning quality, citations, repetition penalty)
Combined rewards (multi-objective optimization)
Preset collections for common tasks

📖 Usage for Agents

When this skill is loaded in your agent's context:

1. Always read SKILL.md first before implementing 2. Start simple - Use length-based reward to validate setup 3. Build incrementally - Add one reward function at a time 4. Reference examples - Copy patterns from reward_functions_library.py 5. Monitor training - Watch reward metrics (not loss!)

🎯 Common Use Cases

Task Type	Recommended Rewards	Template
Math reasoning	`MATH_REASONING_REWARDS` preset	basic_grpo_training.py
Code generation	`CODE_GENERATION_REWARDS` preset	Modify dataset in template
Summarization	`SUMMARIZATION_REWARDS` preset	Adjust prompts + rewards
Q&A	`QA_REWARDS` preset	Use fuzzy match + citations

⚠️ Critical Reminders

Loss goes UP during training - This is normal (it's KL divergence)
Use 3-5 reward functions - Single rewards often fail
Test rewards before training - Debug each function independently
Monitor reward_std - Should stay > 0.1 (avoid mode collapse)
Start with num_generations=4-8 - Scale up if GPU allows

🔗 External Resources

📝 Version

v1.0.0 - Initial release (January 2025)

👨‍💻 Maintained By

Orchestra Research For questions or improvements, see https://orchestra.com

---

License: MIT Last Updated: January 2025

"""
Basic GRPO Training Template
=============================

A minimal, production-ready template for GRPO training with TRL.
Adapt this for your specific task by modifying:
1. Dataset loading (get_dataset function)
2. Reward functions (reward_*_func)
3. System prompt (SYSTEM_PROMPT)
4. Hyperparameters (GRPOConfig)
"""

import torch
import re
from datasets import load_dataset, Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig
from trl import GRPOTrainer, GRPOConfig

# ==================== CONFIGURATION ====================

MODEL_NAME = "Qwen/Qwen2.5-1.5B-Instruct"
OUTPUT_DIR = "outputs/grpo-model"
MAX_PROMPT_LENGTH = 256
MAX_COMPLETION_LENGTH = 512

SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
[Your step-by-step thinking]
</reasoning>
<answer>
[Final answer]
</answer>
"""

# ==================== DATASET ====================

def get_dataset(split="train"):
    """
    Load and prepare your dataset.

    Returns: Dataset with columns:
    - 'prompt': List[Dict] with role/content
    - 'answer': str (ground truth, optional)
    """
    # Example: GSM8K math dataset
    data = load_dataset('openai/gsm8k', 'main')[split]

    def process_example(x):
        # Extract ground truth answer
        answer = x['answer'].split('####')[1].strip() if '####' in x['answer'] else None

        return {
            'prompt': [
                {'role': 'system', 'content': SYSTEM_PROMPT},
                {'role': 'user', 'content': x['question']}
            ],
            'answer': answer
        }

    return data.map(process_example)

# ==================== HELPER FUNCTIONS ====================

def extract_xml_tag(text: str, tag: str) -> str:
    """Extract content between XML tags."""
    pattern = f'<{tag}>(.*?)</{tag}>'
    match = re.search(pattern, text, re.DOTALL)
    return match.group(1).strip() if match else ""

def extract_answer(text: str) -> str:
    """Extract the final answer from structured output."""
    return extract_xml_tag(text, 'answer')

# ==================== REWARD FUNCTIONS ====================

def correctness_reward_func(prompts, completions, answer, **kwargs):
    """
    Reward correct answers.
    Weight: 2.0 (highest priority)
    """
    responses = [comp[0]['content'] for comp in completions]
    extracted = [extract_answer(r) for r in responses]
    return [2.0 if ans == gt else 0.0 for ans, gt in zip(extracted, answer)]

def format_reward_func(completions, **kwargs):
    """
    Reward proper XML format.
    Weight: 0.5
    """
    pattern = r'<reasoning>.*?</reasoning>\s*<answer>.*?</answer>'
    responses = [comp[0]['content'] for comp in completions]
    return [0.5 if re.search(pattern, r, re.DOTALL) else 0.0 for r in responses]

def incremental_format_reward_func(completions, **kwargs):
    """
    Incremental reward for partial format compliance.
    Weight: up to 0.5
    """
    responses = [comp[0]['content'] for comp in completions]
    rewards = []

    for r in responses:
        score = 0.0
        if '<reasoning>' in r:
            score += 0.125
        if '</reasoning>' in r:
            score += 0.125
        if '<answer>' in r:
            score += 0.125
        if '</answer>' in r:
            score += 0.125

        # Penalize extra content after closing tag
        if '</answer>' in r:
            extra = r.split('</answer>')[-1].strip()
            score -= len(extra) * 0.001

        rewards.append(score)

    return rewards

# ==================== MODEL SETUP ====================

def setup_model_and_tokenizer():
    """Load model and tokenizer with optimizations."""
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
        device_map="auto"
    )

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    tokenizer.pad_token = tokenizer.eos_token

    return model, tokenizer

def get_peft_config():
    """LoRA configuration for parameter-efficient training."""
    return LoraConfig(
        r=16,
        lora_alpha=32,
        target_modules=[
            "q_proj", "k_proj", "v_proj", "o_proj",
            "gate_proj", "up_proj", "down_proj"
        ],
        task_type="CAUSAL_LM",
        lora_dropout=0.05,
    )

# ==================== TRAINING ====================

def main():
    """Main training function."""

    # Load data
    print("Loading dataset...")
    dataset = get_dataset()
    print(f"Dataset size: {len(dataset)}")

    # Setup model
    print("Loading model...")
    model, tokenizer = setup_model_and_tokenizer()

    # Training configuration
    training_args = GRPOConfig(
        output_dir=OUTPUT_DIR,
        run_name="grpo-training",

        # Learning rate
        learning_rate=5e-6,
        adam_beta1=0.9,
        adam_beta2=0.99,
        weight_decay=0.1,
        warmup_ratio=0.1,
        lr_scheduler_type='cosine',

        # Batch settings
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,

        # GRPO specific
        num_generations=8,
        max_prompt_length=MAX_PROMPT_LENGTH,
        max_completion_length=MAX_COMPLETION_LENGTH,

        # Training duration
        num_train_epochs=1,

        # Optimization
        bf16=True,
        optim="adamw_8bit",
        max_grad_norm=0.1,

        # Logging
        logging_steps=1,
        save_steps=100,
        report_to="wandb",  # Change to "none" to disable logging
    )

    # Initialize trainer
    trainer = GRPOTrainer(
        model=model,
        processing_class=tokenizer,
        reward_funcs=[
            incremental_format_reward_func,
            format_reward_func,
            correctness_reward_func,
        ],
        args=training_args,
        train_dataset=dataset,
        peft_config=get_peft_config(),
    )

    # Train
    print("Starting training...")
    trainer.train()

    # Save final model
    print(f"Saving model to {OUTPUT_DIR}/final")
    trainer.save_model(f"{OUTPUT_DIR}/final")

    print("Training complete!")

if __name__ == "__main__":
    main()

Related skills

Microsoft FoundryDeploy, evaluate, and continuously improve Microsoft Foundry agents from a single agent interface.478k1.3k

Ai Research ReproductionOrchestrate trustworthy, auditable reproduction of deep learning repositories directly from their READMEs.164k507

Run TrainSafely execute selected deep learning training commands with standardized evidence capture.164k507

Explore RunSafely run isolated exploratory experiments with clear recording and conservative selection before committing changes.164k507

Paper Context ResolverFetch precise reproduction-critical details like dataset splits, preprocessing steps, or evaluation protocols from the original academic paper when the repo README leav141k507

Repo Intake And PlanScan unfamiliar AI research repositories and receive a minimal, trustworthy reproduction target before investing significant time.140k507

FAQ

Is Grpo Rl Training safe to install?

skills.sh reports 1 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.

Data Science & MLllmautomation