
Grpo Rl Training
Copy battle-tested GRPO reward functions into your RL fine-tuning loop so group-relative policy optimization scores completions on correctness, format, length, and style.
Install
npx skills add https://github.com/orchestra-research/ai-research-skills --skill grpo-rl-trainingWhat is this skill?
- Library of reward functions across correctness, format, length, style, and combined multi-objective scoring
- Includes exact-match and fuzzy-match correctness rewards with documented weight guidance (e.g. 2.0 for verifiable tasks)
- Designed for common GRPO training scenarios with copy-paste adaptation hooks
- Python implementations with extract_answer integration points for structured outputs
Adoption & trust: 1 installs on skills.sh; 9.4k GitHub stars; 1/3 security scanners passed (skills.sh audits).
Recommended Skills
Paper Context Resolverlllllllama/ai-paper-reproduction-skill
Repo Intake And Planlllllllama/ai-paper-reproduction-skill
Env And Assets Bootstraplllllllama/ai-paper-reproduction-skill
Minimal Run And Auditlllllllama/ai-paper-reproduction-skill
Analyze Projectlllllllama/rigorpilot-skills
Ai Research Reproductionlllllllama/rigorpilot-skills
Journey fit
Primary fit
Reward shaping and training loops sit in the build phase when you implement model fine-tuning and evaluation pipelines. GRPO training code is backend/ML infrastructure—not UI—so backend is the canonical shelf for RL reward libraries.
Common Questions / FAQ
Is Grpo Rl Training safe to install?
skills.sh reports 1 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.
SKILL.md
READMESKILL.md - Grpo Rl Training
""" GRPO Reward Functions Library =============================== A collection of battle-tested reward functions for common GRPO training scenarios. Copy and adapt these for your specific use case. Categories: - Correctness rewards (verifiable tasks) - Format rewards (structured output) - Length rewards (verbosity control) - Style rewards (quality and tone) - Combined rewards (multi-objective) """ import re from typing import List, Any # ==================== CORRECTNESS REWARDS ==================== def exact_match_reward(prompts, completions, answer, **kwargs) -> List[float]: """ Binary reward for exact answer match. Use for: Math problems, factual Q&A, code output Weight: 2.0 (highest priority) """ responses = [comp[0]['content'] for comp in completions] extracted = [extract_answer(r) for r in responses] return [2.0 if ans.strip() == gt.strip() else 0.0 for ans, gt in zip(extracted, answer)] def fuzzy_match_reward(prompts, completions, answer, **kwargs) -> List[float]: """ Partial credit for similar answers. Use for: Open-ended answers, summaries Weight: 1.0 """ from difflib import SequenceMatcher responses = [comp[0]['content'] for comp in completions] extracted = [extract_answer(r) for r in responses] rewards = [] for ans, gt in zip(extracted, answer): similarity = SequenceMatcher(None, ans.lower(), gt.lower()).ratio() rewards.append(similarity) return rewards def numeric_correctness_reward(prompts, completions, answer, tolerance=0.01, **kwargs) -> List[float]: """ Reward numeric answers within tolerance. Use for: Math, physics, engineering problems Weight: 2.0 """ responses = [comp[0]['content'] for comp in completions] extracted = [extract_answer(r) for r in responses] rewards = [] for ans, gt in zip(extracted, answer): try: ans_num = float(ans.replace(',', '')) gt_num = float(gt.replace(',', '')) if abs(ans_num - gt_num) / max(abs(gt_num), 1e-8) <= tolerance: rewards.append(2.0) else: rewards.append(0.0) except: rewards.append(0.0) return rewards def code_execution_reward(prompts, completions, test_cases, **kwargs) -> List[float]: """ Execute code and verify against test cases. Use for: Code generation tasks Weight: 2.0 """ responses = [comp[0]['content'] for comp in completions] extracted_code = [extract_code_block(r) for r in responses] rewards = [] for code in extracted_code: try: # Execute code (sandboxed!) passed = run_test_cases(code, test_cases) rewards.append(2.0 if passed else 0.0) except: rewards.append(0.0) return rewards # ==================== FORMAT REWARDS ==================== def strict_xml_format_reward(completions, **kwargs) -> List[float]: """ Strict XML format: exact newlines and spacing. Use for: When format must be EXACTLY specified Weight: 0.5 """ pattern = r'^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n$' responses = [comp[0]['content'] for comp in completions] matches = [re.match(pattern, r, re.DOTALL) for r in responses] return [0.5 if match else 0.0 for match in matches] def soft_xml_format_reward(completions, **kwargs) -> List[float]: """ Relaxed XML format: allows whitespace variations. Use for: When structure matters more than exact spacing Weight: 0.5 """ pattern = r'<reasoning>.*?</reasoning>\s*<answer>.*?</answer>' responses = [comp[0]['content'] for comp in completions] matches = [re.search(pattern, r, re.DOTALL) for r in responses] return [0.5 if match else 0.0 for match in matches] def json_format_reward(completions, **kwargs) -> List[float]: """ Reward valid JSON output. Use for: Structured data extraction, API resp