
Openclaw Rl Training
Stand up OpenClaw-RL so live agent chats become GRPO/OPD training signal without blocking your serving API.
Overview
OpenClaw-RL Training is an agent skill for the Build phase that configures asynchronous GRPO/OPD reinforcement learning from live multi-turn conversations served via OpenClaw.
Install
npx skills add https://github.com/aradotso/trending-skills --skill openclaw-rl-trainingWhat is this skill?
- Four independent async loops: agent serving, rollout collection, PRM/judge scoring, and background policy training
- GRPO, OPD, and combined training paths via slime or Tinker with optional local GPU deploy
- OpenClaw-compatible OpenAI API wrapper for multi-turn trajectories from real conversations
- Scalable agentic RL for terminal, GUI, SWE, and tool-call agents
- Next-state feedback judging with optional majority voting on turns
- Four independent async loops that do not block each other
Adoption & trust: 1.2k installs on skills.sh; 31 GitHub stars; 1/3 security scanners passed (skills.sh audits).
What problem does it solve?
You have a conversational agent in production but no pipeline to turn real chats and implicit feedback into continuous policy improvements.
Who is it for?
Indie builders self-hosting agents who already use or plan OpenClaw, slime/Tinker, and GPU capacity for personalized RL.
Skip if: Teams that only need prompt tuning, lack GPU/training infra, or are not ready to operate an OpenAI-compatible agent API.
When should I use this skill?
train an agent with OpenClaw-RL, set up reinforcement learning for my AI agent, configure async RL training pipeline, or deploy OpenClaw-RL with Tinker or local GPU
What do I get? / Deliverables
You get a documented async serving, rollout, judge, and training stack so the agent keeps serving while the policy updates from conversation signal.
- OpenClaw-served rollout API with conversation capture
- Judge/PRM scoring and GRPO/OPD training configuration
Recommended Skills
Journey fit
Training and serving personalized agents is core product engineering once you already ship an agent—not discovery or launch work. Agent-tooling is the canonical shelf for RL loops, rollout capture, and policy updates on self-hosted models.
How it compares
Training pipeline skill for live RL—not a generic MCP server or a static skills.sh prompt pack.
Common Questions / FAQ
Who is openclaw-rl-training for?
Solo and small-team builders shipping personalized AI agents who want on-policy or GRPO-style updates from real user conversations rather than offline-only fine-tuning.
When should I use openclaw-rl-training?
During Build when wiring agent serving, async rollout collection, judge/PRM scoring, and background training—or when evaluating terminal, GUI, or tool-agent RL with OpenClaw and Tinker/local GPU.
Is openclaw-rl-training safe to install?
Review the Security Audits panel on this Prism page and inspect the skill repo before granting shell, network, and secrets access needed for training and deploy.
SKILL.md
READMESKILL.md - Openclaw Rl Training
# OpenClaw-RL Training > Skill by [ara.so](https://ara.so) — Daily 2026 Skills collection. OpenClaw-RL is a fully asynchronous reinforcement learning framework that converts live multi-turn conversations into training signals for personalized AI agents. It wraps a self-hosted model as an OpenAI-compatible API via [OpenClaw](https://openclaw.ai), intercepts conversations, and continuously optimizes the policy in the background without interrupting usage. It also supports scalable RL for terminal, GUI, SWE, and tool-call agents. ## Architecture Overview Four independent async loops that never block each other: 1. **Agent Serving** — OpenClaw-compatible API serving rollouts 2. **Rollout Collection** — Captures multi-turn conversations as training trajectories 3. **PRM/Judge Evaluation** — Scores turns using next-state feedback (majority voting optional) 4. **Policy Training** — GRPO/OPD/Combine training via [slime](https://github.com/THUDM/slime) or [Tinker](https://thinkingmachines.ai/tinker/) ## Installation ```bash git clone https://github.com/Gen-Verse/OpenClaw-RL cd OpenClaw-RL # Install core dependencies pip install -r requirements.txt # Install slime (training backend) cd slime && pip install -e . && cd .. # Optional: install SGLang for fast inference pip install sglang ``` ## Project Structure ``` OpenClaw-RL/ ├── openclaw-rl/ # Binary RL (GRPO) method ├── openclaw-opd/ # On-Policy Distillation method ├── openclaw-combine/ # Combined Binary RL + OPD ├── openclaw-test/ # Evaluation utilities ├── terminal-rl/ # Track 2: Terminal agent RL ├── gui-rl/ # Track 2: GUI agent RL ├── swe-rl/ # Track 2: SWE agent RL ├── toolcall-rl/ # Track 2: Tool-call agent RL ├── slime/ # Core training framework └── openclaw/ # Runtime / API server ``` ## Three Learning Paradigms ### 1. Binary RL (GRPO) A Process Reward Model scores each turn from next-state feedback. Uses GRPO advantage estimation with PPO-style clipped surrogate loss. ### 2. On-Policy Distillation (OPD) When next state reveals useful hindsight, a judge extracts a textual hint to augment the prompt, creating an enhanced teacher. Token-level log-probability gap becomes a directional advantage signal. ### 3. Combination Method (Recommended) Merges Binary RL scalar supervision with OPD token-level directional signal. Strongest and most robust optimization. ## Quick Start — Personal Agent (Track 1) ### Binary RL Launch Script ```bash # openclaw-rl/run_qwen3_7b_openclaw_rl.sh export MODEL_PATH=/path/to/qwen3-7b export DATA_PATH=/path/to/conversation/data export CKPT_SAVE_DIR=/path/to/checkpoints bash openclaw-rl/run_qwen3_7b_openclaw_rl.sh ``` ### OPD Launch Script ```bash export MODEL_PATH=/path/to/qwen3-7b export JUDGE_MODEL_PATH=/path/to/judge-model export DATA_PATH=/path/to/conversation/data bash openclaw-opd/run_qwen3_7b_openclaw_opd.sh ``` ### Combination Method (One Line) ```bash # Launch with combined Binary RL + OPD bash openclaw-combine/run_qwen3_7b_openclaw_combine.sh ``` ## Configuration — Key Environment Variables ```bash # Model configuration export MODEL_PATH=/path/to/base/model export JUDGE_MODEL_PATH=/path/to/judge/model # For OPD export PRM_MODEL_PATH=/path/to/prm/model # For Binary RL # Training configuration export CKPT_SAVE_DIR=./checkpoints export CKPT_ARGS="--save-interval 100 --save-dir $CKPT_SAVE_DIR" # Rollout conf