Openclaw Rl Training

Name: Openclaw Rl Training
Author: aradotso

aradotso/trending-skills

1.3k installs
66 repo stars
Updated July 9, 2026
aradotso/trending-skills

openclaw-rl-training is an agent skill for train personalized agents with openclaw-rl reinforcement learning from conversation feedback.

About

The openclaw-rl-training skill is designed for train personalized agents with OpenClaw-RL reinforcement learning from conversation feedback. OpenClaw-RL Training > Skill by ara.so — Daily 2026 Skills collection. OpenClaw-RL is a fully asynchronous reinforcement learning framework that converts live multi-turn conversations into training signals for personalized AI agents. Invoke when the user explores OpenClaw-RL, agent personalization, or RL from conversation feedback.

main turns: Multi-turn interactions that form training trajectories.
side turns: Non-trainable system/utility turns excluded from training.
Verify next_state fields contain meaningful feedback signals.
Check judge model prompt template matches expected format.
Try increasing majority vote N: --majority-vote-n 7.

Openclaw Rl Training by the numbers

1,264 all-time installs (skills.sh)
+6 installs in the week ending Jul 28, 2026 (Skillselion tracking)
Ranked #883 of 16,659 AI & Agent Building skills by installs in the Skillselion catalog
Security screen: MEDIUM risk (skills.sh audit)
Data as of Jul 28, 2026 (Skillselion catalog sync)

At a glance

openclaw-rl-training capabilities & compatibility

Capabilities: main turns: multi turn interactions that form tr · side turns: non trainable system/utility turns e · verify next_state fields contain meaningful feed · check judge model prompt template matches expect

From the docs

What openclaw-rl-training says it does

OpenClaw-RL framework for training personalized AI agents via reinforcement learning from natural conversation feedback

SKILL.md

OpenClaw-RL framework for training personalized AI agents via reinforcement learning from natural conversation feedback

SKILL.md

npx skills add https://github.com/aradotso/trending-skills --skill openclaw-rl-training

Add your badge

Show developers this skill is listed on Skillselion. Paste this into your README.

[![Listed on Skillselion](https://skillselion.com/badge/skills/aradotso/trending-skills/openclaw-rl-training.svg)](https://skillselion.com/skills/aradotso/trending-skills/openclaw-rl-training)

Installs	1.3k
repo stars	★ 66
Security audit	1 / 3 scanners passed
Last updated	July 9, 2026
Repository	aradotso/trending-skills ↗

How do I train personalized agents with openclaw-rl reinforcement learning from conversation feedback?

Train personalized agents with OpenClaw-RL reinforcement learning from conversation feedback.

Who is it for?

Researchers implementing RL personalization for conversational agents.

Skip if: Skip for static prompt tuning without RL training infrastructure.

When should I use this skill?

User explores OpenClaw-RL, agent personalization, or RL from conversation feedback.

What you get

Completed openclaw-rl-training workflow with documented commands, files, and expected deliverables.

async RL pipeline config
conversation reward datasets
trained agent checkpoints

Files

SKILL.mdMarkdownGitHub ↗

OpenClaw-RL Training

Skill by ara.so — Daily 2026 Skills collection.

OpenClaw-RL is a fully asynchronous reinforcement learning framework that converts live multi-turn conversations into training signals for personalized AI agents. It wraps a self-hosted model as an OpenAI-compatible API via OpenClaw, intercepts conversations, and continuously optimizes the policy in the background without interrupting usage. It also supports scalable RL for terminal, GUI, SWE, and tool-call agents.

Architecture Overview

Four independent async loops that never block each other: 1. Agent Serving — OpenClaw-compatible API serving rollouts 2. Rollout Collection — Captures multi-turn conversations as training trajectories 3. PRM/Judge Evaluation — Scores turns using next-state feedback (majority voting optional) 4. Policy Training — GRPO/OPD/Combine training via slime or Tinker

Installation

git clone https://github.com/Gen-Verse/OpenClaw-RL
cd OpenClaw-RL

# Install core dependencies
pip install -r requirements.txt

# Install slime (training backend)
cd slime && pip install -e . && cd ..

# Optional: install SGLang for fast inference
pip install sglang

Project Structure

OpenClaw-RL/
├── openclaw-rl/          # Binary RL (GRPO) method
├── openclaw-opd/         # On-Policy Distillation method
├── openclaw-combine/     # Combined Binary RL + OPD
├── openclaw-test/        # Evaluation utilities
├── terminal-rl/          # Track 2: Terminal agent RL
├── gui-rl/               # Track 2: GUI agent RL
├── swe-rl/               # Track 2: SWE agent RL
├── toolcall-rl/          # Track 2: Tool-call agent RL
├── slime/                # Core training framework
└── openclaw/             # Runtime / API server

Three Learning Paradigms

1. Binary RL (GRPO)

A Process Reward Model scores each turn from next-state feedback. Uses GRPO advantage estimation with PPO-style clipped surrogate loss.

2. On-Policy Distillation (OPD)

When next state reveals useful hindsight, a judge extracts a textual hint to augment the prompt, creating an enhanced teacher. Token-level log-probability gap becomes a directional advantage signal.

3. Combination Method (Recommended)

Merges Binary RL scalar supervision with OPD token-level directional signal. Strongest and most robust optimization.

Quick Start — Personal Agent (Track 1)

Binary RL Launch Script

# openclaw-rl/run_qwen3_7b_openclaw_rl.sh
export MODEL_PATH=/path/to/qwen3-7b
export DATA_PATH=/path/to/conversation/data
export CKPT_SAVE_DIR=/path/to/checkpoints

bash openclaw-rl/run_qwen3_7b_openclaw_rl.sh

OPD Launch Script

export MODEL_PATH=/path/to/qwen3-7b
export JUDGE_MODEL_PATH=/path/to/judge-model
export DATA_PATH=/path/to/conversation/data

bash openclaw-opd/run_qwen3_7b_openclaw_opd.sh

Combination Method (One Line)

# Launch with combined Binary RL + OPD
bash openclaw-combine/run_qwen3_7b_openclaw_combine.sh

Configuration — Key Environment Variables

# Model configuration
export MODEL_PATH=/path/to/base/model
export JUDGE_MODEL_PATH=/path/to/judge/model   # For OPD
export PRM_MODEL_PATH=/path/to/prm/model       # For Binary RL

# Training configuration
export CKPT_SAVE_DIR=./checkpoints
export CKPT_ARGS="--save-interval 100 --save-dir $CKPT_SAVE_DIR"

# Rollout configuration
export ROLLOUT_ARGS="--rollout-batch-size 64 --num-rollouts-per-prompt 4"

# Optimizer configuration
export OPTIMIZER_ARGS="--lr 1e-6 --weight-decay 0.01 --adam-beta1 0.9 --adam-beta2 0.999"

# GPU partitioning (e.g., 8 GPUs: 4 for training, 4 for rollout)
export TRAIN_GPUS="0,1,2,3"
export ROLLOUT_GPUS="4,5,6,7"

# LoRA (optional, reduces GPU memory)
export LORA_ARGS="--lora-rank 64 --lora-alpha 128 --lora-dropout 0.05"

LoRA Training

# Add LoRA args to any launch script
export LORA_ARGS="--use-lora --lora-rank 64 --lora-alpha 128"

# Example: LoRA Binary RL
bash openclaw-rl/run_qwen3_7b_lora_openclaw_rl.sh

Custom Loss / Rollout Functions (Plugin API)

The slime framework exposes extension points without modifying core code:

# Custom loss function
--custom-loss-function-path ./my_method/custom_loss.py

# Custom rollout function  
--rollout-function-path ./my_method/custom_rollout.py

# Custom generation function
--custom-generate-function-path ./my_method/custom_generate.py

# Custom reward model
--custom-rm-path ./my_method/custom_rm.py

Example Custom Loss (TypeScript-style config, Python implementation)

# my_method/custom_loss.py
import torch
from typing import Dict, Any

def compute_loss(
    policy_logits: torch.Tensor,
    reference_logits: torch.Tensor,
    rewards: torch.Tensor,
    advantages: torch.Tensor,
    config: Dict[str, Any]
) -> torch.Tensor:
    """
    Custom GRPO-style loss with clipped surrogate objective.
    """
    # Log-ratio between policy and reference
    log_ratio = policy_logits - reference_logits
    ratio = torch.exp(log_ratio)
    
    clip_range = config.get("clip_range", 0.2)
    
    # PPO-style clipped objective
    clipped = torch.clamp(ratio, 1 - clip_range, 1 + clip_range)
    loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    
    # KL penalty
    kl_coeff = config.get("kl_coeff", 0.01)
    kl_penalty = kl_coeff * log_ratio.mean()
    
    return loss + kl_penalty

Example Custom Reward Model

# my_method/custom_rm.py
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

class CustomPRM:
    def __init__(self, model_path: str):
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.model = AutoModelForSequenceClassification.from_pretrained(
            model_path, torch_dtype=torch.bfloat16
        )
        self.model.eval()

    def score(self, prompt: str, response: str, next_state: str) -> float:
        """
        Score a turn given prompt, response, and next-state feedback.
        """
        combined = f"Prompt: {prompt}\nResponse: {response}\nOutcome: {next_state}"
        inputs = self.tokenizer(combined, return_tensors="pt", truncation=True, max_length=2048)
        
        with torch.no_grad():
            logits = self.model(**inputs).logits
        
        # Binary reward: positive class probability
        return torch.softmax(logits, dim=-1)[0, 1].item()


def get_reward_model(config):
    return CustomPRM(config["prm_model_path"])

Deploying on Tinker (Cloud)

# One-line cloud deployment — Hybrid RL, OPD, Binary RL all supported
export TINKER_API_KEY=$TINKER_API_KEY
export TINKER_ENDPOINT=$TINKER_ENDPOINT

# Submit job via Ray
ray job submit --address $TINKER_ENDPOINT \
  --working-dir . \
  -- bash openclaw-combine/run_qwen3_7b_openclaw_combine.sh

Track 2 — General Agentic RL

Terminal Agent RL

export ENV_TYPE=terminal
export MAX_STEPS=20
export PARALLEL_ENVS=32   # Number of parallel environment instances

bash terminal-rl/run_terminal_rl.sh

GUI Agent RL

export ENV_TYPE=gui
export SCREENSHOT_BACKEND=playwright   # or selenium
export PARALLEL_ENVS=16

bash gui-rl/run_gui_rl.sh

Tool-Call Agent RL

export ENV_TYPE=toolcall
export TOOLS_CONFIG=./toolcall-rl/tools_config.json
export PARALLEL_ENVS=64

bash toolcall-rl/run_toolcall_rl.sh

SWE Agent RL

export ENV_TYPE=swe
export SWE_BENCH_PATH=/path/to/swe-bench
export PARALLEL_ENVS=8   # SWE environments are heavier

bash swe-rl/run_swe_rl.sh

Data Format — Conversation Trajectories

OpenClaw-RL automatically classifies API messages. Manual format for custom data:

{
  "session_id": "user_session_abc123",
  "turns": [
    {
      "type": "main",
      "prompt": "Help me refactor this function to use async/await",
      "response": "Here's the refactored version: ...",
      "next_state": "User accepted the change and said 'perfect, thanks!'",
      "trainable": true
    },
    {
      "type": "side", 
      "prompt": "What is 2+2?",
      "response": "4",
      "trainable": false
    }
  ]
}

`main` turns: Multi-turn interactions that form training trajectories
`side` turns: Non-trainable system/utility turns excluded from training

OpenClaw API Server Setup

# Start OpenClaw-compatible API server wrapping your model
export BASE_MODEL_PATH=/path/to/your/model
export OPENCLAW_PORT=8000
export OPENCLAW_HOST=0.0.0.0

# Using SGLang backend (recommended for speed)
python -m openclaw.server \
  --model-path $BASE_MODEL_PATH \
  --port $OPENCLAW_PORT \
  --backend sglang \
  --enable-rl-intercept          # Enable conversation capture for RL
  --rl-buffer-dir ./rl_buffer    # Where to store captured trajectories

// Using the server as OpenAI-compatible API in TypeScript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:8000/v1",
  apiKey: process.env.OPENCLAW_API_KEY ?? "local",
});

const response = await client.chat.completions.create({
  model: "your-model-name",
  messages: [
    { role: "user", content: "Help me write a sorting algorithm" }
  ],
  stream: true,
});

for await (const chunk of response) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}

Majority Voting for Robust PRM Scoring

# Enable majority voting for more robust reward estimation
export MAJORITY_VOTE_N=5   # Number of judge calls per turn
export MAJORITY_VOTE_THRESHOLD=0.6

# Add to your launch script args:
--majority-vote-n $MAJORITY_VOTE_N \
--majority-vote-threshold $MAJORITY_VOTE_THRESHOLD

Adding a New Method (Contribution Pattern)

# 1. Create a new top-level folder
mkdir my-new-method
cd my-new-method

# 2. Required files
touch README.md                           # Document what, how, env vars
touch run_qwen3_7b_my_method.sh          # Launch script
touch custom_loss.py                      # If custom loss needed
touch custom_rollout.py                   # If custom rollout needed

# run_qwen3_7b_my_method.sh — follow existing conventions
#!/bin/bash
set -e

MODEL_SIZE="7b"
MODEL_PATH=${MODEL_PATH:-/path/to/qwen3-7b}
CKPT_SAVE_DIR=${CKPT_SAVE_DIR:-./checkpoints/my-method}

CKPT_ARGS="--save-interval 50 --save-dir $CKPT_SAVE_DIR"
ROLLOUT_ARGS="--rollout-batch-size 32 --num-rollouts-per-prompt 4"
OPTIMIZER_ARGS="--lr 1e-6 --weight-decay 0.01"

ray job submit --working-dir .. -- \
  python slime/train.py \
    --model-path $MODEL_PATH \
    --custom-loss-function-path my-new-method/custom_loss.py \
    $CKPT_ARGS $ROLLOUT_ARGS $OPTIMIZER_ARGS

Common Patterns

Monitor Training Progress

# View Ray dashboard
ray dashboard  # Opens at http://localhost:8265

# Watch checkpoint saves
watch -n 10 ls -la $CKPT_SAVE_DIR

# Stream training logs
tail -f ./logs/training.log

Resume from Checkpoint

export RESUME_CKPT=$CKPT_SAVE_DIR/checkpoint-500
# Add to launch script:
--resume-from-checkpoint $RESUME_CKPT

Evaluate Trained Checkpoints

bash openclaw-test/run_eval.sh \
  --model-path $CKPT_SAVE_DIR/checkpoint-latest \
  --eval-tasks "conversation,coding,tool-use"

Troubleshooting

Out of GPU memory during rollout + training:

# Use LoRA to reduce memory footprint
export LORA_ARGS="--use-lora --lora-rank 32"
# Or reduce parallel environments
export PARALLEL_ENVS=8
# Or use offloading
--offload-optimizer-state

Async loop falling behind (buffer overflow):

# Reduce rollout batch size or increase judge throughput
export ROLLOUT_ARGS="--rollout-batch-size 16"
# Or add more judge workers
--num-judge-workers 4

PRM scores all near 0.5 (reward collapse):

Verify next_state fields contain meaningful feedback signals
Check judge model prompt template matches expected format
Try increasing majority vote N: --majority-vote-n 7

SGLang server not starting:

# Check SGLang version compatibility
pip install sglang==0.4.x  # Check slime/requirements.txt for pinned version
# Fallback to vLLM backend
--backend vllm

Ray job submission fails:

# Start Ray cluster first
ray start --head --num-gpus=$(nvidia-smi -L | wc -l)
# Then submit job
ray job submit --address auto -- bash run.sh

Key References

Related skills

Setup Matt Pocock SkillsScaffold the per-repo configuration that Matt Pocock’s engineering agent skills rely on so they understand the issue tracker, triage labels, and domain documentation la462k185k

Lark Skill MakerQuickly turn any Lark/Feishu OpenAPI call or multi-step workflow into a reusable agent skill with its own SKILL.md.379k15.8k

CavemanSlash token usage by roughly 75% while keeping every technical detail intact when working with Claude Code, Cursor or similar agents.378k92.5k

Lark AppsConnect Claude, Cursor or custom agents directly to Lark (Feishu) for messaging, document automation, approval workflows and enterprise data access.375k

Running Claude Code Via Litellm CopilotRun Claude Code at a fraction of the cost by routing requests through LiteLLM to the GitHub Copilot Chat API.270k72

Codex PetGenerate a complete Codex Pet spritesheet and metadata from one reference image without needing an OpenAI key or Codex Pro.246k8

How it compares

Pick openclaw-rl-training for continuous RL from live chats; use static eval or prompt-only skills when you are not running OpenClaw-RL infrastructure.

FAQ

What does openclaw-rl-training do?