Slime Rl Training

Name: Slime Rl Training
Author: orchestra-research

orchestra-research/ai-research-skills

396 installs
11.2k repo stars
Updated June 16, 2026
slime-rl-training is a Claude Code skill that guides developers through configuring Ray-orchestrated reinforcement learning loops with Megatron-LM actors, SGLang rollouts, and slime sample buffers when fine-tuning agent policies.

About

slime-rl-training is an AI research skill for Ray-orchestrated RL fine-tuning of agent policies using the slime framework. The architecture splits into three modules: a Data Buffer for prompt initialization, custom data generation, filtering, and rollout sample storage; a Megatron-LM Training module for actor model updates; and an SGLang Rollout module with a router for response generation during rollouts. Ray coordinates data flow between buffer, trainer, and rollout workers so RL loops scale across GPUs. Developers reach for slime-rl-training when building agent RL pipelines that need Megatron-scale training integrated with SGLang inference rollouts instead of hand-wiring separate trainer and sampler scripts.

Documents slime’s three-module Ray layout: data buffer, Megatron-LM training, and SGLang rollout with router
Defines the core Sample dataclass fields (prompt, tokens, response, group_index) from slime.utils.types
Covers actor training, optional critic, and weight sync from training into rollout workers
Explains rollout generation, reward/verifier outputs, and multi-turn response support

Slime Rl Training by the numbers

396 all-time installs (skills.sh)
+35 installs in the week ending Jul 18, 2026 (Skillselion tracking)
Ranked #1,953 of 16,659 AI & Agent Building skills by installs in the Skillselion catalog
Security screen: MEDIUM risk (skills.sh audit)
Data as of Jul 28, 2026 (Skillselion catalog sync)

npx skills add https://github.com/orchestra-research/ai-research-skills --skill slime-rl-training

Security audit	3 / 3 scanners passed

How do you set up slime RL training with Ray?

Configure Ray-orchestrated RL training loops with Megatron-LM actors, SGLang rollouts, and slime Sample buffers when fine-tuning agent policies.

Who is it for?

ML engineers building agent RL pipelines who need Ray, Megatron-LM, SGLang, and slime buffer modules wired into one training loop.

Skip if: you only need simple supervised fine-tuning with Hugging Face Trainer, or your RL project does not use Ray, Megatron-LM, or SGLang.

When should I use this skill?

User asks to configure slime RL training, Ray Megatron SGLang rollouts, or agent policy RL fine-tuning loops.

What you get

Ray-coordinated RL training config, Megatron actor checkpoints, SGLang rollout pipeline, and slime sample buffer artifacts.

RL training configuration
Actor model checkpoints
Rollout sample buffer dataset

By the numbers

Three-module architecture: Data Buffer, Megatron-LM Training, SGLang Rollout

slime: LLM Post-Training Framework for RL Scaling

slime is an LLM post-training framework from Tsinghua's THUDM team, powering GLM-4.5, GLM-4.6, and GLM-4.7. It connects Megatron-LM for training with SGLang for high-throughput rollout generation.

When to Use slime

Choose slime when you need:

Megatron-LM native training with SGLang inference
Custom data generation workflows with flexible data buffers
Training GLM, Qwen3, DeepSeek V3, or Llama 3 models
Research-grade framework with production backing (Z.ai)

Consider alternatives when:

You need enterprise-grade stability features → use miles
You want flexible backend swapping → use verl
You need PyTorch-native abstractions → use torchforge

Key Features

Training: Megatron-LM with full parallelism support (TP, PP, DP, SP)
Rollout: SGLang-based high-throughput generation with router
Data Buffer: Flexible prompt management and sample storage
Models: GLM-4.x, Qwen3, DeepSeek V3/R1, Llama 3

Architecture Overview

┌─────────────────────────────────────────────────────────┐
│                    Data Buffer                          │
│ - Prompt initialization and management                  │
│ - Custom data generation and filtering                  │
│ - Rollout sample storage                                │
└─────────────┬───────────────────────────┬───────────────┘
              │                           │
┌─────────────▼───────────┐ ┌─────────────▼───────────────┐
│ Training (Megatron-LM)  │ │ Rollout (SGLang + Router)   │
│ - Actor model training  │ │ - Response generation       │
│ - Critic (optional)     │ │ - Reward/verifier output    │
│ - Weight sync to rollout│ │ - Multi-turn support        │
└─────────────────────────┘ └─────────────────────────────┘
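
The data flow in the diagram can be sketched in plain Python. This is a toy, single-process illustration only: `ToyBuffer`, `toy_rollout`, `toy_train`, and `run_loop` are hypothetical stand-ins rather than slime classes, and the real system runs these stages as Ray workers spread across GPUs.

```python
from dataclasses import dataclass, field


@dataclass
class ToyBuffer:
    """Hypothetical stand-in for the Data Buffer module."""
    prompts: list[str]
    samples: list[dict] = field(default_factory=list)
    cursor: int = 0

    def get_prompts(self, n: int) -> list[str]:
        # Cycle through the prompt list so the loop never runs dry.
        batch = [self.prompts[(self.cursor + i) % len(self.prompts)] for i in range(n)]
        self.cursor += n
        return batch


def toy_rollout(prompts: list[str], weight_version: int) -> list[dict]:
    """Stand-in for SGLang generation plus reward/verifier output."""
    return [
        {"prompt": p, "response": p.upper(), "reward": float(len(p) % 2),
         "weight_version": weight_version}
        for p in prompts
    ]


def toy_train(samples: list[dict], weight_version: int) -> int:
    """Stand-in for a Megatron-LM actor update; returns the new weight version."""
    assert samples, "training needs at least one sample"
    return weight_version + 1


def run_loop(buffer: ToyBuffer, num_rollout: int, rollout_batch_size: int) -> int:
    weight_version = 0
    for _ in range(num_rollout):
        prompts = buffer.get_prompts(rollout_batch_size)     # buffer -> rollout
        samples = toy_rollout(prompts, weight_version)        # generate and score
        buffer.samples.extend(samples)                        # rollout -> buffer
        weight_version = toy_train(samples, weight_version)   # buffer -> training
        # Weight sync: the next rollout sees the updated weight_version.
    return weight_version


buffer = ToyBuffer(prompts=["a", "bb", "ccc"])
final_version = run_loop(buffer, num_rollout=3, rollout_batch_size=2)
print(final_version, len(buffer.samples))
```

Each stored sample records the weight version that generated it, which is the kind of bookkeeping that becomes important once rollout and training overlap.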

Installation

# Recommended: Docker
docker pull slimerl/slime:latest
docker run --rm --gpus all --ipc=host --shm-size=16g \
  -it slimerl/slime:latest /bin/bash

# Inside container
cd /root/slime && pip install -e . --no-deps

From Source

git clone https://github.com/THUDM/slime.git
cd slime
pip install -r requirements.txt
pip install -e .

Quick Start: GRPO Training

# Source model configuration
source scripts/models/qwen3-4B.sh

# Launch training
python train.py \
    --actor-num-nodes 1 \
    --actor-num-gpus-per-node 4 \
    --rollout-num-gpus 4 \
    --advantage-estimator grpo \
    --use-kl-loss --kl-loss-coef 0.001 \
    --rollout-batch-size 32 \
    --n-samples-per-prompt 8 \
    --global-batch-size 256 \
    --num-rollout 3000 \
    --prompt-data /path/to/data.jsonl \
    ${MODEL_ARGS[@]} ${CKPT_ARGS[@]}

---

Workflow 1: Standard GRPO Training

Use this workflow for training reasoning models with group-relative advantages.
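
To make "group-relative advantages" concrete, here is a minimal sketch of the commonly described GRPO normalization: each response's reward is centered and scaled by the statistics of the other responses sampled for the same prompt. The function name and the `eps` value are assumptions for illustration; slime's own estimator may differ in details such as how the standard deviation is computed.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one prompt's group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, four sampled responses: two correct, two wrong.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
print([round(a, 3) for a in advantages])

# If every response gets the same reward, the group carries no learning signal.
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
print(uniform)
```

The second call shows why prompts that are always solved (or never solved) contribute nothing to the update: every advantage in the group is zero.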

Prerequisites Checklist

[ ] Docker environment or Megatron-LM + SGLang installed
[ ] Model checkpoint (HuggingFace or Megatron format)
[ ] Training data in JSONL format

Step 1: Prepare Data

# data.jsonl format
{"prompt": "What is 2 + 2?", "label": "4"}
{"prompt": "Solve: 3x = 12", "label": "x = 4"}

Or with chat format (expanded here for readability; in the .jsonl file each record must sit on a single line):

{
    "prompt": [
        {"role": "system", "content": "You are a math tutor."},
        {"role": "user", "content": "What is 15 + 27?"}
    ],
    "label": "42"
}
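
A small sketch of producing such a file from Python, assuming the `prompt` and `label` keys shown above. The temporary output path is only a placeholder for this example; `json.dumps` keeps each record on one line, which is what the JSONL format requires.

```python
import json
import os
import tempfile

records = [
    {"prompt": "What is 2 + 2?", "label": "4"},
    {
        "prompt": [
            {"role": "system", "content": "You are a math tutor."},
            {"role": "user", "content": "What is 15 + 27?"},
        ],
        "label": "42",
    },
]

path = os.path.join(tempfile.mkdtemp(), "data.jsonl")  # placeholder output location
with open(path, "w", encoding="utf-8") as f:
    for record in records:
        # json.dumps never emits raw newlines, so each record stays on one line.
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Read the file back to confirm it round-trips.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
print(len(loaded), type(loaded[1]["prompt"]).__name__)
```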

Step 2: Configure Model

Choose a pre-configured model script:

# List available models
ls scripts/models/
# glm4-9B.sh, qwen3-4B.sh, qwen3-30B-A3B.sh, deepseek-v3.sh, llama3-8B.sh, ...

# Source your model
source scripts/models/qwen3-4B.sh

Step 3: Launch Training

python train.py \
    --actor-num-nodes 1 \
    --actor-num-gpus-per-node 8 \
    --rollout-num-gpus 8 \
    --advantage-estimator grpo \
    --use-kl-loss \
    --kl-loss-coef 0.001 \
    --prompt-data /path/to/train.jsonl \
    --input-key prompt \
    --label-key label \
    --apply-chat-template \
    --rollout-batch-size 32 \
    --n-samples-per-prompt 8 \
    --global-batch-size 256 \
    --num-rollout 3000 \
    --save-interval 100 \
    --eval-interval 50 \
    ${MODEL_ARGS[@]}

Step 4: Monitor Training

[ ] Check TensorBoard: tensorboard --logdir outputs/
[ ] Verify reward curves are increasing
[ ] Monitor GPU utilization across nodes

---

Workflow 2: Asynchronous Training

Use async mode for higher throughput by overlapping rollout and training.

When to Use Async

Large models with long generation times
High GPU idle time in synchronous mode
Sufficient memory for buffering

Launch Async Training

python train_async.py \
    --actor-num-nodes 1 \
    --actor-num-gpus-per-node 8 \
    --rollout-num-gpus 8 \
    --advantage-estimator grpo \
    --async-buffer-size 4 \
    --prompt-data /path/to/train.jsonl \
    ${MODEL_ARGS[@]}

Async-Specific Parameters

--async-buffer-size 4        # Number of rollouts to buffer
--update-weights-interval 2  # Sync weights every N rollouts
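
The interaction of the two parameters can be pictured with a bounded queue. The sketch below is a generic asyncio producer/consumer, not slime's implementation: `rollout_worker`, `trainer`, and `main` are hypothetical names, and the reading of the two flags (queue capacity and sync cadence) is taken from the comments above.

```python
import asyncio


async def rollout_worker(queue: asyncio.Queue, num_rollout: int) -> None:
    for rollout_id in range(num_rollout):
        await asyncio.sleep(0)        # stand-in for generation time
        await queue.put(rollout_id)   # blocks while the buffer is full
    await queue.put(None)             # sentinel: no more rollouts


async def trainer(queue: asyncio.Queue, update_weights_interval: int):
    trained, synced_after = [], []
    while True:
        rollout_id = await queue.get()
        if rollout_id is None:
            break
        trained.append(rollout_id)    # stand-in for a training step
        if len(trained) % update_weights_interval == 0:
            synced_after.append(rollout_id)  # stand-in for a weight sync
    return trained, synced_after


async def main(num_rollout: int = 6, buffer_size: int = 4, update_weights_interval: int = 2):
    queue: asyncio.Queue = asyncio.Queue(maxsize=buffer_size)
    producer = asyncio.create_task(rollout_worker(queue, num_rollout))
    result = await trainer(queue, update_weights_interval)
    await producer
    return result


trained, synced_after = asyncio.run(main())
print(trained, synced_after)
```

A larger buffer lets generation run further ahead of training, at the cost of training on samples produced by older weights; a smaller sync interval has the opposite effect.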

---

Workflow 3: Multi-Turn Agentic Training

Use this workflow for training agents with tool use or multi-step reasoning.

Prerequisites

[ ] Custom generate function for multi-turn logic
[ ] Tool/environment interface

Step 1: Define Custom Generate Function

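# custom_generate.py
# generate_single, extract_tool_call, execute_tool, and compute_reward are
# user-supplied helpers; they are not part of slime.
async def custom_generate(args, samples, evaluation=False):
    """Multi-turn generation with tool calling."""
    for sample in samples:
        # Copy the chat history so appends do not mutate sample.prompt
        if isinstance(sample.prompt, list):
            conversation = list(sample.prompt)
        else:
            conversation = [{"role": "user", "content": sample.prompt}]

        response = ""  # Stays empty if args.max_turns is 0
        for turn in range(args.max_turns):
            # Generate response
            response = await generate_single(conversation)

            # Check for tool call
            tool_call = extract_tool_call(response)
            if not tool_call:
                break  # No tool call: this is the final response
            tool_result = execute_tool(tool_call)
            conversation.append({"role": "assistant", "content": response})
            conversation.append({"role": "tool", "content": tool_result})

        sample.response = response
        sample.reward = compute_reward(sample)

    return samples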

Step 2: Launch with Custom Function

python train.py \
    --custom-generate-function-path custom_generate.py \
    --max-turns 5 \
    --prompt-data /path/to/agent_data.jsonl \
    ${MODEL_ARGS[@]}

See examples/search-r1/ for a complete multi-turn search example.
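
The custom generate function relies on helpers such as `extract_tool_call` and `execute_tool` that you supply yourself. A minimal sketch of what they might look like follows. The `<tool_call>...</tool_call>` wire format, the JSON shape, and the one-entry tool registry are all assumptions made for this example; match them to whatever your model and environment actually emit.

```python
import json
import re
from typing import Optional

# Assumed wire format: <tool_call>{"name": ..., "arguments": {...}}</tool_call>
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def extract_tool_call(response: str) -> Optional[dict]:
    """Return the first well-formed tool call in the response, or None."""
    match = TOOL_CALL_RE.search(response)
    if match is None:
        return None
    try:
        call = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None  # malformed JSON is treated as "no tool call"
    if not isinstance(call, dict) or "name" not in call:
        return None
    return call


def execute_tool(tool_call: dict) -> str:
    """Dispatch to a tiny hypothetical tool registry and return a string result."""
    tools = {"add": lambda a, b: a + b}
    func = tools.get(tool_call["name"])
    if func is None:
        return f"error: unknown tool {tool_call['name']!r}"
    return str(func(**tool_call.get("arguments", {})))


call = extract_tool_call(
    'Let me compute. <tool_call>{"name": "add", "arguments": {"a": 15, "b": 27}}</tool_call>'
)
print(call, execute_tool(call))
print(extract_tool_call("The answer is 42."))
```

Returning `None` for malformed calls ends the turn loop instead of raising, which keeps one bad generation from aborting the whole rollout batch.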

---

Configuration Reference

Three Argument Categories

slime uses three types of arguments:

1. Megatron Arguments (passed directly):

--tensor-model-parallel-size 2
--pipeline-model-parallel-size 1
--num-layers 32
--hidden-size 4096

2. SGLang Arguments (prefixed with --sglang-):

--sglang-mem-fraction-static 0.8
--sglang-context-length 8192
--sglang-log-level INFO

3. slime Arguments:

# Resource allocation
--actor-num-nodes 1
--actor-num-gpus-per-node 8
--rollout-num-gpus 8
--colocate  # Share GPUs between training/inference

# Data
--prompt-data /path/to/data.jsonl
--input-key prompt
--label-key label

# Training loop
--num-rollout 3000
--rollout-batch-size 32
--n-samples-per-prompt 8
--global-batch-size 256

# Algorithm
--advantage-estimator grpo  # or: gspo, ppo, reinforce_plus_plus
--use-kl-loss
--kl-loss-coef 0.001

Key Constraints

rollout_batch_size × n_samples_per_prompt = global_batch_size × num_steps_per_rollout

Example: 32 × 8 = 256 × 1
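
Checking this constraint before launching a job saves a failed start. The helper names below are hypothetical; the arithmetic is exactly the equation above.

```python
def check_batch_constraint(rollout_batch_size: int, n_samples_per_prompt: int,
                           global_batch_size: int, num_steps_per_rollout: int = 1) -> None:
    """Raise ValueError if generated and consumed sample counts differ."""
    generated = rollout_batch_size * n_samples_per_prompt
    consumed = global_batch_size * num_steps_per_rollout
    if generated != consumed:
        raise ValueError(
            f"{rollout_batch_size} x {n_samples_per_prompt} = {generated} samples generated, "
            f"but {global_batch_size} x {num_steps_per_rollout} = {consumed} consumed"
        )


def steps_per_rollout(rollout_batch_size: int, n_samples_per_prompt: int,
                      global_batch_size: int) -> int:
    """Solve the constraint for num_steps_per_rollout."""
    generated = rollout_batch_size * n_samples_per_prompt
    if generated % global_batch_size != 0:
        raise ValueError(f"global_batch_size {global_batch_size} does not divide {generated}")
    return generated // global_batch_size


check_batch_constraint(32, 8, 256, 1)   # the example above: passes silently
print(steps_per_rollout(32, 8, 128))    # halving the global batch doubles the steps
```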

---

Data Buffer System

slime's data buffer enables flexible data management:

Basic Data Source

class RolloutDataSource:
    def __init__(self, dataset, args):
        self.dataset = dataset
        self.args = args

    def get_samples(self, num_samples):
        """Fetch prompts from dataset."""
        return self.dataset.sample(num_samples)

    def add_samples(self, samples):
        """Called after generation (no-op by default)."""
        pass

Buffered Data Source (Off-Policy)

class RolloutDataSourceWithBuffer(RolloutDataSource):
    def __init__(self, dataset, args):
        super().__init__(dataset, args)
        self.buffer = []

    def add_samples(self, samples):
        """Store generated samples for reuse."""
        self.buffer.extend(samples)

    def buffer_filter(self, args, buffer, num_samples):
        """Custom selection logic (prioritized, stratified, etc.)."""
        # select_best is a user-defined helper, not a slime function
        return select_best(buffer, num_samples)
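
One possible shape for a `select_best` helper is sketched below. It is hypothetical: it assumes each buffered item has a numeric `reward` attribute, skips unscored items, and removes what it returns so the same sample is not served twice. `SimpleNamespace` objects stand in for slime samples.

```python
from types import SimpleNamespace


def select_best(buffer: list, num_samples: int) -> list:
    """Pop up to num_samples of the highest-reward entries from the buffer."""
    scored = [s for s in buffer if isinstance(s.reward, (int, float))]
    scored.sort(key=lambda s: s.reward, reverse=True)
    chosen = scored[:num_samples]
    chosen_ids = {id(s) for s in chosen}
    # Remove the chosen samples in place so they are not served twice.
    buffer[:] = [s for s in buffer if id(s) not in chosen_ids]
    return chosen


buffer = [
    SimpleNamespace(reward=0.2),
    SimpleNamespace(reward=None),   # not scored yet
    SimpleNamespace(reward=0.9),
    SimpleNamespace(reward=0.5),
]
best = select_best(buffer, 2)
print([s.reward for s in best], [s.reward for s in buffer])
```

Purely greedy selection like this biases training toward already-successful samples; stratified or randomized selection is a reasonable alternative when diversity matters.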

---

Common Issues and Solutions

Issue: SGLang Engine Crash

Symptoms: Inference engine dies mid-training

Solutions:

# Enable fault tolerance
--use-fault-tolerance

# Increase memory allocation
--sglang-mem-fraction-static 0.85

# Reduce batch size (16 × 8 = 128 samples per rollout, so --global-batch-size must divide 128)
--rollout-batch-size 16

Issue: Weight Sync Timeout

Symptoms: Training hangs after rollout

Solutions:

# Increase sync interval (async training only)
--update-weights-interval 5

# Use colocated mode (no network transfer; not supported with train_async.py)
--colocate

Issue: OOM During Training

Symptoms: CUDA OOM in backward pass

Solutions:

# Enable gradient checkpointing
--recompute-activations

# Reduce micro-batch size
--micro-batch-size 1

# Enable sequence parallelism
--sequence-parallel

Issue: Slow Data Loading

Symptoms: GPU idle during data fetch

Solutions:

# Increase data workers
--num-data-workers 4

# Use streaming dataset
--streaming-data

---

Supported Models

Model Family	Configurations
GLM	GLM-4.5, GLM-4.6, GLM-4.7, GLM-Z1-9B
Qwen	Qwen3 (4B, 8B, 30B-A3B), Qwen3-MoE, Qwen2.5
DeepSeek	V3, V3.1, R1
Llama	Llama 3 (8B, 70B)
Others	Kimi K2, Moonlight-16B

Each model has pre-configured scripts in scripts/models/.

---

Advanced Topics

Co-location Mode

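Share GPUs between training and inference to reduce the number of GPUs required. Both workloads then compete for the same device memory, so lower the SGLang memory fraction: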

python train.py \
    --colocate \
    --actor-num-gpus-per-node 8 \
    --sglang-mem-fraction-static 0.4 \
    ${MODEL_ARGS[@]}

Custom Reward Model

# custom_rm.py
# load_model and tokenize stand for your own model-loading and
# tokenization code; they are not slime functions.
class CustomRewardModel:
    def __init__(self, model_path):
        self.model = load_model(model_path)

    def compute_reward(self, prompts, responses):
        inputs = self.tokenize(prompts, responses)
        scores = self.model(inputs)
        return scores.tolist()

Usage:

--custom-rm-path custom_rm.py

Multi-Task Evaluation

--eval-prompt-data aime /path/to/aime.jsonl \
--eval-prompt-data gsm8k /path/to/gsm8k.jsonl \
--n-samples-per-eval-prompt 16

---

Resources

Documentation: https://thudm.github.io/slime/
GitHub: https://github.com/THUDM/slime
Blog: https://lmsys.org/blog/2025-07-09-slime/
Examples: See examples/ directory for 14+ worked examples

slime API Reference

Architecture Overview

slime operates with a three-module architecture orchestrated by Ray:

┌─────────────────────────────────────────────────────────┐
│                    Data Buffer                          │
│ - Prompt initialization and management                  │
│ - Custom data generation and filtering                  │
│ - Rollout sample storage                                │
└─────────────┬───────────────────────────┬───────────────┘
              │                           │
┌─────────────▼───────────┐ ┌─────────────▼───────────────┐
│ Training (Megatron-LM)  │ │ Rollout (SGLang + Router)   │
│ - Actor model training  │ │ - Response generation       │
│ - Critic (optional)     │ │ - Reward/verifier output    │
│ - Weight sync to rollout│ │ - Multi-turn support        │
└─────────────────────────┘ └─────────────────────────────┘

Core Data Structures

Sample Object

The Sample object is the core data structure defined in slime/utils/types.py:

# Status is the enum shown in the next section
from dataclasses import dataclass, field
from typing import Any, Optional

@dataclass
class Sample:
    # Core fields
    group_index: Optional[int] = None       # Group index for batching
    index: Optional[int] = None             # Sample index
    prompt: str | list[dict] = ""           # Input prompt or chat history
    tokens: list[int] = field(default_factory=list)  # Token IDs
    response: str = ""                      # Generated response
    response_length: int = 0                # Response length in tokens
    label: Optional[str] = None             # Ground truth label
    reward: Optional[float | dict] = None   # RL reward signal
    loss_mask: Optional[list[int]] = None   # 1=compute loss, 0=mask
    status: Status = Status.PENDING         # Sample status
    metadata: dict = field(default_factory=dict)  # Custom data

    # Multimodal support
    multimodal_inputs: Optional[Any] = None       # Raw multimodal data (images, videos)
    multimodal_train_inputs: Optional[Any] = None # Processed multimodal data (pixel_values)

    # Rollout tracking
    weight_versions: list[str] = field(default_factory=list)
    rollout_log_probs: Optional[list[float]] = None    # Log probs from SGLang
    rollout_routed_experts: Optional[list[list[int]]] = None  # Expert routing (MoE)

    # Control fields
    remove_sample: bool = False
    generate_function_path: Optional[str] = None
    train_metadata: Optional[dict] = None
    non_generation_time: float = 0.0

    # Speculative decoding info (nested dataclass)
    @dataclass
    class SpecInfo:
        spec_accept_token_num: int = 0
        spec_draft_token_num: int = 0
        spec_verify_ct: int = 0
        completion_token_num: int = 0

Status Enum

from enum import Enum

class Status(Enum):
    PENDING = "pending"           # Not yet processed
    COMPLETED = "completed"       # Successfully generated
    TRUNCATED = "truncated"       # Hit max length
    ABORTED = "aborted"           # Generation was cancelled before finishing
    FAILED = "failed"             # Generation ended with an error
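
A short sketch of how the status field might be used to decide which samples reach the trainer. The `Status` class here is a local copy so the example is self-contained, `SimpleNamespace` objects stand in for samples, and the choice of which statuses count as trainable is an assumption for this example rather than slime's policy.

```python
from collections import Counter
from enum import Enum
from types import SimpleNamespace


class Status(Enum):  # local copy of the enum above so the sketch is self-contained
    PENDING = "pending"
    COMPLETED = "completed"
    TRUNCATED = "truncated"
    ABORTED = "aborted"
    FAILED = "failed"


# Assumption: only finished generations (complete or length-truncated) are trained on.
TRAINABLE = {Status.COMPLETED, Status.TRUNCATED}


def split_by_status(samples):
    """Return (trainable_samples, per-status counts)."""
    trainable = [s for s in samples if s.status in TRAINABLE]
    counts = Counter(s.status.value for s in samples)
    return trainable, counts


samples = [
    SimpleNamespace(index=0, status=Status.COMPLETED),
    SimpleNamespace(index=1, status=Status.ABORTED),
    SimpleNamespace(index=2, status=Status.TRUNCATED),
    SimpleNamespace(index=3, status=Status.FAILED),
]
trainable, counts = split_by_status(samples)
print([s.index for s in trainable], dict(counts))
```

Logging the per-status counts each rollout is a cheap way to notice a rising truncation or failure rate before it shows up in the reward curve.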

Configuration System

slime uses three categories of command-line arguments:

1. Megatron Arguments

All Megatron-LM arguments are supported directly:

--tensor-model-parallel-size 2
--pipeline-model-parallel-size 1
--num-layers 32
--hidden-size 4096
--num-attention-heads 32
--seq-length 4096
--micro-batch-size 1
--global-batch-size 256

2. SGLang Arguments

SGLang arguments are prefixed with --sglang-:

--sglang-mem-fraction-static 0.8   # Fraction of GPU memory for model weights and the KV cache pool
--sglang-context-length 8192       # Maximum context length
--sglang-log-level INFO            # Logging verbosity
--sglang-tp-size 2                 # Tensor parallelism
--sglang-disable-cuda-graph        # Disable CUDA graphs

3. slime-Specific Arguments

Defined in slime/utils/arguments.py:

# Resource Allocation
--actor-num-nodes 1                # Training nodes
--actor-num-gpus-per-node 8        # GPUs per training node
--rollout-num-gpus 8               # Total rollout GPUs
--rollout-num-gpus-per-engine 2    # GPUs per SGLang engine
--colocate                         # Share GPUs for train/inference

# Data Configuration
--prompt-data /path/to/data.jsonl  # Training data path
--input-key prompt                 # Key for prompts in JSON
--label-key label                  # Key for labels in JSON
--apply-chat-template              # Apply chat formatting

# Training Loop
--num-rollout 3000                 # Total rollout iterations
--rollout-batch-size 32            # Prompts per rollout
--n-samples-per-prompt 8           # Responses per prompt
--global-batch-size 256            # Training batch size
--num-steps-per-rollout 1          # Training steps per rollout

# RL Algorithm
--advantage-estimator grpo         # grpo, gspo, ppo, reinforce_plus_plus
--use-kl-loss                      # Enable KL loss
--kl-loss-coef 0.001               # KL coefficient
--calculate-per-token-loss         # Token-level loss

# Off-Policy Options
--use-tis                          # Truncated Importance Sampling
--tis-threshold 0.9                # TIS threshold
--true-on-policy-mode              # Force on-policy training
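
As background for the off-policy options, truncated importance sampling in its generic form weights each token by the ratio between the training policy's probability and the rollout engine's probability, capped from above. The sketch below shows only that generic idea. The function name and the `cap` parameter are assumptions, and how `--tis-threshold` maps onto the cap inside slime is not specified here.

```python
import math


def truncated_is_weights(train_log_probs: list[float],
                         rollout_log_probs: list[float],
                         cap: float) -> list[float]:
    """Per-token ratios pi_train / pi_rollout, truncated from above at cap."""
    if len(train_log_probs) != len(rollout_log_probs):
        raise ValueError("log-prob sequences must have the same length")
    return [
        min(math.exp(t - r), cap)
        for t, r in zip(train_log_probs, rollout_log_probs)
    ]


weights = truncated_is_weights(
    train_log_probs=[-1.0, -0.5, -2.0],
    rollout_log_probs=[-1.0, -1.5, -1.0],
    cap=2.0,
)
print([round(w, 4) for w in weights])
```

The rollout-side log probabilities correspond to the `rollout_log_probs` field on the Sample object; the cap bounds the variance that large ratios would otherwise introduce.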

Data Buffer System

RolloutDataSource (Base Class)

# Simplified sketch of the base class
from slime.utils.types import Sample

class RolloutDataSource:
    def __init__(self, dataset, args):
        self.dataset = dataset
        self.args = args

    def get_samples(self, num_samples: int) -> list[Sample]:
        """Fetch prompts from dataset."""
        return [Sample(prompt=p) for p in self.dataset.sample(num_samples)]

    def add_samples(self, samples: list[Sample]) -> None:
        """Called after generation (no-op by default)."""
        pass

Buffered Data Source (Off-Policy)

# Simplified sketch; builds on the RolloutDataSource base class above

class RolloutDataSourceWithBuffer(RolloutDataSource):
    def __init__(self, dataset, args):
        super().__init__(dataset, args)
        self.buffer = []

    def add_samples(self, samples: list[Sample]) -> None:
        """Store generated samples for reuse."""
        self.buffer.extend(samples)

    def buffer_filter(self, args, buffer, num_samples) -> list[Sample]:
        """Custom selection logic."""
        # Example: prioritized sampling based on reward (unscored samples are skipped)
        scored = [s for s in buffer if s.reward is not None]
        sorted_buffer = sorted(scored, key=lambda s: s.reward, reverse=True)
        return sorted_buffer[:num_samples]

Custom Functions

Custom Generate Function

For multi-turn or tool-calling scenarios:

# custom_generate.py
from slime.utils.types import Sample

async def custom_generate(args, samples: list[Sample], evaluation: bool = False) -> list[Sample]:
    """
    Custom generation function for multi-turn interactions.

    Args:
        args: Training arguments
        samples: List of Sample objects with prompts
        evaluation: Whether this is an evaluation run

    Returns:
        List of Sample objects with responses and rewards
    """
    for sample in samples:
        conversation = sample.prompt if isinstance(sample.prompt, list) else [
            {"role": "user", "content": sample.prompt}
        ]

        for turn in range(args.max_turns):
            # Generate response
            response = await generate_single(conversation)

            # Check for tool call
            tool_call = extract_tool_call(response)
            if tool_call:
                # Execute tool
                tool_result = await execute_tool(tool_call)
                conversation.append({"role": "assistant", "content": response})
                conversation.append({"role": "tool", "content": tool_result})
            else:
                # Final response
                sample.response = response
                break

        # Compute reward
        sample.reward = compute_reward(sample)

        # Set loss mask (1 for model tokens, 0 for tool responses)
        sample.loss_mask = build_loss_mask(sample)

    return samples

Usage:

python train.py \
    --custom-generate-function-path custom_generate.py \
    --max-turns 5

Custom Reward Function

# custom_rm.py
from slime.utils.types import Sample

async def reward_func(args, sample: Sample, **kwargs) -> float:
    """
    Compute reward for a single sample.

    Args:
        args: Training arguments
        sample: Sample object with response

    Returns:
        Reward score (float)
    """
    response = sample.response
    ground_truth = sample.label or sample.metadata.get("answer", "")

    # Example: exact match reward
    if response.strip() == ground_truth.strip():
        return 1.0
    return 0.0

# For batched processing (more efficient)
import asyncio

async def batched_custom_rm(args, samples: list[Sample]) -> list[float]:
    """Batch reward computation: score all samples concurrently."""
    return list(await asyncio.gather(*(reward_func(args, s) for s in samples)))

Usage:

python train.py \
    --custom-rm-path custom_rm.py \
    --group-rm  # Enable batched processing
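
Exact string matching is brittle for math-style answers, where a correct response usually contains reasoning around the final value. A hedged sketch of a more tolerant scoring rule follows: it compares the last number in the response with the last number in the label. `last_number` and `numeric_match_reward` are hypothetical helpers, not slime functions, and the rule is only one of many reasonable choices.

```python
import re
from typing import Optional

NUMBER_RE = re.compile(r"-?\d+(?:\.\d+)?")


def last_number(text: str) -> Optional[float]:
    """Return the last number that appears in text, ignoring thousands separators."""
    matches = NUMBER_RE.findall(text.replace(",", ""))
    return float(matches[-1]) if matches else None


def numeric_match_reward(response: str, label: str, tol: float = 1e-6) -> float:
    """1.0 if the final numbers agree within tol, else 0.0."""
    predicted, expected = last_number(response), last_number(label)
    if predicted is None or expected is None:
        return 0.0
    return 1.0 if abs(predicted - expected) <= tol else 0.0


print(numeric_match_reward("15 + 27 = 42", "42"))
print(numeric_match_reward("The answer is 41.", "42"))
print(numeric_match_reward("Dividing both sides by 3 gives x = 4", "x = 4"))
```

Inside `reward_func` this would replace the exact-match comparison, for example `return numeric_match_reward(sample.response, sample.label or "")`.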

Model Configuration

Pre-configured Model Scripts

Located in scripts/models/:

# List available models
ls scripts/models/
# glm4-9B.sh, qwen3-4B.sh, qwen3-30B-A3B.sh, deepseek-v3.sh, llama3-8B.sh

# Source model configuration
source scripts/models/qwen3-4B.sh
# This sets MODEL_ARGS and CKPT_ARGS arrays

Example Model Script

# scripts/models/qwen3-4B.sh
MODEL_ARGS=(
    --num-layers 36
    --hidden-size 2560
    --num-attention-heads 20
    --num-query-groups 4
    --ffn-hidden-size 6912
    --max-position-embeddings 32768
    --rotary-percent 1.0
    --rotary-base 1000000
    --swiglu
    --untie-embeddings-and-output-weights
    --no-position-embedding
    --normalization RMSNorm
    --tokenizer-type HuggingFaceTokenizer
    --bf16
)

CKPT_ARGS=(
    --hf-checkpoint /path/to/qwen3-4b-hf
    --initial-megatron-checkpoint /path/to/megatron/ckpt
)
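
When writing or reviewing a model script, it can help to derive the architecture flags from the checkpoint's Hugging Face `config.json` rather than typing them by hand. The sketch below maps common Llama/Qwen-style config keys to the Megatron flags used above. The mapping table and function name are assumptions for illustration, and the toy config is made up; it does not describe any real model.

```python
# Assumed correspondence between Hugging Face config keys and Megatron-LM flags.
HF_TO_MEGATRON = {
    "num_hidden_layers": "--num-layers",
    "hidden_size": "--hidden-size",
    "num_attention_heads": "--num-attention-heads",
    "num_key_value_heads": "--num-query-groups",
    "intermediate_size": "--ffn-hidden-size",
    "max_position_embeddings": "--max-position-embeddings",
}


def megatron_flags(hf_config: dict) -> list[str]:
    """Translate the known keys of a Hugging Face config into a flat flag list."""
    flags: list[str] = []
    for key, flag in HF_TO_MEGATRON.items():
        if key in hf_config:
            flags += [flag, str(hf_config[key])]
    return flags


# Made-up toy config, not a real model.
toy_config = {
    "num_hidden_layers": 4,
    "hidden_size": 256,
    "num_attention_heads": 8,
    "num_key_value_heads": 2,
    "intermediate_size": 1024,
    "vocab_size": 1000,
}
flags = megatron_flags(toy_config)
print(" ".join(flags))
```

In practice you would load the dictionary with `json.load(open("config.json"))` from the checkpoint directory and compare the result against the values in the sourced script.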

Async Training

Enabling Async Mode

python train_async.py \
    --actor-num-gpus-per-node 8 \
    --rollout-num-gpus 8 \
    --async-buffer-size 4 \
    --update-weights-interval 2 \
    ${MODEL_ARGS[@]}

Async-Specific Parameters

--async-buffer-size 4            # Number of rollouts to buffer
--update-weights-interval 2      # Sync weights every N rollouts

Note: Colocated mode (--colocate) is NOT supported with async training.

Evaluation

Multi-Task Evaluation

--eval-prompt-data aime /path/to/aime.jsonl \
--eval-prompt-data gsm8k /path/to/gsm8k.jsonl \
--n-samples-per-eval-prompt 16 \
--eval-interval 50

Evaluation Configuration

--eval-interval 50               # Evaluate every N rollouts
--n-samples-per-eval-prompt 16   # Samples for evaluation
--eval-temperature 0.0           # Greedy decoding for eval; use a temperature above 0 if the samples per prompt should differ

slime Troubleshooting Guide

Common Issues and Solutions

SGLang Issues

Issue: SGLang Engine Crash

Symptoms: Inference engine dies mid-training, connection errors

Solutions:

1. Enable fault tolerance:

--use-fault-tolerance

2. Increase memory allocation:

--sglang-mem-fraction-static 0.85  # Increase from 0.8

3. Reduce batch size:

--rollout-batch-size 16  # Reduce from 32; 16 × 8 = 128 samples, so --global-batch-size must divide 128

4. Disable CUDA graphs (for debugging):

--sglang-disable-cuda-graph

Issue: SGLang Router Load Imbalance

Symptoms: Some SGLang engines overloaded while others idle

Solutions:

1. Adjust routing strategy:

--sglang-router-strategy round_robin

2. Increase number of engines:

--rollout-num-gpus-per-engine 1  # More engines, fewer GPUs each

Weight Synchronization Issues

Issue: Weight Sync Timeout

Symptoms: Training hangs after rollout, timeout errors

Solutions:

1. Increase sync interval (async mode):

--update-weights-interval 5  # Increase from 2

2. Use colocated mode (eliminates network transfer; synchronous training only):

--colocate

3. Check network bandwidth:

# Verify InfiniBand is enabled
ibstat

Issue: Weight Sync Failures in Multi-Node

Symptoms: Nodes fail to receive updated weights

Solutions:

1. Set NCCL environment:

export NCCL_DEBUG=INFO
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_DISABLE=0

2. Increase timeout:

export NCCL_TIMEOUT=1800

Memory Issues

Issue: OOM During Training

Symptoms: CUDA OOM in backward pass

Solutions:

1. Enable gradient checkpointing:

--recompute-activations

2. Reduce micro-batch size:

--micro-batch-size 1

3. Enable sequence parallelism:

--sequence-parallel

4. Reduce global batch size:

--global-batch-size 128  # Reduce from 256; with 32 × 8 = 256 samples per rollout, also set --num-steps-per-rollout 2

Issue: OOM in Colocated Mode

Symptoms: OOM when both training and inference run on same GPUs

Solutions:

1. Reduce SGLang memory:

--sglang-mem-fraction-static 0.4  # Reduce from 0.8

2. Enable offloading:

--offload-optimizer-states

3. Use smaller sequence length:

--seq-length 2048  # Reduce from 4096

Data Loading Issues

Issue: Slow Data Loading

Symptoms: GPU idle during data fetch, low GPU utilization

Solutions:

1. Increase data workers:

--num-data-workers 4

2. Use streaming dataset:

--streaming-data

3. Pre-tokenize data:

# Pre-process data offline
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("model_path")
# Save tokenized data

Issue: Data Format Errors

Symptoms: KeyError, missing fields, parsing failures

Solutions:

1. Verify data format:

import json
with open("data.jsonl") as f:
    for line in f:
        data = json.loads(line)
        assert "prompt" in data, "Missing prompt field"
        assert "label" in data, "Missing label field"

2. Check key names:

--input-key prompt  # Must match your data
--label-key label   # Must match your data

Training Stability Issues

Issue: Loss Explosion / NaN

Symptoms: Loss becomes NaN or explodes

Solutions:

1. Reduce learning rate:

--lr 1e-6  # Reduce from 5e-6

2. Enable gradient clipping:

--clip-grad 1.0

3. Check for data issues:

# Verify no empty prompts or responses
for sample in dataset:
    assert len(sample["prompt"]) > 0

4. Use BF16 instead of FP16:

--bf16  # More numerically stable

Issue: Reward Collapse

Symptoms: Reward drops to zero, model outputs garbage

Solutions:

1. Increase KL penalty:

--kl-loss-coef 0.01  # Increase from 0.001

2. Reduce number of samples:

--n-samples-per-prompt 4  # Reduce from 8; 32 × 4 = 128 samples, so adjust --global-batch-size to match

3. Verify reward function:

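# Test reward function independently
import asyncio
from argparse import Namespace
from slime.utils.types import Sample
from custom_rm import reward_func

sample = Sample(prompt="test", response="test response", label="test response")
reward = asyncio.run(reward_func(Namespace(), sample))  # reward_func is async
print(f"Reward: {reward}")  # Should be reasonable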

Async Training Issues

Issue: Async Training Not Supported with Colocate

Symptoms: Error when using --colocate with train_async.py

Solution: Colocated mode is NOT supported for async training. Use separate GPUs:

# Remove --colocate flag
python train_async.py \
    --actor-num-gpus-per-node 4 \
    --rollout-num-gpus 4

Issue: Stale Weights in Async Mode

Symptoms: Policy divergence, inconsistent behavior

Solutions:

1. Reduce async buffer size:

--async-buffer-size 2  # Reduce from 4

2. Increase weight update frequency:

--update-weights-interval 1  # Sync every rollout

Multi-Turn Training Issues

Issue: Tool Responses Included in Loss

Symptoms: Model learns to output tool responses verbatim

Solution: Properly set loss mask in custom generate function:

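def build_loss_mask(sample):
    """Create loss mask that excludes tool responses."""
    # is_tool_response is a user-supplied helper that decides, from the
    # metadata recorded during generation, whether the token at a given
    # position came from a tool.
    mask = []
    for position in range(len(sample.tokens)):
        if is_tool_response(position, sample.metadata):
            mask.append(0)  # Don't compute loss
        else:
            mask.append(1)  # Compute loss
    return mask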
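
An alternative that avoids a per-token lookup is to record, during generation, which role produced each run of tokens and build the mask from those segments. This is a sketch under that assumption: the `(role, token_ids)` segment list is a hypothetical bookkeeping structure you would maintain in your own generate function, not a slime field.

```python
def build_loss_mask_from_segments(segments: list[tuple[str, list[int]]]) -> list[int]:
    """segments: (role, token_ids) pairs in generation order, after the prompt.

    Model-generated tokens get 1 (trained on); everything else gets 0.
    """
    mask: list[int] = []
    for role, token_ids in segments:
        value = 1 if role == "assistant" else 0
        mask.extend([value] * len(token_ids))
    return mask


segments = [
    ("assistant", [11, 12, 13]),  # model asks for a tool
    ("tool", [21, 22]),           # tool output, must not be trained on
    ("assistant", [31]),          # model's final answer
]
mask = build_loss_mask_from_segments(segments)
print(mask)
```

The mask length always equals the total number of tokens in the segments, which makes length mismatches easy to assert on before the sample is returned.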

Issue: Multi-Turn Context Too Long

Symptoms: OOM or truncation in multi-turn conversations

Solutions:

1. Limit conversation history:

# In custom generate function
conversation = sample.prompt[-10:]  # Keep the last 10 messages (this also drops an earlier system message)

2. Increase context length:

--sglang-context-length 16384
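
Plain slicing of the history discards a leading system message along with old turns. A small sketch of a trimming helper that keeps it is shown below; `trim_conversation` is a hypothetical name, and it assumes the chat-message format (`role` and `content` keys) used in the data examples.

```python
def trim_conversation(conversation: list[dict], max_messages: int) -> list[dict]:
    """Keep a leading system message (if any) plus the last max_messages messages."""
    if max_messages < 1:
        raise ValueError("max_messages must be at least 1")
    system = [m for m in conversation[:1] if m.get("role") == "system"]
    rest = conversation[len(system):]
    return system + rest[-max_messages:]


history = [
    {"role": "system", "content": "You are a math tutor."},
    {"role": "user", "content": "q1"},
    {"role": "assistant", "content": "a1"},
    {"role": "user", "content": "q2"},
    {"role": "assistant", "content": "a2"},
]
trimmed = trim_conversation(history, max_messages=2)
print([m["content"] for m in trimmed])
```

The function returns a new list, so the original `sample.prompt` is left untouched.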

Checkpoint Issues

Issue: Checkpoint Loading Fails

Symptoms: Cannot load saved checkpoint

Solutions:

1. Verify checkpoint path:

ls -la /path/to/checkpoint/

2. Check parallelism matches:

# Checkpoint was saved with TP=2, must load with TP=2
--tensor-model-parallel-size 2

3. Convert HuggingFace to Megatron (if needed):

python tools/convert_hf_to_torch_dist.py \
    ${MODEL_ARGS[@]} \
    --hf-checkpoint /path/to/hf/model \
    --save /path/to/megatron/checkpoint

Debugging Tips

Enable Verbose Logging

--log-level DEBUG
export SLIME_DEBUG=1

Check GPU Utilization

watch -n 1 nvidia-smi

Monitor Training

tensorboard --logdir outputs/

Test Custom Functions Independently

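# Test reward function
import asyncio
from argparse import Namespace
from slime.utils.types import Sample
from custom_rm import reward_func

async def test():
    args = Namespace()  # The example reward_func does not read any args fields
    sample = Sample(prompt="test", response="test", label="expected")
    reward = await reward_func(args, sample)
    print(f"Reward: {reward}")

asyncio.run(test())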

Constraint Reference

Key constraint to remember:

rollout_batch_size × n_samples_per_prompt = global_batch_size × num_steps_per_rollout

Example: 32 × 8 = 256 × 1

Resources

GitHub Issues: https://github.com/THUDM/slime/issues
Documentation: https://thudm.github.io/slime/
Examples: examples/ directory

How it compares

Use slime-rl-training for Ray-orchestrated agent RL with Megatron and SGLang; use peft-fine-tuning for supervised adapter tuning without rollout-based RL loops.

FAQ

What three modules does slime RL training coordinate?

slime-rl-training orchestrates a Data Buffer for prompts and rollout samples, a Megatron-LM Training module for actor updates, and an SGLang Rollout module with a router for response generation, all connected by Ray.

What does the slime Data Buffer manage during RL?

slime-rl-training uses the Data Buffer for prompt initialization, custom data generation and filtering, and storing rollout samples fed between Megatron-LM training and SGLang generation workers.

Is Slime Rl Training safe to install?

skills.sh reports 3 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.
