Miles Rl Training

Name: Miles Rl Training
Author: orchestra-research

orchestra-research/ai-research-skills

397 installs
11.2k repo stars
Updated June 16, 2026

miles-rl-training is an agent skill that configures and runs large-scale GRPO and MoE reinforcement-learning jobs on miles—with FP8 training, routing replay, and speculative rollout—for developers operating slime-based R

About

miles-rl-training is an enterprise RL configuration skill from orchestra-research/ai-research-skills for the miles framework, which is built on slime. It documents unified FP8 training and inference, INT4 quantization-aware training, Rollout Routing Replay (R3), and speculative RL training. These features sit on top of slime's configuration system and its Sample dataclass, whose `rollout_routed_experts` field carries the MoE routing used for replay. Developers use the skill when launching jobs with the GRPO advantage estimator on models such as qwen3-30b-a3b from Hugging Face checkpoints. The quick-start CLI example, `python train.py --advantage-estimator grpo --model-name qwen3-30b-a3b`, makes the skill a reference for miles-specific flags beyond the base slime arguments.

Documents miles as an enterprise RL layer on slime with unified FP8 training and inference
Covers MoE-oriented features: expert parallelism, rollout routing replay (R3), and speculative RL
Inherits slime’s Megatron, SGLang (`--sglang-`), and slime-specific CLI argument families
Includes a GRPO quick-start example with `--model-name`, HF checkpoint path, and rollout batch sizing
Lists verified SGLang speculative-decoding flags such as EAGLE, step count, and eagle top-k

Miles Rl Training by the numbers

397 all-time installs (skills.sh)
+35 installs in the week ending Jul 18, 2026 (Skillselion tracking)
Ranked #503 of 2,066 Data Science & ML skills by installs in the Skillselion catalog
Security screen: HIGH risk (skills.sh audit)
Data as of Jul 28, 2026 (Skillselion catalog sync)

npx skills add https://github.com/orchestra-research/ai-research-skills --skill miles-rl-training

Installs	397
repo stars	★ 11.2k
Security audit	2 / 3 scanners passed
Last updated	June 16, 2026
Repository	orchestra-research/ai-research-skills ↗

How do you run GRPO RL training on MoE models?

Configure and run large-scale GRPO/MoE reinforcement-learning training jobs on top of slime with miles-specific FP8, routing replay, and speculative rollout options.

Who is it for?

Distributed ML engineers running large MoE RL experiments who already use slime and need miles-specific FP8 and routing replay options.

Skip if: you are running small-scale SFT or DPO fine-tuning jobs that do not require distributed GRPO, MoE routing replay, or FP8 RL infrastructure.

When should I use this skill?

Use it when an agent must configure miles GRPO/MoE RL training with FP8, R3 routing replay, or speculative rollout on slime.

What you get

Configured miles training job, MoE routing-replay samples, and FP8-enabled GRPO checkpoint outputs.

GRPO training configuration
MoE RL checkpoint

By the numbers

Documents 4 miles extensions: FP8 training/inference, INT4 QAT, R3 routing replay, and speculative RL

Files


miles: Enterprise-Grade RL for Large-Scale Model Training

miles is a high-performance, enterprise-ready RL framework optimized for large-scale model post-training. Built as a production fork of slime, it addresses critical challenges in MoE training stability, low-precision training, and train-inference alignment.

When to Use miles

Choose miles when you need:

Training 1TB+ MoE models (DeepSeek V3, Qwen3-MoE)
FP8 or INT4 quantization-aware training
Bit-wise identical train-inference alignment
Speculative RL for maximum throughput
Production stability with enterprise support

Consider alternatives when:

You want the research-grade original → use slime
You need flexible backend swapping → use verl
You want PyTorch-native abstractions → use torchforge

Key Features

Low-Precision Training

Unified FP8: End-to-end FP8 for both inference and training
INT4 QAT: Fits 1TB-class models in the VRAM of a single machine (e.g., H200)
Rollout Routing Replay (R3): Bit-wise expert alignment for MoE

Performance Optimizations

Speculative RL: 25%+ rollout speedup with online SFT draft models
Zero-Copy Weight Sync: CUDA IPC zero-copy mapping
Partial Rollout: Recycle half-finished trajectories

Train-Inference Alignment

TIS/MIS: Truncated/Masked Importance Sampling for off-policy correction
Kernel-level optimization: FlashAttention-3, DeepGEMM integration

Installation

# Recommended: Docker
docker pull radixark/miles:latest
docker run --rm --gpus all --ipc=host --shm-size=16g \
  -it radixark/miles:latest /bin/bash

# From source
git clone https://github.com/radixark/miles.git
cd miles
pip install -r requirements.txt
pip install -e .

Quick Start

miles inherits slime's configuration system. Basic training:

python train.py \
    --advantage-estimator grpo \
    --model-name qwen3-30b-a3b \
    --hf-checkpoint /path/to/qwen3-30b-a3b-hf \
    --rollout-batch-size 512 \
    --n-samples-per-prompt 8
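
To make the `--advantage-estimator grpo` and `--n-samples-per-prompt 8` flags more concrete, here is a minimal sketch of the group-relative advantage that GRPO is generally described as using: each response's reward is normalized against the other responses sampled for the same prompt. The function name `grpo_advantages`, the `eps` constant, and the use of standard-deviation normalization are assumptions for illustration; the source does not say exactly how miles computes it.

```python
from statistics import mean, pstdev
from typing import List


def grpo_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantages for the responses sampled from ONE prompt.

    Hypothetical helper: normalizes each reward by the group's mean and
    (population) standard deviation. `eps` avoids division by zero when
    every response in the group received the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, 8 sampled responses (matching --n-samples-per-prompt 8),
# scored with a made-up binary reward.
group = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
advantages = grpo_advantages(group)
print([round(a, 3) for a in advantages])
```

Responses that beat their group's average get a positive advantage and the rest get a negative one, which is why the number of samples per prompt matters: a group in which every response scores the same yields all-zero advantages and contributes no learning signal.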

---

Workflow 1: Large MoE Training

Use this workflow for training large MoE models like DeepSeek V3 or Qwen3-MoE.

Prerequisites Checklist

[ ] H100/H200 GPUs with FP8 support
[ ] MoE model (DeepSeek V3, Qwen3-MoE)
[ ] Docker environment with miles

Step 1: Environment Setup

# FP8 block scaling (recommended for stability)
export NVTE_FP8_BLOCK_SCALING_FP32_SCALES=1
export CUDA_DEVICE_MAX_CONNECTIONS=1

Step 2: Configure Training

python train.py \
    --actor-num-gpus-per-node 8 \
    --rollout-num-gpus 8 \
    --hf-checkpoint /path/to/deepseek-v3 \
    --advantage-estimator grpo \
    --tensor-model-parallel-size 8 \
    --expert-model-parallel-size 4 \
    --prompt-data /path/to/data.jsonl \
    --num-rollout 3000

Verification Checklist

[ ] Model loads without errors
[ ] Routing decisions are consistent
[ ] No NaN/Inf in loss values

---

Workflow 2: Speculative RL Training

Use this workflow for maximum rollout throughput with EAGLE speculative decoding.

How Speculative RL Works

1. Small draft model generates candidate tokens
2. Target model verifies in parallel
3. Draft model updated via online SFT to track policy
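
The draft-then-verify loop in the first two steps can be sketched with toy models. This is a simplified greedy variant written for illustration, not the EAGLE algorithm or SGLang's implementation: `target_model` and `draft_model` are hypothetical stand-ins that map a token context to the next token, and verification is written as a Python loop where a real engine would use one batched forward pass. Step 3 (online SFT of the draft) is not modeled.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def target_model(ctx: List[int]) -> int:
    # Toy "policy": a deterministic function of the last token.
    return (ctx[-1] * 3 + 1) % 7


def draft_model(ctx: List[int]) -> int:
    # Toy draft: agrees with the target except after token 5.
    return 3 if ctx[-1] == 5 else target_model(ctx)


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       max_new_tokens: int, num_draft_tokens: int = 4
                       ) -> Tuple[List[int], float]:
    tokens = list(prompt)
    accepted = proposed = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The draft model proposes a short run of candidate tokens.
        ctx, candidates = list(tokens), []
        for _ in range(num_draft_tokens):
            candidates.append(draft(ctx))
            ctx.append(candidates[-1])
        proposed += len(candidates)
        # 2. The target model checks candidates in order; the first mismatch
        #    is replaced by the target's own token and the rest are dropped.
        for cand in candidates:
            expected = target(tokens)
            tokens.append(expected)
            if cand != expected:
                break
            accepted += 1
            if len(tokens) - len(prompt) >= max_new_tokens:
                break
    return tokens, accepted / proposed


output, acceptance_rate = speculative_decode(target_model, draft_model, [1], 12)
print(output, acceptance_rate)
```

Because every emitted token is the one the target would have chosen, the output matches plain target-only decoding; the draft only changes how many target steps can be verified per round. A falling acceptance rate is the signal that the draft has drifted from the policy.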

Step 1: Enable Speculative Decoding

miles supports EAGLE speculative decoding via SGLang:

python train.py \
    --actor-num-gpus-per-node 8 \
    --hf-checkpoint /path/to/target-model \
    --sglang-speculative-algorithm EAGLE \
    --sglang-speculative-num-steps 3 \
    --sglang-speculative-eagle-topk 1 \
    --sglang-speculative-num-draft-tokens 4 \
    --sglang-speculative-draft-model-path /path/to/draft-model \
    --advantage-estimator grpo \
    --prompt-data /path/to/data.jsonl

Step 2: Enable Online MTP Training (Optional)

For online SFT of draft model during training:

--mtp-num-layers 1 \
--enable-mtp-training \
--mtp-loss-scaling-factor 0.2

Note: Online MTP training requires a torch dist checkpoint with MTP weights. Add --mtp-num-layers 1 during checkpoint conversion from HuggingFace.

Expected Speedup

Standard rollout: Baseline
Speculative RL: 25-40% faster rollout
With partial rollout: Additional 10-15% throughput

---

Configuration Reference

miles inherits all slime arguments. See slime API Reference for the complete list.

Cluster Resources (from slime)

--actor-num-nodes 1
--actor-num-gpus-per-node 8
--rollout-num-gpus 8
--rollout-num-gpus-per-engine 2
--colocate

Megatron Parallelism (from slime)

--tensor-model-parallel-size 8
--pipeline-model-parallel-size 2
--expert-model-parallel-size 4    # MoE expert parallelism
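
As a sanity check before launching, the sketch below computes the data-parallel size implied by a tensor/pipeline layout. It relies only on the general Megatron-style rule that the training world size must be a multiple of TP × PP; the function name is hypothetical, and constraints involving `--expert-model-parallel-size` are deliberately not modeled because the source does not spell them out.

```python
def data_parallel_size(num_nodes: int, gpus_per_node: int,
                       tensor_parallel: int, pipeline_parallel: int) -> int:
    """Return the data-parallel size for a TP x PP layout.

    Hypothetical helper. Raises ValueError when the training GPUs cannot be
    split evenly into model-parallel groups of size TP * PP.
    """
    world_size = num_nodes * gpus_per_node
    model_parallel = tensor_parallel * pipeline_parallel
    if world_size % model_parallel != 0:
        raise ValueError(
            f"{world_size} GPUs is not a multiple of TP*PP = {model_parallel}"
        )
    return world_size // model_parallel


# TP=8 and PP=2 need at least 16 training GPUs, e.g. 2 nodes x 8 GPUs.
print(data_parallel_size(num_nodes=2, gpus_per_node=8,
                         tensor_parallel=8, pipeline_parallel=2))
```

With the values listed in this reference, a single 8-GPU node is too small for TP 8 × PP 2, so either `--actor-num-nodes` goes up or one of the parallel sizes comes down.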

Speculative Decoding (miles-specific)

--sglang-speculative-algorithm EAGLE
--sglang-speculative-num-steps 3
--sglang-speculative-eagle-topk 1
--sglang-speculative-num-draft-tokens 4
--sglang-enable-draft-weights-cpu-backup
--sglang-speculative-draft-model-path /your/draft/model/path

Online MTP Training (miles-specific)

--mtp-num-layers 1
--enable-mtp-training
--mtp-loss-scaling-factor 0.2

---

Key Features (Conceptual)

The following features are documented in miles but specific CLI flags may vary. Consult the miles repository for latest configuration.

Unified FP8 Pipeline

End-to-end FP8 sampling and training that eliminates quantization-induced discrepancy causing RL collapse in MoE models.

Rollout Routing Replay (R3)

Records expert routing decisions during SGLang inference and replays them during Megatron training for bit-wise expert alignment.

How R3 Works:

1. During SGLang inference, expert routing decisions are recorded
2. Routing decisions stored in sample.rollout_routed_experts
3. During Megatron training, routing is replayed instead of recomputed
4. Ensures identical expert selection between train and inference
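
The record-and-replay idea can be illustrated with a toy top-k router. This is a sketch under stated assumptions, not miles code: `top_k_experts` and `route` are hypothetical helpers, the router logits are made up, and routing is shown per token for a single MoE layer, whereas the real `rollout_routed_experts` layout may differ.

```python
from typing import List, Optional


def top_k_experts(logits: List[float], k: int) -> List[int]:
    # Indices of the k largest router logits, returned in ascending order.
    ranked = sorted(range(len(logits)), key=lambda e: -logits[e])
    return sorted(ranked[:k])


def route(router_logits: List[List[float]], k: int,
          replay: Optional[List[List[int]]] = None) -> List[List[int]]:
    """Choose experts per token, or reuse a recorded choice when given."""
    if replay is not None:
        return [list(experts) for experts in replay]
    return [top_k_experts(logits, k) for logits in router_logits]


# Inference side: record which expert each of two tokens was sent to.
inference_logits = [[0.50, 0.49, 0.01], [0.10, 0.20, 0.70]]
rollout_routed_experts = route(inference_logits, k=1)

# Training side: tiny numeric differences flip the first token's top expert.
training_logits = [[0.49, 0.50, 0.01], [0.10, 0.20, 0.70]]
recomputed = route(training_logits, k=1)
replayed = route(training_logits, k=1, replay=rollout_routed_experts)

print("recorded:  ", rollout_routed_experts)
print("recomputed:", recomputed)
print("replayed:  ", replayed)
```

Recomputing the routing sends the first token to a different expert than the one that produced the rollout, while replaying reproduces the recorded choice, which is the alignment property R3 is described as providing.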

INT4 Quantization-Aware Training

Enables single-machine deployment of 1TB+ models (e.g., on H200).

Memory Savings with INT4:

Model Size	BF16 VRAM	INT4 VRAM	Reduction
70B	140GB	45GB	3.1x
235B	470GB	150GB	3.1x
671B	1.3TB	420GB	3.1x
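
The arithmetic behind the table can be checked with a short script. The BF16 column matches 2 bytes per parameter; the reduction column is recomputed from the table's own figures. The helper name is hypothetical, and 1.3TB is taken as 1300GB. An ideal 4-bit weight would need 0.5 bytes per parameter (35GB for 70B), so the gap to the listed INT4 figures presumably covers quantization scales and layers kept in higher precision; that explanation is an assumption, not something the source states.

```python
BYTES_PER_PARAM_BF16 = 2  # BF16 stores each weight in 16 bits


def bf16_weight_gb(params_billion: float) -> float:
    """Weight memory in GB for a model with `params_billion` x 1e9 parameters."""
    return params_billion * BYTES_PER_PARAM_BF16


# (parameters in billions, BF16 GB from the table, INT4 GB from the table)
table = [(70, 140, 45), (235, 470, 150), (671, 1300, 420)]

for params_b, bf16_gb, int4_gb in table:
    computed = bf16_weight_gb(params_b)
    reduction = bf16_gb / int4_gb
    print(f"{params_b}B: computed BF16 = {computed:.0f}GB, "
          f"table reduction = {reduction:.1f}x")
```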

Train-Inference Alignment

miles achieves "exactly 0 KL divergence" between training and inference through:

Flash Attention 3
DeepGEMM
Batch-invariant kernels from Thinking Machines Lab
torch.compile integration

---

Sample Data Structure

miles uses the same Sample dataclass as slime with the rollout_routed_experts field for MoE routing replay:

from dataclasses import dataclass

# Status is the sample-status type defined alongside Sample in slime.
@dataclass
class Sample:
    prompt: str | list[dict]
    tokens: list[int]
    response: str
    reward: float | dict
    loss_mask: list[int]
    status: Status
    metadata: dict
    rollout_log_probs: list[float]
    rollout_routed_experts: list[list[int]]  # MoE routing for R3

See slime API Reference for the complete Sample definition.

---

Common Issues and Solutions

Issue: FP8 Training Collapse

Symptoms: Loss explodes, NaN values

Solutions:

Use block scaling: export NVTE_FP8_BLOCK_SCALING_FP32_SCALES=1
Reduce learning rate: --lr 5e-7
Ensure MoE routing is consistent between train/inference

Issue: Speculative Draft Drift

Symptoms: Low acceptance rate over time

Solutions:

Enable online MTP training to keep draft model aligned
Reduce speculative steps: --sglang-speculative-num-steps 2
Use CPU backup: --sglang-enable-draft-weights-cpu-backup

Issue: Train-Inference Mismatch

Symptoms: Policy divergence, reward collapse

Solutions:

Use TIS for off-policy correction: --use-tis --tis-threshold 0.9
Verify log probs match between SGLang and Megatron
Enable R3 for MoE models
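
One way to carry out the "verify log probs match" step is to compare the per-token log probabilities stored at rollout time (`rollout_log_probs` on the Sample) with the values the training engine computes for the same tokens. The sketch below is a hypothetical helper, not a miles utility; it reports the largest per-token gap and a simple sample-based KL estimate, the mean of rollout log prob minus training log prob over tokens drawn from the rollout policy.

```python
from typing import Dict, List


def logprob_mismatch(rollout_log_probs: List[float],
                     train_log_probs: List[float]) -> Dict[str, float]:
    """Summarize the gap between rollout-side and training-side log probs."""
    if not rollout_log_probs or len(rollout_log_probs) != len(train_log_probs):
        raise ValueError("expected two non-empty lists of equal length")
    diffs = [r - t for r, t in zip(rollout_log_probs, train_log_probs)]
    return {
        "max_abs_diff": max(abs(d) for d in diffs),
        # Monte Carlo estimate of KL(rollout || train) on the sampled tokens.
        "kl_estimate": sum(diffs) / len(diffs),
    }


# Made-up values for three response tokens.
rollout = [-0.10, -1.20, -0.50]
train = [-0.10, -1.25, -0.48]
print(logprob_mismatch(rollout, train))
```

Both numbers should sit at or near zero when the two engines agree; a gap that grows over training is the symptom this section describes.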

---

Supported Models

Family	Models	MoE Support
DeepSeek	R1, V3, V3.2	Full
Qwen	2, 2.5, 3 (including MoE)	Full
Llama	3, 3.1, 3.3, 4	Dense only
Gemma	2, 3, 3N	Dense only
GLM	4.5, 4.6, 4.7	Dense only
MiniMax	M2, M2.1	Full

---

Resources

GitHub: https://github.com/radixark/miles
Introduction Blog: https://lmsys.org/blog/2025-11-19-miles/
Slime (upstream): https://github.com/THUDM/slime
SGLang: https://github.com/sgl-project/sglang

miles API Reference

Overview

miles is an enterprise-grade RL framework built on slime, adding advanced features for large-scale MoE training:

Unified FP8 training and inference
INT4 Quantization-Aware Training
Rollout Routing Replay (R3)
Speculative RL training

Note: miles inherits slime's configuration system. See slime API Reference for base arguments.

Core Data Structures

miles uses the same Sample dataclass as slime with the rollout_routed_experts field for MoE routing replay.

Quick Start

python train.py \
    --advantage-estimator grpo \
    --model-name qwen3-30b-a3b \
    --hf-checkpoint /path/to/qwen3-30b-a3b-hf \
    --rollout-batch-size 512 \
    --n-samples-per-prompt 8

Configuration Options

miles inherits slime's three argument categories (Megatron, SGLang with --sglang- prefix, and slime-specific). Key additions:

Cluster Resources (inherited from slime)

--actor-num-nodes 1
--actor-num-gpus-per-node 8
--rollout-num-gpus 8
--rollout-num-gpus-per-engine 2
--colocate

Megatron Parallelism (inherited from slime)

--tensor-model-parallel-size 8
--pipeline-model-parallel-size 2
--expert-model-parallel-size 4    # MoE expert parallelism

Speculative Decoding

Verified flags from miles documentation:

# Basic speculative decoding
--sglang-speculative-algorithm EAGLE
--sglang-speculative-num-steps 3
--sglang-speculative-eagle-topk 1
--sglang-speculative-num-draft-tokens 4
--sglang-enable-draft-weights-cpu-backup

# Draft model path
--sglang-speculative-draft-model-path /your/draft/model/path

# Online SFT for draft model (MTP)
--mtp-num-layers 1
--enable-mtp-training
--mtp-loss-scaling-factor 0.2

Note: Online MTP training requires a torch dist checkpoint with MTP weights. Add --mtp-num-layers 1 during checkpoint conversion from HuggingFace to torch dist format.

Key Features (Conceptual)

The following features are documented in miles but specific CLI flags are not publicly documented. Consult the miles repository for latest configuration options.

Unified FP8 Pipeline

End-to-end FP8 sampling and training that eliminates quantization-induced discrepancy causing RL collapse in MoE models.

Rollout Routing Replay (R3)

Records expert routing decisions during SGLang inference and replays them during Megatron training for bit-wise expert alignment.

INT4 Quantization-Aware Training

Enables single-machine deployment of 1TB+ models (e.g., on H200).

Memory Savings with INT4:

Model Size	BF16 VRAM	INT4 VRAM	Reduction
70B	140GB	45GB	3.1x
235B	470GB	150GB	3.1x
671B	1.3TB	420GB	3.1x

Train-Inference Alignment

miles achieves "exactly 0 KL divergence" between training and inference through infrastructure optimizations:

Flash Attention 3
DeepGEMM
Batch-invariant kernels from Thinking Machines Lab
torch.compile integration

Truncated/Masked Importance Sampling (TIS/MIS)

Algorithmic corrections for off-policy training. See slime documentation for --use-tis flag.
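
For intuition, the two corrections can be sketched as operations on the per-token importance ratio between the training policy and the rollout policy. This is one common reading of truncated and masked importance sampling, offered as an assumption: the function names and the `cap` parameter are hypothetical, and the source does not state how `--tis-threshold` maps onto them or what exact rule miles and slime apply.

```python
import math
from typing import List


def importance_ratios(train_log_probs: List[float],
                      rollout_log_probs: List[float]) -> List[float]:
    # ratio = p_train(token) / p_rollout(token), computed from log probs.
    return [math.exp(t - r) for t, r in zip(train_log_probs, rollout_log_probs)]


def truncated_weights(ratios: List[float], cap: float) -> List[float]:
    # Truncation: ratios above the cap are clipped down to the cap.
    return [min(r, cap) for r in ratios]


def masked_weights(ratios: List[float], cap: float) -> List[float]:
    # Masking: tokens whose ratio exceeds the cap are dropped from the loss.
    return [r if r <= cap else 0.0 for r in ratios]


ratios = importance_ratios([-1.0, 0.0], [-1.0, -2.0])
print(truncated_weights(ratios, cap=2.0))
print(masked_weights(ratios, cap=2.0))
```

Truncation keeps every token but limits how much any one token can be up-weighted, while masking removes tokens whose training and rollout probabilities disagree too strongly.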

Custom Functions

Same interface as slime:

--custom-generate-function-path generate.py
--custom-rm-path reward.py

Supported Models

Family	Models	MoE Support
DeepSeek	R1, V3, V3.2	Full
Qwen	2, 2.5, 3 (including MoE)	Full
Llama	3, 3.1, 3.3, 4	Dense only
Gemma	2, 3, 3N	Dense only
GLM	4.5, 4.6, 4.7	Dense only
MiniMax	M2, M2.1	Full

Resources

GitHub: https://github.com/radixark/miles
Introduction Blog: https://lmsys.org/blog/2025-11-19-miles/
Slime (upstream): https://github.com/THUDM/slime
SGLang: https://github.com/sgl-project/sglang

miles Troubleshooting Guide

FP8 Training Issues

Issue: FP8 Training Collapse

Symptoms: Loss explodes, NaN values, reward collapses

Solutions:

1. Use block scaling:

--fp8-recipe blockwise
export NVTE_FP8_BLOCK_SCALING_FP32_SCALES=1

2. Enable R3 for MoE models:

--use-r3

3. Reduce learning rate:

--lr 5e-7  # Reduce from 1e-6

4. Warm up from BF16:

--warmup-steps 100
--warmup-precision bf16

Issue: FP8 vs BF16 Accuracy Gap

Symptoms: FP8 model underperforms BF16 baseline

Solutions:

1. Use E4M3 format for activations:

--fp8-format e4m3

2. Enable dynamic scaling:

--fp8-dynamic-scaling

3. Skip sensitive layers:

--fp8-skip-layers "lm_head,embed"

Train-Inference Mismatch Issues

Issue: Policy Divergence

Symptoms: Model behavior differs between training and inference

Solutions:

1. Enable Rollout Routing Replay:

--use-r3

2. Use importance sampling correction:

--use-tis --tis-threshold 0.9

3. Verify log probs match:

--verify-logprobs

Issue: Expert Routing Mismatch (MoE)

Symptoms: Different experts activated during train vs inference

Solutions:

1. Enable R3:

--use-r3
--r3-buffer-size 1000

2. Use deterministic routing:

--deterministic-expert-routing

INT4 Training Issues

Issue: INT4 Accuracy Degradation

Symptoms: Worse performance than BF16 or FP8

Solutions:

1. Increase group size:

--int4-group-size 256  # Increase from 128

2. Use mixed precision for sensitive layers:

--int4-skip-layers "lm_head,embed,layer_norm"

3. Warm start from BF16:

--warmup-steps 100
--warmup-precision bf16

4. Increase learning rate (INT4 often needs higher LR):

--lr 2e-6  # Increase from 1e-6

Issue: INT4 OOM Despite Expected Savings

Symptoms: Still running out of memory with INT4

Solutions:

1. Verify environment variable:

export OPEN_TRAINING_INT4_FAKE_QAT_FLAG=1

2. Check group size alignment:

# Group size must divide hidden dimension evenly
--int4-group-size 128  # Must divide hidden_size
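
The divisibility rule is easy to check before launching. The helper below is a hypothetical sketch, and the hidden sizes and candidate group sizes are illustrative values rather than settings taken from miles.

```python
from typing import List, Sequence


def valid_group_sizes(hidden_size: int,
                      candidates: Sequence[int] = (32, 64, 128, 256)) -> List[int]:
    """Return the candidate group sizes that divide hidden_size evenly."""
    return [g for g in candidates if hidden_size % g == 0]


print(valid_group_sizes(4096))  # every candidate divides 4096
print(valid_group_sizes(2880))  # 128 and 256 do not divide 2880
```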

Speculative RL Issues

Issue: Low Acceptance Rate

Symptoms: Draft model tokens frequently rejected

Solutions:

1. Reduce lookahead:

--spec-lookahead 3  # Reduce from 5

2. Update draft more frequently:

--online-sft-interval 5  # Reduce from 10

3. Increase draft learning rate:

--draft-lr 1e-5  # Increase

Issue: Draft Model Drift

Symptoms: Acceptance rate drops over time

Solutions:

1. Enable online SFT:

--online-sft-interval 5

2. Use EMA for draft updates:

--draft-ema-decay 0.99

3. Reinitialize draft periodically:

--reinit-draft-interval 1000

Issue: Speculative Training Slower Than Expected

Symptoms: Not achieving expected 25%+ speedup

Solutions:

1. Verify draft model is small enough:

# Draft should be 1/4 to 1/10 size of target

2. Check lookahead is optimal:

--spec-lookahead 5  # Sweet spot for most models

3. Profile to find bottleneck:

--profile-speculative

Weight Synchronization Issues

Issue: Zero-Copy Sync Failures

Symptoms: Errors with CUDA IPC, weight corruption

Solutions:

1. Verify CUDA IPC support:

nvidia-smi topo -m  # Check GPU topology

2. Fall back to standard sync:

# Remove --use-zero-copy-sync

3. Increase bucket size:

--sync-bucket-size 2147483648  # 2GB

Issue: Slow Weight Sync Despite Zero-Copy

Symptoms: Weight sync still slow

Solutions:

1. Use colocated mode:

--colocate

2. Enable async weight transfer:

--async-weight-sync

MoE-Specific Issues

Issue: Expert Load Imbalance

Symptoms: Some experts heavily loaded, others unused

Solutions:

1. Enable load balancing loss:

--aux-loss-coef 0.01

2. Use capacity factor:

--moe-capacity-factor 1.25

Issue: Expert Parallelism OOM

Symptoms: OOM with large MoE models

Solutions:

1. Increase expert parallelism:

--expert-model-parallel-size 8  # Increase from 4

2. Reduce batch size per GPU:

--micro-batch-size 1

3. Enable expert offloading:

--offload-experts

Multi-Agent Issues

Issue: Co-Evolution Instability

Symptoms: Agents oscillate or one dominates

Solutions:

1. Use alternating updates:

co_evolution:
  strategy: alternating

2. Reduce co-evolution frequency:

--co-evolution-interval 20  # Increase from 10

3. Add population diversity:

co_evolution:
  population_size: 4

Debugging Tips

Enable Verbose Logging

--log-level DEBUG
export MILES_DEBUG=1

Check FP8 Tensors

# Verify FP8 is active by printing each parameter's dtype.
# `model` is assumed to be the already-loaded torch.nn.Module.
for name, param in model.named_parameters():
    print(f"{name}: {param.dtype}")

Profile Training

--profile
--profile-dir /path/to/profile

Verify R3 Is Working

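# Check routing is being recorded.
# `samples` is assumed to be the list of Sample objects produced by a rollout.
sample = samples[0]
assert sample.rollout_routed_experts is not None, "no routing was recorded"
assert len(sample.rollout_routed_experts) > 0, "routing record is empty"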

Monitor GPU Memory

watch -n 1 nvidia-smi

Resources

GitHub Issues: https://github.com/radixark/miles/issues
Unified FP8 Blog: https://lmsys.org/blog/2025-11-25-fp8-rl/
Train-Inference Mismatch Tutorial: https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/blob/main/rlhf/slime/mismatch/blog-en.md
SGLang Discord: Community support

How it compares

Use miles-rl-training for slime-based MoE GRPO with FP8 and routing replay; use torchforge-rl-training for Monarch/TorchTitan asynchronous Forge actor setups.

FAQ

What does miles add on top of slime for RL?

miles-rl-training documents unified FP8 training and inference, INT4 quantization-aware training, Rollout Routing Replay (R3), and speculative RL training while inheriting slime's configuration system and Sample dataclass fields.

How do you start a GRPO job in miles?

miles-rl-training shows a quick-start CLI: `python train.py --advantage-estimator grpo --model-name qwen3-30b-a3b --hf-checkpoint /path/to/checkpoint`, with additional miles-specific flags documented in the API reference.

Is Miles Rl Training safe to install?

skills.sh reports 2 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.
