miles-rl-training

miles: Enterprise-Grade RL for Large-Scale Model Training

miles is a high-performance, enterprise-ready RL framework optimized for large-scale model post-training. Built as a production fork of slime, it addresses critical challenges in MoE training stability, low-precision training, and train-inference alignment.

When to Use miles

Choose miles when you need:
  • Training 1TB+ MoE models (DeepSeek V3, Qwen3-MoE)
  • FP8 or INT4 quantization-aware training
  • Bit-wise identical train-inference alignment
  • Speculative RL for maximum throughput
  • Production stability with enterprise support
Consider alternatives when:
  • You want the research-grade original → use slime
  • You need flexible backend swapping → use verl
  • You want PyTorch-native abstractions → use torchforge

Key Features

Low-Precision Training

  • Unified FP8: End-to-end FP8 for both inference and training
  • INT4 QAT: Fit 1TB-scale models in single-machine VRAM (H200)
  • Rollout Routing Replay (R3): Bit-wise expert alignment for MoE

Performance Optimizations

  • Speculative RL: 25%+ rollout speedup with online SFT draft models
  • Zero-Copy Weight Sync: CUDA IPC zero-copy mapping
  • Partial Rollout: Recycle half-finished trajectories
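The partial-rollout recycling above can be sketched with a small buffer that resumes cut-off generations in the next round; this is an illustrative sketch under assumed names, not the miles API:

```python
class PartialRolloutBuffer:
    """Recycles trajectories cut off mid-generation (e.g., by a step
    budget) so their tokens are reused next round, not discarded."""

    def __init__(self):
        self._pending = []  # half-finished trajectories awaiting resumption

    def add(self, prompt_tokens, generated_tokens, finished):
        if finished:
            return  # finished samples go straight to training, not here
        # Next round resumes generation from prompt + partial response.
        self._pending.append(prompt_tokens + generated_tokens)

    def drain(self):
        """Return all recycled prefixes and clear the buffer."""
        pending, self._pending = self._pending, []
        return pending
```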

Train-Inference Alignment

  • TIS/MIS: Truncated/Masked Importance Sampling for off-policy correction
  • Kernel-level optimization: FlashAttention-3, DeepGEMM integration
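The TIS/MIS correction amounts to per-token importance ratios between the training-time and rollout-time log probs; a minimal sketch (the threshold value and exact masking rule are assumptions, not miles' implementation):

```python
import math

def tis_weights(train_logps, rollout_logps, threshold=2.0):
    """Truncated Importance Sampling: per-token ratios between the
    training policy and the rollout (inference) policy, clipped at a
    threshold to bound the variance of the off-policy correction."""
    weights = []
    for lp_train, lp_rollout in zip(train_logps, rollout_logps):
        ratio = math.exp(lp_train - lp_rollout)
        weights.append(min(ratio, threshold))  # truncate large ratios
    return weights

def mis_weights(train_logps, rollout_logps, threshold=2.0):
    """Masked Importance Sampling: instead of clipping, zero out tokens
    whose ratio exceeds the threshold so they do not contribute at all."""
    ratios = (math.exp(a - b) for a, b in zip(train_logps, rollout_logps))
    return [r if r <= threshold else 0.0 for r in ratios]
```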

Installation


Recommended: Docker

bash
docker pull radixark/miles:latest
docker run --rm --gpus all --ipc=host --shm-size=16g \
    -it radixark/miles:latest /bin/bash

From source

bash
git clone https://github.com/radixark/miles.git
cd miles
pip install -r requirements.txt
pip install -e .

Quick Start

miles inherits slime's configuration system. Basic training:
bash
python train.py \
    --advantage-estimator grpo \
    --model-name qwen3-30b-a3b \
    --hf-checkpoint /path/to/qwen3-30b-a3b-hf \
    --rollout-batch-size 512 \
    --n-samples-per-prompt 8

Workflow 1: Large MoE Training

Use this workflow for training large MoE models like DeepSeek V3 or Qwen3-MoE.

Prerequisites Checklist

  • H100/H200 GPUs with FP8 support
  • MoE model (DeepSeek V3, Qwen3-MoE)
  • Docker environment with miles

Step 1: Environment Setup


FP8 block scaling (recommended for stability)

bash
export NVTE_FP8_BLOCK_SCALING_FP32_SCALES=1
export CUDA_DEVICE_MAX_CONNECTIONS=1

Step 2: Configure Training

bash
python train.py \
    --actor-num-gpus-per-node 8 \
    --rollout-num-gpus 8 \
    --hf-checkpoint /path/to/deepseek-v3 \
    --advantage-estimator grpo \
    --tensor-model-parallel-size 8 \
    --expert-model-parallel-size 4 \
    --prompt-data /path/to/data.jsonl \
    --num-rollout 3000

Verification Checklist

  • Model loads without errors
  • Routing decisions are consistent
  • No NaN/Inf in loss values

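The NaN/Inf item on the checklist above can be automated with a small guard that fails fast; a minimal sketch, not part of miles:

```python
import math

def assert_finite_loss(loss_values, step):
    """Abort early when training produces NaN/Inf losses, typically the
    first visible symptom of FP8/MoE numerical instability."""
    for v in loss_values:
        if math.isnan(v) or math.isinf(v):
            raise RuntimeError(f"non-finite loss {v} at step {step}")
    return True
```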

Workflow 2: Speculative RL Training

Use this workflow for maximum rollout throughput with EAGLE speculative decoding.

How Speculative RL Works

  1. Small draft model generates candidate tokens
  2. Target model verifies in parallel
  3. Draft model updated via online SFT to track policy
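The three steps above form a draft-then-verify loop. A minimal greedy sketch of that control flow (EAGLE itself drafts with a learned head and verifies token distributions in one batched pass, so this illustrates the loop shape only, not the algorithm):

```python
def speculative_step(draft_next, target_next, ctx, num_draft_tokens):
    """One draft/verify round of greedy speculative decoding.

    draft_next / target_next are callables mapping a token context to
    the model's next token (stand-ins for real model forward passes).
    Returns the tokens accepted this round.
    """
    # 1. Draft model proposes a short continuation.
    proposal, dctx = [], list(ctx)
    for _ in range(num_draft_tokens):
        tok = draft_next(dctx)
        proposal.append(tok)
        dctx.append(tok)

    # 2. Target model verifies position by position; in a real engine
    #    this is a single parallel forward pass over the proposal.
    accepted, vctx = [], list(ctx)
    for tok in proposal:
        target_tok = target_next(vctx)
        if target_tok != tok:
            accepted.append(target_tok)  # keep target's token, stop early
            break
        accepted.append(tok)
        vctx.append(tok)
    return accepted
```

Throughput improves because every accepted draft token saves one sequential target-model step; the online SFT of step 3 keeps the acceptance rate high as the policy moves.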

Step 1: Enable Speculative Decoding

miles supports EAGLE speculative decoding via SGLang:
bash
python train.py \
    --actor-num-gpus-per-node 8 \
    --hf-checkpoint /path/to/target-model \
    --sglang-speculative-algorithm EAGLE \
    --sglang-speculative-num-steps 3 \
    --sglang-speculative-eagle-topk 1 \
    --sglang-speculative-num-draft-tokens 4 \
    --sglang-speculative-draft-model-path /path/to/draft-model \
    --advantage-estimator grpo \
    --prompt-data /path/to/data.jsonl

Step 2: Enable Online MTP Training (Optional)

For online SFT of the draft model during training:
bash
--mtp-num-layers 1 \
--enable-mtp-training \
--mtp-loss-scaling-factor 0.2
Note: Online MTP training requires a torch dist checkpoint with MTP weights. Add --mtp-num-layers 1 during checkpoint conversion from HuggingFace.

Expected Speedup

  • Standard rollout: Baseline
  • Speculative RL: 25-40% faster rollout
  • With partial rollout: Additional 10-15% throughput

Configuration Reference

miles inherits all slime arguments. See slime API Reference for the complete list.

Cluster Resources (from slime)

bash
--actor-num-nodes 1
--actor-num-gpus-per-node 8
--rollout-num-gpus 8
--rollout-num-gpus-per-engine 2
--colocate

Megatron Parallelism (from slime)

bash
--tensor-model-parallel-size 8
--pipeline-model-parallel-size 2
--expert-model-parallel-size 4    # MoE expert parallelism

Speculative Decoding (miles-specific)

bash
--sglang-speculative-algorithm EAGLE
--sglang-speculative-num-steps 3
--sglang-speculative-eagle-topk 1
--sglang-speculative-num-draft-tokens 4
--sglang-enable-draft-weights-cpu-backup
--sglang-speculative-draft-model-path /your/draft/model/path

Online MTP Training (miles-specific)

bash
--mtp-num-layers 1
--enable-mtp-training
--mtp-loss-scaling-factor 0.2

Key Features (Conceptual)

The following features are documented in miles but specific CLI flags may vary. Consult the miles repository for latest configuration.

Unified FP8 Pipeline

End-to-end FP8 sampling and training eliminates the quantization-induced train-inference discrepancy that can cause RL collapse in MoE models.
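Conceptually, block scaling gives each small block of a tensor its own FP32 scale, so a single outlier cannot destroy quantization precision for the whole tensor. A pure-Python simulation under assumed parameters (the block size is an assumption, and plain scaling stands in for real E4M3 rounding):

```python
FP8_E4M3_MAX = 448.0  # largest finite value in the FP8 E4M3 format

def quantize_blockwise(values, block_size=128):
    """Simulate FP8 block scaling: each block gets an FP32 scale chosen
    so its max magnitude maps onto the FP8 range. Returns (scaled
    blocks, scales); real kernels store scaled values in actual FP8."""
    blocks, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block) or 1.0
        scale = FP8_E4M3_MAX / amax
        blocks.append([v * scale for v in block])  # now within ±448
        scales.append(scale)
    return blocks, scales

def dequantize_blockwise(blocks, scales):
    return [v / s for block, s in zip(blocks, scales) for v in block]
```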

Rollout Routing Replay (R3)

Records expert routing decisions during SGLang inference and replays them during Megatron training for bit-wise expert alignment.
How R3 Works:
  1. During SGLang inference, expert routing decisions are recorded
  2. Routing decisions are stored in sample.rollout_routed_experts
  3. During Megatron training, routing is replayed instead of recomputed
  4. Ensures identical expert selection between train and inference
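A conceptual sketch of record-and-replay; apart from the rollout_routed_experts field named above, the function and argument names are hypothetical:

```python
def route_topk(logits, k):
    """Pick the top-k experts for one token from router logits."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

def inference_forward(router_logits_per_token, k, sample):
    """During rollout: compute routing and record it on the sample."""
    sample["rollout_routed_experts"] = [
        route_topk(logits, k) for logits in router_logits_per_token
    ]

def training_forward(router_logits_per_token, k, sample):
    """During training: replay recorded routing instead of recomputing,
    so expert selection is identical even if logits drift slightly."""
    recorded = sample.get("rollout_routed_experts")
    if recorded is not None:
        return recorded  # R3: reuse inference-time decisions
    return [route_topk(logits, k) for logits in router_logits_per_token]
```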

INT4 Quantization-Aware Training

Enables single-machine deployment of 1TB+ models (e.g., on H200).
Memory Savings with INT4:

Model Size   BF16 VRAM   INT4 VRAM   Reduction
70B          140GB       45GB        3.1x
235B         470GB       150GB       3.1x
671B         1.3TB       420GB       3.1x
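The table is roughly consistent with weights-only arithmetic (2 bytes/param for BF16, 0.5 bytes/param for INT4); the gap between this estimate and the table's INT4 column is overhead for quantization scales and runtime state. A back-of-envelope check:

```python
def weight_vram_gb(num_params_billion, bytes_per_param):
    """Approximate VRAM for model weights only, ignoring activations,
    optimizer state, and quantization scale overhead."""
    return num_params_billion * 1e9 * bytes_per_param / 1e9

bf16_671b = weight_vram_gb(671, 2.0)   # ~1342 GB, matching the ~1.3TB row
int4_671b = weight_vram_gb(671, 0.5)   # ~336 GB before overhead
```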

Train-Inference Alignment

miles achieves "exactly 0 KL divergence" between training and inference through:
  • Flash Attention 3
  • DeepGEMM
  • Batch-invariant kernels from Thinking Machines Lab
  • torch.compile integration

Sample Data Structure

miles uses the same Sample dataclass as slime, with the rollout_routed_experts field for MoE routing replay:
python
@dataclass
class Sample:
    prompt: str | list[dict]
    tokens: list[int]
    response: str
    reward: float | dict
    loss_mask: list[int]
    status: Status
    metadata: dict
    rollout_log_probs: list[float]
    rollout_routed_experts: list[list[int]]  # MoE routing for R3
See slime API Reference for the complete Sample definition.

Common Issues and Solutions

Issue: FP8 Training Collapse

Symptoms: Loss explodes, NaN values
Solutions:
  • Use block scaling: export NVTE_FP8_BLOCK_SCALING_FP32_SCALES=1
  • Reduce learning rate: --lr 5e-7
  • Ensure MoE routing is consistent between train and inference

Issue: Speculative Draft Drift

Symptoms: Low acceptance rate over time
Solutions:
  • Enable online MTP training to keep the draft model aligned
  • Reduce speculative steps: --sglang-speculative-num-steps 2
  • Use CPU backup: --sglang-enable-draft-weights-cpu-backup

Issue: Train-Inference Mismatch

Symptoms: Policy divergence, reward collapse
Solutions:
  • Use TIS for off-policy correction: --use-tis --tis-threshold 0.9
  • Verify log probs match between SGLang and Megatron
  • Enable R3 for MoE models

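The log-prob verification in the list above can be scripted as a per-token comparison between the two engines' outputs; a minimal sketch with an assumed tolerance:

```python
def logprob_mismatch_report(rollout_logps, train_logps, atol=1e-3):
    """Compare per-token log probs from inference (SGLang) vs. training
    (Megatron) and report the worst absolute deviation plus how many
    tokens exceed the tolerance. Large deviations signal mismatch."""
    diffs = [abs(a - b) for a, b in zip(rollout_logps, train_logps)]
    return {
        "max_abs_diff": max(diffs) if diffs else 0.0,
        "num_exceeding": sum(d > atol for d in diffs),
    }
```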

Supported Models

Family     Models                       MoE Support
DeepSeek   R1, V3, V3.2                 Full
Qwen       2, 2.5, 3 (including MoE)    Full
Llama      3, 3.1, 3.3, 4               Dense only
Gemma      2, 3, 3N                     Dense only
GLM        4.5, 4.6, 4.7                Dense only
MiniMax    M2, M2.1                     Full

Resources
