miles-rl-training

miles: Enterprise-Grade RL for Large-Scale Model Training

miles is a high-performance, enterprise-ready RL framework optimized for large-scale model post-training. Built as a production fork of slime, it addresses critical challenges in MoE training stability, low-precision training, and train-inference alignment.

When to Use miles

Choose miles when you need:
  • Training 1TB+ MoE models (DeepSeek V3, Qwen3-MoE)
  • FP8 or INT4 quantization-aware training
  • Bit-wise identical train-inference alignment
  • Speculative RL for maximum throughput
  • Production stability with enterprise support
Consider alternatives when:
  • You want the research-grade original → use slime
  • You need flexible backend swapping → use verl
  • You want PyTorch-native abstractions → use torchforge

Key Features

Low-Precision Training

  • Unified FP8: End-to-end FP8 for both inference and training
  • INT4 QAT: Fit 1TB-scale models in single-machine VRAM (H200)
  • Rollout Routing Replay (R3): Bit-wise expert alignment for MoE

Performance Optimizations

  • Speculative RL: 25%+ rollout speedup with online SFT draft models
  • Zero-Copy Weight Sync: CUDA IPC zero-copy mapping
  • Partial Rollout: Recycle half-finished trajectories
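The partial-rollout recycling above can be sketched with a small buffer that resumes cut-off generations in the next round; this is an illustrative sketch under assumed names, not the miles API:

```python
class PartialRolloutBuffer:
    """Recycles trajectories cut off mid-generation (e.g., by a step
    budget) so their tokens are reused next round, not discarded."""

    def __init__(self):
        self._pending = []  # half-finished trajectories awaiting resumption

    def add(self, prompt_tokens, generated_tokens, finished):
        if finished:
            return  # finished samples go straight to training, not here
        # Next round resumes generation from prompt + partial response.
        self._pending.append(prompt_tokens + generated_tokens)

    def drain(self):
        """Return all recycled prefixes and clear the buffer."""
        pending, self._pending = self._pending, []
        return pending
```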

Train-Inference Alignment

  • TIS/MIS: Truncated/Masked Importance Sampling for off-policy correction
  • Kernel-level optimization: FlashAttention-3, DeepGEMM integration
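The TIS/MIS correction amounts to per-token importance ratios between the training-time and rollout-time log probs; a minimal sketch (the threshold value and exact masking rule are assumptions, not miles' implementation):

```python
import math

def tis_weights(train_logps, rollout_logps, threshold=2.0):
    """Truncated Importance Sampling: per-token ratios between the
    training policy and the rollout (inference) policy, clipped at a
    threshold to bound the variance of the off-policy correction."""
    weights = []
    for lp_train, lp_rollout in zip(train_logps, rollout_logps):
        ratio = math.exp(lp_train - lp_rollout)
        weights.append(min(ratio, threshold))  # truncate large ratios
    return weights

def mis_weights(train_logps, rollout_logps, threshold=2.0):
    """Masked Importance Sampling: instead of clipping, zero out tokens
    whose ratio exceeds the threshold so they do not contribute at all."""
    ratios = (math.exp(a - b) for a, b in zip(train_logps, rollout_logps))
    return [r if r <= threshold else 0.0 for r in ratios]
```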

Installation


Recommended: Docker

bash
docker pull radixark/miles:latest
docker run --rm --gpus all --ipc=host --shm-size=16g \
    -it radixark/miles:latest /bin/bash

From source

bash
git clone https://github.com/radixark/miles.git
cd miles
pip install -r requirements.txt
pip install -e .

Quick Start

miles inherits slime's configuration system. Basic training:
bash
python train.py \
    --advantage-estimator grpo \
    --model-name qwen3-30b-a3b \
    --hf-checkpoint /path/to/qwen3-30b-a3b-hf \
    --rollout-batch-size 512 \
    --n-samples-per-prompt 8

Workflow 1: Large MoE Training

Use this workflow for training large MoE models like DeepSeek V3 or Qwen3-MoE.

Prerequisites Checklist

  • H100/H200 GPUs with FP8 support
  • MoE model (DeepSeek V3, Qwen3-MoE)
  • Docker environment with miles

Step 1: Environment Setup


FP8 block scaling (recommended for stability)

bash
export NVTE_FP8_BLOCK_SCALING_FP32_SCALES=1
export CUDA_DEVICE_MAX_CONNECTIONS=1

Step 2: Configure Training

bash
python train.py \
    --actor-num-gpus-per-node 8 \
    --rollout-num-gpus 8 \
    --hf-checkpoint /path/to/deepseek-v3 \
    --advantage-estimator grpo \
    --tensor-model-parallel-size 8 \
    --expert-model-parallel-size 4 \
    --prompt-data /path/to/data.jsonl \
    --num-rollout 3000

Verification Checklist

  • Model loads without errors
  • Routing decisions are consistent
  • No NaN/Inf in loss values

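The NaN/Inf item on the checklist above can be automated with a small guard that fails fast; a minimal sketch, not part of miles:

```python
import math

def assert_finite_loss(loss_values, step):
    """Abort early when training produces NaN/Inf losses, typically the
    first visible symptom of FP8/MoE numerical instability."""
    for v in loss_values:
        if math.isnan(v) or math.isinf(v):
            raise RuntimeError(f"non-finite loss {v} at step {step}")
    return True
```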

Workflow 2: Speculative RL Training

Use this workflow for maximum rollout throughput with EAGLE speculative decoding.

How Speculative RL Works

  1. Small draft model generates candidate tokens
  2. Target model verifies in parallel
  3. Draft model updated via online SFT to track policy
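The three steps above form a draft-then-verify loop. A minimal greedy sketch of that control flow (EAGLE itself drafts with a learned head and verifies token distributions in one batched pass, so this illustrates the loop shape only, not the algorithm):

```python
def speculative_step(draft_next, target_next, ctx, num_draft_tokens):
    """One draft/verify round of greedy speculative decoding.

    draft_next / target_next are callables mapping a token context to
    the model's next token (stand-ins for real model forward passes).
    Returns the tokens accepted this round.
    """
    # 1. Draft model proposes a short continuation.
    proposal, dctx = [], list(ctx)
    for _ in range(num_draft_tokens):
        tok = draft_next(dctx)
        proposal.append(tok)
        dctx.append(tok)

    # 2. Target model verifies position by position; in a real engine
    #    this is a single parallel forward pass over the proposal.
    accepted, vctx = [], list(ctx)
    for tok in proposal:
        target_tok = target_next(vctx)
        if target_tok != tok:
            accepted.append(target_tok)  # keep target's token, stop early
            break
        accepted.append(tok)
        vctx.append(tok)
    return accepted
```

Throughput improves because every accepted draft token saves one sequential target-model step; the online SFT of step 3 keeps the acceptance rate high as the policy moves.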

Step 1: Enable Speculative Decoding

miles supports EAGLE speculative decoding via SGLang:
bash
python train.py \
    --actor-num-gpus-per-node 8 \
    --hf-checkpoint /path/to/target-model \
    --sglang-speculative-algorithm EAGLE \
    --sglang-speculative-num-steps 3 \
    --sglang-speculative-eagle-topk 1 \
    --sglang-speculative-num-draft-tokens 4 \
    --sglang-speculative-draft-model-path /path/to/draft-model \
    --advantage-estimator grpo \
    --prompt-data /path/to/data.jsonl

Step 2: Enable Online MTP Training (Optional)

For online SFT of the draft model during training:
bash
--mtp-num-layers 1 \
--enable-mtp-training \
--mtp-loss-scaling-factor 0.2
Note: Online MTP training requires a torch dist checkpoint with MTP weights. Add --mtp-num-layers 1 during checkpoint conversion from HuggingFace.

Expected Speedup

  • Standard rollout: Baseline
  • Speculative RL: 25-40% faster rollout
  • With partial rollout: Additional 10-15% throughput

Configuration Reference

miles inherits all slime arguments. See slime API Reference for the complete list.

Cluster Resources (from slime)

bash
--actor-num-nodes 1
--actor-num-gpus-per-node 8
--rollout-num-gpus 8
--rollout-num-gpus-per-engine 2
--colocate

Megatron Parallelism (from slime)

bash
--tensor-model-parallel-size 8
--pipeline-model-parallel-size 2
--expert-model-parallel-size 4    # MoE expert parallelism

Speculative Decoding (miles-specific)

bash
--sglang-speculative-algorithm EAGLE
--sglang-speculative-num-steps 3
--sglang-speculative-eagle-topk 1
--sglang-speculative-num-draft-tokens 4
--sglang-enable-draft-weights-cpu-backup
--sglang-speculative-draft-model-path /your/draft/model/path

Online MTP Training (miles-specific)

bash
--mtp-num-layers 1
--enable-mtp-training
--mtp-loss-scaling-factor 0.2

Key Features (Conceptual)

The following features are documented in miles but specific CLI flags may vary. Consult the miles repository for latest configuration.

Unified FP8 Pipeline

End-to-end FP8 sampling and training eliminates the quantization-induced train-inference discrepancy that can cause RL collapse in MoE models.
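Conceptually, block scaling gives each small block of a tensor its own FP32 scale, so a single outlier cannot destroy quantization precision for the whole tensor. A pure-Python simulation under assumed parameters (the block size is an assumption, and plain scaling stands in for real E4M3 rounding):

```python
FP8_E4M3_MAX = 448.0  # largest finite value in the FP8 E4M3 format

def quantize_blockwise(values, block_size=128):
    """Simulate FP8 block scaling: each block gets an FP32 scale chosen
    so its max magnitude maps onto the FP8 range. Returns (scaled
    blocks, scales); real kernels store scaled values in actual FP8."""
    blocks, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block) or 1.0
        scale = FP8_E4M3_MAX / amax
        blocks.append([v * scale for v in block])  # now within ±448
        scales.append(scale)
    return blocks, scales

def dequantize_blockwise(blocks, scales):
    return [v / s for block, s in zip(blocks, scales) for v in block]
```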

Rollout Routing Replay (R3)

Records expert routing decisions during SGLang inference and replays them during Megatron training for bit-wise expert alignment.
How R3 Works:
  1. During SGLang inference, expert routing decisions are recorded
  2. Routing decisions are stored in sample.rollout_routed_experts
  3. During Megatron training, routing is replayed instead of recomputed
  4. Ensures identical expert selection between train and inference
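A conceptual sketch of record-and-replay; apart from the rollout_routed_experts field named above, the function and argument names are hypothetical:

```python
def route_topk(logits, k):
    """Pick the top-k experts for one token from router logits."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

def inference_forward(router_logits_per_token, k, sample):
    """During rollout: compute routing and record it on the sample."""
    sample["rollout_routed_experts"] = [
        route_topk(logits, k) for logits in router_logits_per_token
    ]

def training_forward(router_logits_per_token, k, sample):
    """During training: replay recorded routing instead of recomputing,
    so expert selection is identical even if logits drift slightly."""
    recorded = sample.get("rollout_routed_experts")
    if recorded is not None:
        return recorded  # R3: reuse inference-time decisions
    return [route_topk(logits, k) for logits in router_logits_per_token]
```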

INT4 Quantization-Aware Training

Enables single-machine deployment of 1TB+ models (e.g., on H200).
Memory Savings with INT4:

Model Size   BF16 VRAM   INT4 VRAM   Reduction
70B          140GB       45GB        3.1x
235B         470GB       150GB       3.1x
671B         1.3TB       420GB       3.1x
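The table is roughly consistent with weights-only arithmetic (2 bytes/param for BF16, 0.5 bytes/param for INT4); the gap between this estimate and the table's INT4 column is overhead for quantization scales and runtime state. A back-of-envelope check:

```python
def weight_vram_gb(num_params_billion, bytes_per_param):
    """Approximate VRAM for model weights only, ignoring activations,
    optimizer state, and quantization scale overhead."""
    return num_params_billion * 1e9 * bytes_per_param / 1e9

bf16_671b = weight_vram_gb(671, 2.0)   # ~1342 GB, matching the ~1.3TB row
int4_671b = weight_vram_gb(671, 0.5)   # ~336 GB before overhead
```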

Train-Inference Alignment

miles achieves "exactly 0 KL divergence" between training and inference through:
  • Flash Attention 3
  • DeepGEMM
  • Batch-invariant kernels from Thinking Machines Lab
  • torch.compile integration

Sample Data Structure

miles uses the same Sample dataclass as slime, with the rollout_routed_experts field for MoE routing replay:
python
@dataclass
class Sample:
    prompt: str | list[dict]
    tokens: list[int]
    response: str
    reward: float | dict
    loss_mask: list[int]
    status: Status
    metadata: dict
    rollout_log_probs: list[float]
    rollout_routed_experts: list[list[int]]  # MoE routing for R3
See slime API Reference for the complete Sample definition.

Common Issues and Solutions

Issue: FP8 Training Collapse

Symptoms: Loss explodes, NaN values
Solutions:
  • Use block scaling: export NVTE_FP8_BLOCK_SCALING_FP32_SCALES=1
  • Reduce learning rate: --lr 5e-7
  • Ensure MoE routing is consistent between train and inference

Issue: Speculative Draft Drift

Symptoms: Low acceptance rate over time
Solutions:
  • Enable online MTP training to keep the draft model aligned
  • Reduce speculative steps: --sglang-speculative-num-steps 2
  • Use CPU backup: --sglang-enable-draft-weights-cpu-backup

Issue: Train-Inference Mismatch

Symptoms: Policy divergence, reward collapse
Solutions:
  • Use TIS for off-policy correction: --use-tis --tis-threshold 0.9
  • Verify log probs match between SGLang and Megatron
  • Enable R3 for MoE models

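The log-prob verification in the list above can be scripted as a per-token comparison between the two engines' outputs; a minimal sketch with an assumed tolerance:

```python
def logprob_mismatch_report(rollout_logps, train_logps, atol=1e-3):
    """Compare per-token log probs from inference (SGLang) vs. training
    (Megatron) and report the worst absolute deviation plus how many
    tokens exceed the tolerance. Large deviations signal mismatch."""
    diffs = [abs(a - b) for a, b in zip(rollout_logps, train_logps)]
    return {
        "max_abs_diff": max(diffs) if diffs else 0.0,
        "num_exceeding": sum(d > atol for d in diffs),
    }
```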

Supported Models

Family     Models                       MoE Support
DeepSeek   R1, V3, V3.2                 Full
Qwen       2, 2.5, 3 (including MoE)    Full
Llama      3, 3.1, 3.3, 4               Dense only
Gemma      2, 3, 3N                     Dense only
GLM        4.5, 4.6, 4.7                Dense only
MiniMax    M2, M2.1                     Full

Resources
