slime-rl-training


slime: LLM Post-Training Framework for RL Scaling


slime is an LLM post-training framework from Tsinghua's THUDM team, powering GLM-4.5, GLM-4.6, and GLM-4.7. It connects Megatron-LM for training with SGLang for high-throughput rollout generation.

When to Use slime


Choose slime when you need:
  • Megatron-LM native training with SGLang inference
  • Custom data generation workflows with flexible data buffers
  • Training GLM, Qwen3, DeepSeek V3, or Llama 3 models
  • Research-grade framework with production backing (Z.ai)
Consider alternatives when:
  • You need enterprise-grade stability features → use miles
  • You want flexible backend swapping → use verl
  • You need PyTorch-native abstractions → use torchforge

Key Features


  • Training: Megatron-LM with full parallelism support (TP, PP, DP, SP)
  • Rollout: SGLang-based high-throughput generation with router
  • Data Buffer: Flexible prompt management and sample storage
  • Models: GLM-4.x, Qwen3, DeepSeek V3/R1, Llama 3

Architecture Overview


┌─────────────────────────────────────────────────────────┐
│                    Data Buffer                          │
│ - Prompt initialization and management                  │
│ - Custom data generation and filtering                  │
│ - Rollout sample storage                                │
└─────────────┬───────────────────────────┬───────────────┘
              │                           │
┌─────────────▼───────────┐ ┌─────────────▼───────────────┐
│ Training (Megatron-LM)  │ │ Rollout (SGLang + Router)   │
│ - Actor model training  │ │ - Response generation       │
│ - Critic (optional)     │ │ - Reward/verifier output    │
│ - Weight sync to rollout│ │ - Multi-turn support        │
└─────────────────────────┘ └─────────────────────────────┘

Installation



Recommended: Docker


```bash
docker pull slimerl/slime:latest
docker run --rm --gpus all --ipc=host --shm-size=16g \
    -it slimerl/slime:latest /bin/bash
```

Inside container


```bash
cd /root/slime && pip install -e . --no-deps
```

From Source


```bash
git clone https://github.com/THUDM/slime.git
cd slime
pip install -r requirements.txt
pip install -e .
```

Quick Start: GRPO Training



Source model configuration


```bash
source scripts/models/qwen3-4B.sh
```

Launch training


```bash
python train.py \
    --actor-num-nodes 1 \
    --actor-num-gpus-per-node 4 \
    --rollout-num-gpus 4 \
    --advantage-estimator grpo \
    --use-kl-loss \
    --kl-loss-coef 0.001 \
    --rollout-batch-size 32 \
    --n-samples-per-prompt 8 \
    --global-batch-size 256 \
    --num-rollout 3000 \
    --prompt-data /path/to/data.jsonl \
    ${MODEL_ARGS[@]} ${CKPT_ARGS[@]}
```


---

Workflow 1: Standard GRPO Training


Use this workflow for training reasoning models with group-relative advantages.

Prerequisites Checklist


  • Docker environment or Megatron-LM + SGLang installed
  • Model checkpoint (HuggingFace or Megatron format)
  • Training data in JSONL format

Step 1: Prepare Data



data.jsonl format


```json
{"prompt": "What is 2 + 2?", "label": "4"}
{"prompt": "Solve: 3x = 12", "label": "x = 4"}
```

Or with chat format:

```json
{
    "prompt": [
        {"role": "system", "content": "You are a math tutor."},
        {"role": "user", "content": "What is 15 + 27?"}
    ],
    "label": "42"
}
```
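The JSONL file above can be produced programmatically; a minimal sketch using only the standard library (the file name and records are illustrative, not part of slime):

```python
import json

def write_prompts(path, records):
    """Write prompt/label records as one JSON object per line (JSONL)."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")

records = [
    {"prompt": "What is 2 + 2?", "label": "4"},
    {"prompt": "Solve: 3x = 12", "label": "x = 4"},
]
write_prompts("train.jsonl", records)

# Each line must parse back to a dict containing the keys you pass
# via --input-key and --label-key.
with open("train.jsonl", encoding="utf-8") as f:
    rows = [json.loads(line) for line in f]
```
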

Step 2: Configure Model


Choose a pre-configured model script:

List available models


```bash
ls scripts/models/
# glm4-9B.sh, qwen3-4B.sh, qwen3-30B-A3B.sh, deepseek-v3.sh, llama3-8B.sh, ...
```

Source your model


```bash
source scripts/models/qwen3-4B.sh
```

Step 3: Launch Training


```bash
python train.py \
    --actor-num-nodes 1 \
    --actor-num-gpus-per-node 8 \
    --rollout-num-gpus 8 \
    --advantage-estimator grpo \
    --use-kl-loss \
    --kl-loss-coef 0.001 \
    --prompt-data /path/to/train.jsonl \
    --input-key prompt \
    --label-key label \
    --apply-chat-template \
    --rollout-batch-size 32 \
    --n-samples-per-prompt 8 \
    --global-batch-size 256 \
    --num-rollout 3000 \
    --save-interval 100 \
    --eval-interval 50 \
    ${MODEL_ARGS[@]}
```

Step 4: Monitor Training


  • Check TensorBoard:
    tensorboard --logdir outputs/
  • Verify reward curves are increasing
  • Monitor GPU utilization across nodes


Workflow 2: Asynchronous Training


Use async mode for higher throughput by overlapping rollout and training.

When to Use Async


  • Large models with long generation times
  • High GPU idle time in synchronous mode
  • Sufficient memory for buffering

Launch Async Training


```bash
python train_async.py \
    --actor-num-nodes 1 \
    --actor-num-gpus-per-node 8 \
    --rollout-num-gpus 8 \
    --advantage-estimator grpo \
    --async-buffer-size 4 \
    --prompt-data /path/to/train.jsonl \
    ${MODEL_ARGS[@]}
```

Async-Specific Parameters


```bash
--async-buffer-size 4        # Number of rollouts to buffer
--update-weights-interval 2  # Sync weights every N rollouts
```

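Conceptually, async mode is a bounded producer/consumer pipeline: generation fills a buffer while training drains it, and the buffer size caps how stale the rolled-out data can get. A toy asyncio sketch of that pattern (illustrative only, not slime's actual internals):

```python
import asyncio

async def rollout_worker(queue, num_rollouts):
    # Producer: generate rollouts and buffer them.
    for i in range(num_rollouts):
        await asyncio.sleep(0)  # stand-in for SGLang generation
        await queue.put(f"rollout-{i}")
    await queue.put(None)  # sentinel: no more rollouts

async def trainer(queue, trained):
    # Consumer: train on buffered rollouts as they arrive.
    while True:
        batch = await queue.get()
        if batch is None:
            break
        trained.append(batch)

async def main():
    # maxsize plays the role of --async-buffer-size: generation
    # blocks once it is this many rollouts ahead of training.
    queue = asyncio.Queue(maxsize=4)
    trained = []
    await asyncio.gather(rollout_worker(queue, 8), trainer(queue, trained))
    return trained

trained = asyncio.run(main())
```
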

Workflow 3: Multi-Turn Agentic Training


Use this workflow for training agents with tool use or multi-step reasoning.

Prerequisites


  • Custom generate function for multi-turn logic
  • Tool/environment interface

Step 1: Define Custom Generate Function



custom_generate.py


```python
async def custom_generate(args, samples, evaluation=False):
    """Multi-turn generation with tool calling."""
    for sample in samples:
        conversation = sample.prompt
        for turn in range(args.max_turns):
            # Generate response
            response = await generate_single(conversation)

            # Check for tool call
            tool_call = extract_tool_call(response)
            if tool_call:
                tool_result = execute_tool(tool_call)
                conversation.append({"role": "assistant", "content": response})
                conversation.append({"role": "tool", "content": tool_result})
            else:
                break

        sample.response = response
        sample.reward = compute_reward(sample)

    return samples
```
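The helpers in the template (generate_single, extract_tool_call, execute_tool, compute_reward) are user-supplied. As one hypothetical example, an extract_tool_call for a tag-based convention (the `<tool_call>` tag format here is an assumption, not a slime requirement) might look like:

```python
import json
import re

# Assumed convention: the model emits tool calls as
# <tool_call>{"name": ..., "arguments": {...}}</tool_call>
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)

def extract_tool_call(response: str):
    """Return the parsed tool-call dict, or None if the model answered directly."""
    m = TOOL_CALL_RE.search(response)
    if m is None:
        return None
    try:
        return json.loads(m.group(1))
    except json.JSONDecodeError:
        return None  # malformed call: treat as a final answer

call = extract_tool_call(
    '<tool_call>{"name": "search", "arguments": {"q": "GRPO"}}</tool_call>'
)
no_call = extract_tool_call("The answer is 42.")
```
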

Step 2: Launch with Custom Function


```bash
python train.py \
    --custom-generate-function-path custom_generate.py \
    --max-turns 5 \
    --prompt-data /path/to/agent_data.jsonl \
    ${MODEL_ARGS[@]}
```

See examples/search-r1/ for a complete multi-turn search example.


Configuration Reference


Three Argument Categories


slime uses three types of arguments:

1. Megatron Arguments (passed directly):

```bash
--tensor-model-parallel-size 2
--pipeline-model-parallel-size 1
--num-layers 32
--hidden-size 4096
```

2. SGLang Arguments (prefixed with --sglang-):

```bash
--sglang-mem-fraction-static 0.8
--sglang-context-length 8192
--sglang-log-level INFO
```

3. slime Arguments (grouped in the subsections that follow):

Resource allocation


```bash
--actor-num-nodes 1
--actor-num-gpus-per-node 8
--rollout-num-gpus 8
--colocate                   # Share GPUs between training/inference
```

Data


```bash
--prompt-data /path/to/data.jsonl
--input-key prompt
--label-key label
```

Training loop


```bash
--num-rollout 3000
--rollout-batch-size 32
--n-samples-per-prompt 8
--global-batch-size 256
```

Algorithm


```bash
--advantage-estimator grpo   # or: gspo, ppo, reinforce_plus_plus
--use-kl-loss
--kl-loss-coef 0.001
```

Key Constraints


rollout_batch_size × n_samples_per_prompt = global_batch_size × num_steps_per_rollout

Example: 32 × 8 = 256 × 1
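In other words, each rollout must produce a whole number of global batches. A quick sanity check you can run before launching (the function name is ours, not a slime API):

```python
def check_batch_config(rollout_batch_size, n_samples_per_prompt, global_batch_size):
    """Return num_steps_per_rollout, or raise if the sizes don't divide evenly."""
    total_samples = rollout_batch_size * n_samples_per_prompt
    if total_samples % global_batch_size != 0:
        raise ValueError(
            f"{total_samples} samples per rollout is not a multiple of "
            f"global_batch_size={global_batch_size}"
        )
    return total_samples // global_batch_size

# The example from above: 32 x 8 = 256 x 1
steps = check_batch_config(32, 8, 256)
```
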

Data Buffer System


slime's data buffer enables flexible data management:

Basic Data Source


```python
class RolloutDataSource:
    def get_samples(self, num_samples):
        """Fetch prompts from dataset."""
        return self.dataset.sample(num_samples)

    def add_samples(self, samples):
        """Called after generation (no-op by default)."""
        pass
```

Buffered Data Source (Off-Policy)


```python
class RolloutDataSourceWithBuffer(RolloutDataSource):
    def __init__(self):
        self.buffer = []

    def add_samples(self, samples):
        """Store generated samples for reuse."""
        self.buffer.extend(samples)

    def buffer_filter(self, args, buffer, num_samples):
        """Custom selection logic (prioritized, stratified, etc.)."""
        return select_best(buffer, num_samples)
```

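The select_best referenced in the snippet is user-defined. One possible prioritized filter, shown as a sketch (the dict-shaped samples and the reward-ranking policy are our assumptions, not slime's):

```python
def select_best(buffer, num_samples):
    """Prioritized selection: keep the highest-reward samples in the buffer."""
    ranked = sorted(buffer, key=lambda s: s["reward"], reverse=True)
    return ranked[:num_samples]

buffer = [
    {"prompt": "p1", "reward": 0.2},
    {"prompt": "p2", "reward": 0.9},
    {"prompt": "p3", "reward": 0.5},
]
best = select_best(buffer, 2)
```

Other policies (stratified sampling, staleness-weighted selection) slot in the same way: any function from (buffer, num_samples) to a list of samples works.
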

Common Issues and Solutions


Issue: SGLang Engine Crash


Symptoms: Inference engine dies mid-training

Solutions:

```bash
# Enable fault tolerance
--use-fault-tolerance

# Increase memory allocation
--sglang-mem-fraction-static 0.85

# Reduce batch size
--rollout-batch-size 16
```

Issue: Weight Sync Timeout


Symptoms: Training hangs after rollout

Solutions:

```bash
# Increase sync interval
--update-weights-interval 5

# Use colocated mode (no network transfer)
--colocate
```

Issue: OOM During Training


Symptoms: CUDA OOM in backward pass

Solutions:

```bash
# Enable gradient checkpointing
--recompute-activations

# Reduce micro-batch size
--micro-batch-size 1

# Enable sequence parallelism
--sequence-parallel
```

Issue: Slow Data Loading


Symptoms: GPU idle during data fetch

Solutions:

```bash
# Increase data workers
--num-data-workers 4

# Use streaming dataset
--streaming-data
```

---

Supported Models


| Model Family | Configurations |
| --- | --- |
| GLM | GLM-4.5, GLM-4.6, GLM-4.7, GLM-Z1-9B |
| Qwen | Qwen3 (4B, 8B, 30B-A3B), Qwen3-MoE, Qwen2.5 |
| DeepSeek | V3, V3.1, R1 |
| Llama | Llama 3 (8B, 70B) |
| Others | Kimi K2, Moonlight-16B |

Each model has pre-configured scripts in scripts/models/.


Advanced Topics


Co-location Mode


Share GPUs between training and inference to reduce memory:

```bash
python train.py \
    --colocate \
    --actor-num-gpus-per-node 8 \
    --sglang-mem-fraction-static 0.4 \
    ${MODEL_ARGS[@]}
```

Custom Reward Model



custom_rm.py


```python
class CustomRewardModel:
    def __init__(self, model_path):
        self.model = load_model(model_path)

    def compute_reward(self, prompts, responses):
        inputs = self.tokenize(prompts, responses)
        scores = self.model(inputs)
        return scores.tolist()
```

```bash
--custom-rm-path custom_rm.py
```

Multi-Task Evaluation


```bash
--eval-prompt-data aime /path/to/aime.jsonl \
--eval-prompt-data gsm8k /path/to/gsm8k.jsonl \
--n-samples-per-eval-prompt 16
```


Resources
