slime-rl-training


slime: LLM Post-Training Framework for RL Scaling


slime is an LLM post-training framework from Tsinghua's THUDM team, powering GLM-4.5, GLM-4.6, and GLM-4.7. It connects Megatron-LM for training with SGLang for high-throughput rollout generation.

When to Use slime


Choose slime when you need:
  • Megatron-LM native training with SGLang inference
  • Custom data generation workflows with flexible data buffers
  • Training GLM, Qwen3, DeepSeek V3, or Llama 3 models
  • Research-grade framework with production backing (Z.ai)
Consider alternatives when:
  • You need enterprise-grade stability features → use miles
  • You want flexible backend swapping → use verl
  • You need PyTorch-native abstractions → use torchforge

Key Features


  • Training: Megatron-LM with full parallelism support (TP, PP, DP, SP)
  • Rollout: SGLang-based high-throughput generation with router
  • Data Buffer: Flexible prompt management and sample storage
  • Models: GLM-4.x, Qwen3, DeepSeek V3/R1, Llama 3

Architecture Overview


┌─────────────────────────────────────────────────────────┐
│                    Data Buffer                          │
│ - Prompt initialization and management                  │
│ - Custom data generation and filtering                  │
│ - Rollout sample storage                                │
└─────────────┬───────────────────────────┬───────────────┘
              │                           │
┌─────────────▼───────────┐ ┌─────────────▼───────────────┐
│ Training (Megatron-LM)  │ │ Rollout (SGLang + Router)   │
│ - Actor model training  │ │ - Response generation       │
│ - Critic (optional)     │ │ - Reward/verifier output    │
│ - Weight sync to rollout│ │ - Multi-turn support        │
└─────────────────────────┘ └─────────────────────────────┘

Installation



Recommended: Docker


```bash
docker pull slimerl/slime:latest
docker run --rm --gpus all --ipc=host --shm-size=16g \
    -it slimerl/slime:latest /bin/bash
```

Inside container


```bash
cd /root/slime && pip install -e . --no-deps
```

From Source


```bash
git clone https://github.com/THUDM/slime.git
cd slime
pip install -r requirements.txt
pip install -e .
```

Quick Start: GRPO Training



Source model configuration


```bash
source scripts/models/qwen3-4B.sh
```

Launch training


```bash
python train.py \
    --actor-num-nodes 1 \
    --actor-num-gpus-per-node 4 \
    --rollout-num-gpus 4 \
    --advantage-estimator grpo \
    --use-kl-loss \
    --kl-loss-coef 0.001 \
    --rollout-batch-size 32 \
    --n-samples-per-prompt 8 \
    --global-batch-size 256 \
    --num-rollout 3000 \
    --prompt-data /path/to/data.jsonl \
    ${MODEL_ARGS[@]} ${CKPT_ARGS[@]}
```


---

Workflow 1: Standard GRPO Training


Use this workflow for training reasoning models with group-relative advantages.

Prerequisites Checklist


  • Docker environment or Megatron-LM + SGLang installed
  • Model checkpoint (HuggingFace or Megatron format)
  • Training data in JSONL format

Step 1: Prepare Data



data.jsonl format


```json
{"prompt": "What is 2 + 2?", "label": "4"}
{"prompt": "Solve: 3x = 12", "label": "x = 4"}
```

Or with chat format:

```json
{
    "prompt": [
        {"role": "system", "content": "You are a math tutor."},
        {"role": "user", "content": "What is 15 + 27?"}
    ],
    "label": "42"
}
```
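The JSONL file above can be produced programmatically; a minimal sketch using only the standard library (the file name and records are illustrative, not part of slime):

```python
import json

def write_prompts(path, records):
    """Write prompt/label records as one JSON object per line (JSONL)."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")

records = [
    {"prompt": "What is 2 + 2?", "label": "4"},
    {"prompt": "Solve: 3x = 12", "label": "x = 4"},
]
write_prompts("train.jsonl", records)

# Each line must parse back to a dict containing the keys you pass
# via --input-key and --label-key.
with open("train.jsonl", encoding="utf-8") as f:
    rows = [json.loads(line) for line in f]
```
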

Step 2: Configure Model


Choose a pre-configured model script:

List available models


```bash
ls scripts/models/
# glm4-9B.sh, qwen3-4B.sh, qwen3-30B-A3B.sh, deepseek-v3.sh, llama3-8B.sh, ...
```

Source your model


```bash
source scripts/models/qwen3-4B.sh
```

Step 3: Launch Training


```bash
python train.py \
    --actor-num-nodes 1 \
    --actor-num-gpus-per-node 8 \
    --rollout-num-gpus 8 \
    --advantage-estimator grpo \
    --use-kl-loss \
    --kl-loss-coef 0.001 \
    --prompt-data /path/to/train.jsonl \
    --input-key prompt \
    --label-key label \
    --apply-chat-template \
    --rollout-batch-size 32 \
    --n-samples-per-prompt 8 \
    --global-batch-size 256 \
    --num-rollout 3000 \
    --save-interval 100 \
    --eval-interval 50 \
    ${MODEL_ARGS[@]}
```

Step 4: Monitor Training


  • Check TensorBoard:
    tensorboard --logdir outputs/
  • Verify reward curves are increasing
  • Monitor GPU utilization across nodes


Workflow 2: Asynchronous Training


Use async mode for higher throughput by overlapping rollout and training.

When to Use Async


  • Large models with long generation times
  • High GPU idle time in synchronous mode
  • Sufficient memory for buffering

Launch Async Training


```bash
python train_async.py \
    --actor-num-nodes 1 \
    --actor-num-gpus-per-node 8 \
    --rollout-num-gpus 8 \
    --advantage-estimator grpo \
    --async-buffer-size 4 \
    --prompt-data /path/to/train.jsonl \
    ${MODEL_ARGS[@]}
```

Async-Specific Parameters


```bash
--async-buffer-size 4        # Number of rollouts to buffer
--update-weights-interval 2  # Sync weights every N rollouts
```

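Conceptually, async mode is a bounded producer/consumer pipeline: generation fills a buffer while training drains it, and the buffer size caps how stale the rolled-out data can get. A toy asyncio sketch of that pattern (illustrative only, not slime's actual internals):

```python
import asyncio

async def rollout_worker(queue, num_rollouts):
    # Producer: generate rollouts and buffer them.
    for i in range(num_rollouts):
        await asyncio.sleep(0)  # stand-in for SGLang generation
        await queue.put(f"rollout-{i}")
    await queue.put(None)  # sentinel: no more rollouts

async def trainer(queue, trained):
    # Consumer: train on buffered rollouts as they arrive.
    while True:
        batch = await queue.get()
        if batch is None:
            break
        trained.append(batch)

async def main():
    # maxsize plays the role of --async-buffer-size: generation
    # blocks once it is this many rollouts ahead of training.
    queue = asyncio.Queue(maxsize=4)
    trained = []
    await asyncio.gather(rollout_worker(queue, 8), trainer(queue, trained))
    return trained

trained = asyncio.run(main())
```
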

Workflow 3: Multi-Turn Agentic Training


Use this workflow for training agents with tool use or multi-step reasoning.

Prerequisites


  • Custom generate function for multi-turn logic
  • Tool/environment interface

Step 1: Define Custom Generate Function



custom_generate.py


```python
async def custom_generate(args, samples, evaluation=False):
    """Multi-turn generation with tool calling."""
    for sample in samples:
        conversation = sample.prompt
        for turn in range(args.max_turns):
            # Generate response
            response = await generate_single(conversation)

            # Check for tool call
            tool_call = extract_tool_call(response)
            if tool_call:
                tool_result = execute_tool(tool_call)
                conversation.append({"role": "assistant", "content": response})
                conversation.append({"role": "tool", "content": tool_result})
            else:
                break

        sample.response = response
        sample.reward = compute_reward(sample)

    return samples
```
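The helpers in the template (generate_single, extract_tool_call, execute_tool, compute_reward) are user-supplied. As one hypothetical example, an extract_tool_call for a tag-based convention (the `<tool_call>` tag format here is an assumption, not a slime requirement) might look like:

```python
import json
import re

# Assumed convention: the model emits tool calls as
# <tool_call>{"name": ..., "arguments": {...}}</tool_call>
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)

def extract_tool_call(response: str):
    """Return the parsed tool-call dict, or None if the model answered directly."""
    m = TOOL_CALL_RE.search(response)
    if m is None:
        return None
    try:
        return json.loads(m.group(1))
    except json.JSONDecodeError:
        return None  # malformed call: treat as a final answer

call = extract_tool_call(
    '<tool_call>{"name": "search", "arguments": {"q": "GRPO"}}</tool_call>'
)
no_call = extract_tool_call("The answer is 42.")
```
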

Step 2: Launch with Custom Function


```bash
python train.py \
    --custom-generate-function-path custom_generate.py \
    --max-turns 5 \
    --prompt-data /path/to/agent_data.jsonl \
    ${MODEL_ARGS[@]}
```

See examples/search-r1/ for a complete multi-turn search example.


Configuration Reference


Three Argument Categories


slime uses three types of arguments:

1. Megatron Arguments (passed directly):

```bash
--tensor-model-parallel-size 2
--pipeline-model-parallel-size 1
--num-layers 32
--hidden-size 4096
```

2. SGLang Arguments (prefixed with --sglang-):

```bash
--sglang-mem-fraction-static 0.8
--sglang-context-length 8192
--sglang-log-level INFO
```

3. slime Arguments (grouped in the subsections that follow):

Resource allocation


```bash
--actor-num-nodes 1
--actor-num-gpus-per-node 8
--rollout-num-gpus 8
--colocate                   # Share GPUs between training/inference
```

Data


```bash
--prompt-data /path/to/data.jsonl
--input-key prompt
--label-key label
```

Training loop


```bash
--num-rollout 3000
--rollout-batch-size 32
--n-samples-per-prompt 8
--global-batch-size 256
```

Algorithm


```bash
--advantage-estimator grpo   # or: gspo, ppo, reinforce_plus_plus
--use-kl-loss
--kl-loss-coef 0.001
```

Key Constraints


rollout_batch_size × n_samples_per_prompt = global_batch_size × num_steps_per_rollout

Example: 32 × 8 = 256 × 1
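In other words, each rollout must produce a whole number of global batches. A quick sanity check you can run before launching (the function name is ours, not a slime API):

```python
def check_batch_config(rollout_batch_size, n_samples_per_prompt, global_batch_size):
    """Return num_steps_per_rollout, or raise if the sizes don't divide evenly."""
    total_samples = rollout_batch_size * n_samples_per_prompt
    if total_samples % global_batch_size != 0:
        raise ValueError(
            f"{total_samples} samples per rollout is not a multiple of "
            f"global_batch_size={global_batch_size}"
        )
    return total_samples // global_batch_size

# The example from above: 32 x 8 = 256 x 1
steps = check_batch_config(32, 8, 256)
```
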

Data Buffer System


slime's data buffer enables flexible data management:

Basic Data Source


```python
class RolloutDataSource:
    def get_samples(self, num_samples):
        """Fetch prompts from dataset."""
        return self.dataset.sample(num_samples)

    def add_samples(self, samples):
        """Called after generation (no-op by default)."""
        pass
```

Buffered Data Source (Off-Policy)


```python
class RolloutDataSourceWithBuffer(RolloutDataSource):
    def __init__(self):
        self.buffer = []

    def add_samples(self, samples):
        """Store generated samples for reuse."""
        self.buffer.extend(samples)

    def buffer_filter(self, args, buffer, num_samples):
        """Custom selection logic (prioritized, stratified, etc.)."""
        return select_best(buffer, num_samples)
```

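The select_best referenced in the snippet is user-defined. One possible prioritized filter, shown as a sketch (the dict-shaped samples and the reward-ranking policy are our assumptions, not slime's):

```python
def select_best(buffer, num_samples):
    """Prioritized selection: keep the highest-reward samples in the buffer."""
    ranked = sorted(buffer, key=lambda s: s["reward"], reverse=True)
    return ranked[:num_samples]

buffer = [
    {"prompt": "p1", "reward": 0.2},
    {"prompt": "p2", "reward": 0.9},
    {"prompt": "p3", "reward": 0.5},
]
best = select_best(buffer, 2)
```

Other policies (stratified sampling, staleness-weighted selection) slot in the same way: any function from (buffer, num_samples) to a list of samples works.
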

Common Issues and Solutions


Issue: SGLang Engine Crash


Symptoms: Inference engine dies mid-training

Solutions:

```bash
# Enable fault tolerance
--use-fault-tolerance

# Increase memory allocation
--sglang-mem-fraction-static 0.85

# Reduce batch size
--rollout-batch-size 16
```

Issue: Weight Sync Timeout


Symptoms: Training hangs after rollout

Solutions:

```bash
# Increase sync interval
--update-weights-interval 5

# Use colocated mode (no network transfer)
--colocate
```

Issue: OOM During Training


Symptoms: CUDA OOM in backward pass

Solutions:

```bash
# Enable gradient checkpointing
--recompute-activations

# Reduce micro-batch size
--micro-batch-size 1

# Enable sequence parallelism
--sequence-parallel
```

Issue: Slow Data Loading


Symptoms: GPU idle during data fetch

Solutions:

```bash
# Increase data workers
--num-data-workers 4

# Use streaming dataset
--streaming-data
```

---

Supported Models


| Model Family | Configurations |
| --- | --- |
| GLM | GLM-4.5, GLM-4.6, GLM-4.7, GLM-Z1-9B |
| Qwen | Qwen3 (4B, 8B, 30B-A3B), Qwen3-MoE, Qwen2.5 |
| DeepSeek | V3, V3.1, R1 |
| Llama | Llama 3 (8B, 70B) |
| Others | Kimi K2, Moonlight-16B |

Each model has pre-configured scripts in scripts/models/.


Advanced Topics


Co-location Mode


Share GPUs between training and inference to reduce memory:

```bash
python train.py \
    --colocate \
    --actor-num-gpus-per-node 8 \
    --sglang-mem-fraction-static 0.4 \
    ${MODEL_ARGS[@]}
```

Custom Reward Model



custom_rm.py


```python
class CustomRewardModel:
    def __init__(self, model_path):
        self.model = load_model(model_path)

    def compute_reward(self, prompts, responses):
        inputs = self.tokenize(prompts, responses)
        scores = self.model(inputs)
        return scores.tolist()
```

```bash
--custom-rm-path custom_rm.py
```

Multi-Task Evaluation


```bash
--eval-prompt-data aime /path/to/aime.jsonl \
--eval-prompt-data gsm8k /path/to/gsm8k.jsonl \
--n-samples-per-eval-prompt 16
```


Resources
