slime-rl-training
slime: LLM Post-Training Framework for RL Scaling
slime is an LLM post-training framework from Tsinghua's THUDM team, powering GLM-4.5, GLM-4.6, and GLM-4.7. It connects Megatron-LM for training with SGLang for high-throughput rollout generation.
When to Use slime
Choose slime when you need:
- Megatron-LM native training with SGLang inference
- Custom data generation workflows with flexible data buffers
- Training GLM, Qwen3, DeepSeek V3, or Llama 3 models
- Research-grade framework with production backing (Z.ai)
Consider alternatives when:
- You need enterprise-grade stability features → use miles
- You want flexible backend swapping → use verl
- You need PyTorch-native abstractions → use torchforge
Key Features
- Training: Megatron-LM with full parallelism support (TP, PP, DP, SP)
- Rollout: SGLang-based high-throughput generation with router
- Data Buffer: Flexible prompt management and sample storage
- Models: GLM-4.x, Qwen3, DeepSeek V3/R1, Llama 3
Architecture Overview
```
┌─────────────────────────────────────────────────────────┐
│                       Data Buffer                       │
│  - Prompt initialization and management                 │
│  - Custom data generation and filtering                 │
│  - Rollout sample storage                               │
└─────────────┬───────────────────────────┬───────────────┘
              │                           │
┌─────────────▼───────────┐ ┌─────────────▼───────────────┐
│ Training (Megatron-LM)  │ │ Rollout (SGLang + Router)   │
│ - Actor model training  │ │ - Response generation       │
│ - Critic (optional)     │ │ - Reward/verifier output    │
│ - Weight sync to rollout│ │ - Multi-turn support        │
└─────────────────────────┘ └─────────────────────────────┘
```

Installation
Recommended: Docker

```bash
docker pull slimerl/slime:latest
docker run --rm --gpus all --ipc=host --shm-size=16g \
  -it slimerl/slime:latest /bin/bash

# Inside the container
cd /root/slime && pip install -e . --no-deps
```

From Source
```bash
git clone https://github.com/THUDM/slime.git
cd slime
pip install -r requirements.txt
pip install -e .
```

Quick Start: GRPO Training
```bash
# Source model configuration
source scripts/models/qwen3-4B.sh

# Launch training
python train.py \
  --actor-num-nodes 1 \
  --actor-num-gpus-per-node 4 \
  --rollout-num-gpus 4 \
  --advantage-estimator grpo \
  --use-kl-loss --kl-loss-coef 0.001 \
  --rollout-batch-size 32 \
  --n-samples-per-prompt 8 \
  --global-batch-size 256 \
  --num-rollout 3000 \
  --prompt-data /path/to/data.jsonl \
  ${MODEL_ARGS[@]} ${CKPT_ARGS[@]}
```

---

Workflow 1: Standard GRPO Training
Use this workflow for training reasoning models with group-relative advantages.
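The "group-relative" idea can be sketched in a few lines. This is a minimal illustration of the concept, not slime's internal implementation: the n samples drawn for one prompt form a group, and each sample's advantage is its reward normalized by the group's mean and standard deviation.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: normalize each reward by its group's mean and std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# One prompt with n_samples_per_prompt = 4 sampled responses:
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Correct responses get positive advantages, incorrect ones negative, and the group advantages sum to zero.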
Prerequisites Checklist
- Docker environment or Megatron-LM + SGLang installed
- Model checkpoint (HuggingFace or Megatron format)
- Training data in JSONL format
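The JSONL data requirement above is easy to sanity-check before launch. A hedged sketch (the key names match the `--input-key`/`--label-key` values used in this workflow; `validate_jsonl_lines` is an illustrative helper, not part of slime):

```python
import json

def validate_jsonl_lines(lines, input_key="prompt", label_key="label"):
    """Check that every line parses as JSON and carries the expected keys."""
    for i, line in enumerate(lines, 1):
        record = json.loads(line)
        if input_key not in record or label_key not in record:
            raise ValueError(f"line {i}: expected {input_key!r} and {label_key!r}")
    return True

# Typical use before training:
# validate_jsonl_lines(open("/path/to/train.jsonl"))
```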
Step 1: Prepare Data
```python
# data.jsonl format
{"prompt": "What is 2 + 2?", "label": "4"}
{"prompt": "Solve: 3x = 12", "label": "x = 4"}
```

Or with chat format:

```python
{
  "prompt": [
    {"role": "system", "content": "You are a math tutor."},
    {"role": "user", "content": "What is 15 + 27?"}
  ],
  "label": "42"
}
```

Step 2: Configure Model
Choose a pre-configured model script:

```bash
# List available models
ls scripts/models/
# glm4-9B.sh, qwen3-4B.sh, qwen3-30B-A3B.sh, deepseek-v3.sh, llama3-8B.sh, ...

# Source your model
source scripts/models/qwen3-4B.sh
```

Step 3: Launch Training
```bash
python train.py \
  --actor-num-nodes 1 \
  --actor-num-gpus-per-node 8 \
  --rollout-num-gpus 8 \
  --advantage-estimator grpo \
  --use-kl-loss \
  --kl-loss-coef 0.001 \
  --prompt-data /path/to/train.jsonl \
  --input-key prompt \
  --label-key label \
  --apply-chat-template \
  --rollout-batch-size 32 \
  --n-samples-per-prompt 8 \
  --global-batch-size 256 \
  --num-rollout 3000 \
  --save-interval 100 \
  --eval-interval 50 \
  ${MODEL_ARGS[@]}
```

Step 4: Monitor Training
- Check TensorBoard: `tensorboard --logdir outputs/`
- Verify reward curves are increasing
- Monitor GPU utilization across nodes
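Beyond eyeballing the dashboard, "reward curves are increasing" can be checked mechanically. A hedged sketch: fit a least-squares slope to the logged mean rewards (the reward values here are illustrative, not from a real run):

```python
def reward_trend(rewards):
    """Least-squares slope of reward vs. step; positive means improving."""
    n = len(rewards)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(rewards) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, rewards))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Illustrative reward history from a healthy run:
slope = reward_trend([0.12, 0.18, 0.25, 0.31, 0.40])
```

A flat or negative slope over many rollouts usually means the reward function or KL coefficient needs attention.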
Workflow 2: Asynchronous Training
Use async mode for higher throughput by overlapping rollout and training.
When to Use Async
- Large models with long generation times
- High GPU idle time in synchronous mode
- Sufficient memory for buffering
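Conceptually, async mode decouples rollout generation from training through a bounded buffer. A minimal sketch of that producer/consumer pattern (this models the role of `--async-buffer-size`; it is not slime's implementation):

```python
import asyncio

async def rollout_worker(queue):
    """Produce rollouts; a full buffer applies backpressure to generation."""
    for step in range(6):
        await asyncio.sleep(0)              # stand-in for generation
        await queue.put(f"rollout-{step}")
    await queue.put(None)                   # sentinel: no more rollouts

async def trainer(queue, trained):
    """Consume buffered rollouts as soon as they are available."""
    while (batch := await queue.get()) is not None:
        await asyncio.sleep(0)              # stand-in for an optimizer step
        trained.append(batch)

async def main(buffer_size=4):
    queue = asyncio.Queue(maxsize=buffer_size)   # plays the role of --async-buffer-size
    trained = []
    await asyncio.gather(rollout_worker(queue), trainer(queue, trained))
    return trained

trained = asyncio.run(main())
```

The buffer size trades memory for overlap: a larger buffer hides more generation latency but lets training consume staler rollouts.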
Launch Async Training
```bash
python train_async.py \
  --actor-num-nodes 1 \
  --actor-num-gpus-per-node 8 \
  --rollout-num-gpus 8 \
  --advantage-estimator grpo \
  --async-buffer-size 4 \
  --prompt-data /path/to/train.jsonl \
  ${MODEL_ARGS[@]}
```

Async-Specific Parameters
```bash
--async-buffer-size 4         # Number of rollouts to buffer
--update-weights-interval 2   # Sync weights every N rollouts
```

Workflow 3: Multi-Turn Agentic Training
Use this workflow for training agents with tool use or multi-step reasoning.
Prerequisites
- Custom generate function for multi-turn logic
- Tool/environment interface
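The tool interface can be lightweight. A hedged sketch of a tool-call extractor, assuming a hypothetical `<tool>{...}</tool>` JSON convention for model output (the helper name `extract_tool_call` matches the one used in the generate function in this workflow; the tag format is an assumption, not slime's):

```python
import json
import re

# Hypothetical convention: the model emits <tool>{"name": ..., "args": ...}</tool>
TOOL_RE = re.compile(r"<tool>(.*?)</tool>", re.DOTALL)

def extract_tool_call(response):
    """Return the parsed tool-call dict, or None if the turn is final."""
    match = TOOL_RE.search(response)
    if match is None:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None

call = extract_tool_call('<tool>{"name": "search", "args": {"q": "slime"}}</tool>')
```

Whatever convention you choose, returning `None` for final turns is what lets the generation loop terminate.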
Step 1: Define Custom Generate Function
```python
# custom_generate.py
async def custom_generate(args, samples, evaluation=False):
    """Multi-turn generation with tool calling."""
    for sample in samples:
        conversation = sample.prompt
        for turn in range(args.max_turns):
            # Generate response
            response = await generate_single(conversation)
            # Check for tool call
            tool_call = extract_tool_call(response)
            if tool_call:
                tool_result = execute_tool(tool_call)
                conversation.append({"role": "assistant", "content": response})
                conversation.append({"role": "tool", "content": tool_result})
            else:
                break
        sample.response = response
        sample.reward = compute_reward(sample)
    return samples
```

Step 2: Launch with Custom Function
```bash
python train.py \
  --custom-generate-function-path custom_generate.py \
  --max-turns 5 \
  --prompt-data /path/to/agent_data.jsonl \
  ${MODEL_ARGS[@]}
```

See examples/search-r1/ for a complete multi-turn search example.

Configuration Reference
Three Argument Categories
slime uses three types of arguments:

1. Megatron Arguments (passed directly):

```bash
--tensor-model-parallel-size 2
--pipeline-model-parallel-size 1
--num-layers 32
--hidden-size 4096
```

2. SGLang Arguments (prefixed with `--sglang-`):

```bash
--sglang-mem-fraction-static 0.8
--sglang-context-length 8192
--sglang-log-level INFO
```

3. slime Arguments:

```bash
# Resource allocation
--actor-num-nodes 1
--actor-num-gpus-per-node 8
--rollout-num-gpus 8
--colocate                    # Share GPUs between training/inference

# Data
--prompt-data /path/to/data.jsonl
--input-key prompt
--label-key label

# Training loop
--num-rollout 3000
--rollout-batch-size 32
--n-samples-per-prompt 8
--global-batch-size 256

# Algorithm
--advantage-estimator grpo    # or: gspo, ppo, reinforce_plus_plus
--use-kl-loss
--kl-loss-coef 0.001
```

Key Constraints

rollout_batch_size × n_samples_per_prompt = global_batch_size × num_steps_per_rollout

Example: 32 × 8 = 256 × 1
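The constraint is plain arithmetic, so it can be verified before launching a run. A small sketch (the helper name is illustrative; slime performs its own validation):

```python
def steps_per_rollout(rollout_batch_size, n_samples_per_prompt, global_batch_size):
    """Derive num_steps_per_rollout and reject inconsistent configs."""
    total = rollout_batch_size * n_samples_per_prompt
    if total % global_batch_size != 0:
        raise ValueError(f"{total} samples per rollout is not divisible by "
                         f"global_batch_size={global_batch_size}")
    return total // global_batch_size

# The example above: 32 × 8 = 256 × 1
assert steps_per_rollout(32, 8, 256) == 1
```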
Data Buffer System
slime's data buffer enables flexible data management:
Basic Data Source
```python
class RolloutDataSource:
    def get_samples(self, num_samples):
        """Fetch prompts from dataset."""
        return self.dataset.sample(num_samples)

    def add_samples(self, samples):
        """Called after generation (no-op by default)."""
        pass
```

Buffered Data Source (Off-Policy)
```python
class RolloutDataSourceWithBuffer(RolloutDataSource):
    def __init__(self):
        self.buffer = []

    def add_samples(self, samples):
        """Store generated samples for reuse."""
        self.buffer.extend(samples)

    def buffer_filter(self, args, buffer, num_samples):
        """Custom selection logic (prioritized, stratified, etc.)."""
        return select_best(buffer, num_samples)
```

Common Issues and Solutions
Issue: SGLang Engine Crash
Symptoms: Inference engine dies mid-training

Solutions:

```bash
# Enable fault tolerance
--use-fault-tolerance

# Increase memory allocation
--sglang-mem-fraction-static 0.85

# Reduce batch size
--rollout-batch-size 16
```

Issue: Weight Sync Timeout
Symptoms: Training hangs after rollout

Solutions:

```bash
# Increase sync interval
--update-weights-interval 5

# Use colocated mode (no network transfer)
--colocate
```

Issue: OOM During Training
Symptoms: CUDA OOM in backward pass

Solutions:

```bash
# Enable gradient checkpointing
--recompute-activations

# Reduce micro-batch size
--micro-batch-size 1

# Enable sequence parallelism
--sequence-parallel
```

Issue: Slow Data Loading
Symptoms: GPU idle during data fetch

Solutions:

```bash
# Increase data workers
--num-data-workers 4

# Use streaming dataset
--streaming-data
```

---

Supported Models
| Model Family | Configurations |
|---|---|
| GLM | GLM-4.5, GLM-4.6, GLM-4.7, GLM-Z1-9B |
| Qwen | Qwen3 (4B, 8B, 30B-A3B), Qwen3-MoE, Qwen2.5 |
| DeepSeek | V3, V3.1, R1 |
| Llama | Llama 3 (8B, 70B) |
| Others | Kimi K2, Moonlight-16B |
Each model has pre-configured scripts in scripts/models/.

Advanced Topics
Co-location Mode
Share GPUs between training and inference to reduce memory:

```bash
python train.py \
  --colocate \
  --actor-num-gpus-per-node 8 \
  --sglang-mem-fraction-static 0.4 \
  ${MODEL_ARGS[@]}
```

Custom Reward Model
```python
# custom_rm.py
class CustomRewardModel:
    def __init__(self, model_path):
        self.model = load_model(model_path)

    def compute_reward(self, prompts, responses):
        inputs = self.tokenize(prompts, responses)
        scores = self.model(inputs)
        return scores.tolist()
```

```bash
--custom-rm-path custom_rm.py
```

Multi-Task Evaluation
```bash
--eval-prompt-data aime /path/to/aime.jsonl \
--eval-prompt-data gsm8k /path/to/gsm8k.jsonl \
--n-samples-per-eval-prompt 16
```

Resources
- Documentation: https://thudm.github.io/slime/
- GitHub: https://github.com/THUDM/slime
- Blog: https://lmsys.org/blog/2025-07-09-slime/
- Examples: See the examples/ directory for 14+ worked examples