# verl: Volcano Engine Reinforcement Learning for LLMs
verl is a flexible, efficient, and production-ready RL training library for large language models from ByteDance's Seed team. It implements the HybridFlow framework (EuroSys 2025) and powers models like Doubao-1.5-pro achieving O1-level performance on math benchmarks.
## When to Use verl
Choose verl when you need:
- Production-ready RL training at scale (tested up to 671B parameters)
- Flexibility to swap backends (FSDP ↔ Megatron-LM ↔ vLLM ↔ SGLang)
- Support for multiple RL algorithms (PPO, GRPO, RLOO, REINFORCE++, DAPO)
- Multi-turn rollout with tool calling for agentic workflows
- Vision-language model RL training
Consider alternatives when:
- You need Megatron-native training → use slime or miles
- You want PyTorch-native abstractions with Monarch → use torchforge
- You only need simple SFT/DPO → use TRL or Axolotl
## Key Features
- Training backends: FSDP, FSDP2, Megatron-LM
- Rollout engines: vLLM, SGLang, HuggingFace Transformers
- Algorithms: PPO, GRPO, DAPO, RLOO, ReMax, REINFORCE++, SPIN, SPPO
- Models: Qwen-3, Llama-3.1, DeepSeek, Gemma-2 (0.5B to 671B)
- Advanced: LoRA RL, sequence parallelism, expert parallelism, multi-turn tools
## Installation

```bash
# Option 1: pip install
pip install verl[vllm]  # or verl[sglang] for SGLang backend

# Option 2: Docker (recommended for production)
docker pull verlai/verl:vllm011.latest

# Option 3: From source
git clone https://github.com/volcengine/verl.git
cd verl && pip install -e .[vllm,math]
```
## Quick Start: GRPO Training

```bash
python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=~/data/gsm8k/train.parquet \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-7B \
    actor_rollout_ref.rollout.n=8 \
    actor_rollout_ref.actor.use_kl_loss=True \
    trainer.n_gpus_per_node=8
```
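With `adv_estimator=grpo`, no critic is trained: the `n=8` samples drawn for each prompt form a group, and each sample's advantage is its reward standardized against that group. A minimal illustrative sketch of the idea (not verl's internal implementation):

```python
import statistics

def grpo_advantages(group_rewards, eps=1e-6):
    """Group-relative advantages: standardize each sampled response's
    reward against the mean/std of its own prompt's group, so no
    learned value model is required."""
    mean = statistics.mean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]

# One prompt, rollout.n=8, binary correctness rewards:
advs = grpo_advantages([1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0])
# Correct samples receive positive advantages, incorrect ones negative.
```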
## Core Architecture
verl uses a HybridFlow programming model separating control flow from computation:
```
┌─────────────────────────────────────────────────────────┐
│  Single-Process Controller (Ray)                        │
│  - Orchestrates: rollout → reward → train → sync        │
└─────────────────────┬───────────────────────────────────┘
                      │
┌─────────────────────▼───────────────────────────────────┐
│  Multi-Process Workers                                   │
│  ├── ActorRolloutRefWorker (policy + generation)         │
│  ├── CriticWorker (value estimation, PPO only)           │
│  └── RewardManager (model-based or rule-based rewards)   │
└─────────────────────────────────────────────────────────┘
```
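The control flow above can be sketched as a plain driver loop. The class and method names below are illustrative stubs, not verl's actual API; in verl, each coarse-grained call fans out to a distributed Ray worker group:

```python
# Illustrative stubs for the controller/worker split (hypothetical names).
class ActorRolloutRef:
    """Stands in for the worker group holding the policy and rollout engine."""
    def generate(self, prompts):
        return [{"prompt": p, "response": f"<resp to {p}>"} for p in prompts]

    def update(self, rollouts, rewards):
        return {"mean_reward": sum(rewards) / len(rewards)}

    def sync_weights(self):
        pass  # push updated actor weights into the rollout engine

class RewardManager:
    def score(self, rollouts):
        return [1.0 for _ in rollouts]  # rule-based or model-based in practice

def train_step(actor, reward_manager, prompts):
    """One iteration of the controller loop: rollout -> reward -> train -> sync."""
    rollouts = actor.generate(prompts)
    rewards = reward_manager.score(rollouts)
    metrics = actor.update(rollouts, rewards)
    actor.sync_weights()
    return metrics
```

Because the controller is a single process, the whole RL algorithm reads as sequential code, while the heavy lifting stays inside the workers.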
## Workflow 1: Math Reasoning with GRPO
Use this workflow for training reasoning models on math tasks like GSM8K or MATH.
### Prerequisites Checklist

- GPU cluster with 8+ GPUs (H100 recommended)
- Dataset in parquet format with `prompt` and `reward_model` columns
- Base model from HuggingFace Hub
### Step 1: Prepare Dataset

```python
import pandas as pd

data = [
    {
        "prompt": [{"role": "user", "content": "What is 15 + 27?"}],
        "reward_model": {"ground_truth": "42"}
    },
    # ... more examples
]
df = pd.DataFrame(data)
df.to_parquet("train.parquet")
```
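A quick sanity check before launching can catch schema mistakes early. The validator below is a hypothetical helper covering only the two fields this workflow relies on (the full schema verl accepts has more optional fields; see the data documentation):

```python
def validate_record(record):
    """Check the two fields the GRPO workflow above relies on:
    a chat-style `prompt` list and a `reward_model.ground_truth`."""
    prompt = record.get("prompt")
    if not (isinstance(prompt, list) and prompt
            and all(isinstance(m, dict) and "role" in m and "content" in m
                    for m in prompt)):
        return False
    rm = record.get("reward_model")
    return isinstance(rm, dict) and "ground_truth" in rm

ok = validate_record({
    "prompt": [{"role": "user", "content": "What is 15 + 27?"}],
    "reward_model": {"ground_truth": "42"},
})  # -> True
```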
### Step 2: Define Reward Function
```python
# reward_function.py
import re

def compute_reward(responses, ground_truths):
    rewards = []
    for response, gt in zip(responses, ground_truths):
        # Extract the final answer from a \boxed{...} expression
        match = re.search(r'\\boxed\{([^}]+)\}', response)
        if match and match.group(1).strip() == gt.strip():
            rewards.append(1.0)
        else:
            rewards.append(0.0)
    return rewards
```
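When a chain of thought contains intermediate boxed expressions, matching the first `\boxed{}` can grade the wrong one. A variant that grades the last box instead (a hypothetical helper, not part of verl):

```python
import re

def extract_boxed(response):
    """Return the contents of the last \\boxed{...} in a response, or None."""
    matches = re.findall(r'\\boxed\{([^}]+)\}', response)
    return matches[-1].strip() if matches else None

extract_boxed(r"First \boxed{15+27}, so the answer is \boxed{42}.")  # -> "42"
```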
### Step 3: Create Training Config
```yaml
# config/grpo_math.yaml
algorithm:
  adv_estimator: grpo
  gamma: 1.0
  lam: 1.0
data:
  train_files: /path/to/train.parquet
  val_files: /path/to/val.parquet
  train_batch_size: 256
  max_prompt_length: 512
  max_response_length: 2048
actor_rollout_ref:
  model:
    path: Qwen/Qwen2.5-7B-Instruct
  actor:
    use_kl_loss: true
    kl_loss_coef: 0.001
    ppo_mini_batch_size: 64
  rollout:
    name: vllm
    n: 8  # samples per prompt
    temperature: 0.7
    top_p: 0.95
trainer:
  total_epochs: 3
  n_gpus_per_node: 8
  save_freq: 100
```
### Step 4: Launch Training
```bash
python3 -m verl.trainer.main_ppo \
    --config-path config \
    --config-name grpo_math \
    trainer.experiment_name=grpo_math_qwen7b
```
### Step 5: Monitor and Validate
- Check WandB/TensorBoard for loss curves
- Verify reward is increasing over steps
- Run evaluation on held-out test set
## Workflow 2: PPO with Critic Model
Use this workflow when you need value-based advantage estimation (GAE).
### Key Differences from GRPO
- Requires separate critic model
- Uses Generalized Advantage Estimation (GAE)
- Better for tasks with dense rewards
### Configuration

```yaml
algorithm:
  adv_estimator: gae  # Use GAE instead of GRPO
  gamma: 0.99
  lam: 0.95
critic:
  model:
    path: Qwen/Qwen2.5-7B-Instruct  # Can be the same as or different from the actor
  ppo_mini_batch_size: 64
actor_rollout_ref:
  actor:
    use_kl_loss: true
    kl_loss_coef: 0.02
    clip_ratio: 0.2  # PPO clipping
```
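The `gamma`/`lam` pair above parameterizes GAE. For reference, the recursion in plain Python (verl computes this over batched tensors; this sketch only shows the math):

```python
def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation.

    `values` carries one extra entry: the bootstrap value V(s_T) after
    the last step (0.0 for terminal states).
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

# Sparse terminal reward with zero value estimates: the advantage
# decays geometrically (by gamma * lam) toward earlier steps.
adv = compute_gae([0.0, 0.0, 1.0], [0.0, 0.0, 0.0, 0.0])
```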
### Launch with Critic
```bash
python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=gae \
    critic.model.path=Qwen/Qwen2.5-7B-Instruct \
    trainer.n_gpus_per_node=8
```
## Workflow 3: Large-Scale Training with Megatron
Use this workflow for models >70B parameters or when you need expert parallelism.
### Prerequisites

- Install the Megatron-LM bridge: `pip install mbridge`
- Convert the model to Megatron format
- Multi-node cluster with NVLink/InfiniBand
### Configuration for 70B+ Models

```yaml
actor_rollout_ref:
  model:
    path: /path/to/megatron/checkpoint
    backend: megatron
  actor:
    strategy: megatron
    tensor_model_parallel_size: 8
    pipeline_model_parallel_size: 2
  rollout:
    name: vllm
    tensor_parallel_size: 8
```
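A quick arithmetic check on the layout above: each model replica spans `tensor_model_parallel_size × pipeline_model_parallel_size` GPUs (times expert parallel size, if used), and the total GPU count must be a multiple of that so data parallelism divides evenly. An illustrative helper, not a verl API:

```python
def data_parallel_size(n_nodes, gpus_per_node, tp, pp, ep=1):
    """Number of data-parallel replicas for a Megatron-style layout.
    Raises if the world size does not divide evenly."""
    world = n_nodes * gpus_per_node
    per_replica = tp * pp * ep
    if world % per_replica:
        raise ValueError(f"{world} GPUs not divisible by tp*pp*ep={per_replica}")
    return world // per_replica

# 4 nodes x 8 GPUs with tp=8, pp=2 -> 2 data-parallel replicas
dp = data_parallel_size(4, 8, tp=8, pp=2)
```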
### Launch Multi-Node
```bash
# On head node
ray start --head --port=6379

# On worker nodes
ray start --address='head_ip:6379'

# Launch training
python3 -m verl.trainer.main_ppo \
    trainer.nnodes=4 \
    trainer.n_gpus_per_node=8
```

---

## Configuration Reference
### Algorithm Selection

| Algorithm | Use Case |
|---|---|
| GRPO | Critic-free, math/reasoning |
| PPO/GAE | Dense rewards, value estimation |
| REINFORCE++ | Variance reduction |
| RLOO | Leave-one-out baseline |
| ReMax | Maximum reward baseline |
| OPO | Optimal policy optimization |
### Key Parameters

```yaml
# Rollout parameters
actor_rollout_ref.rollout.n: 8                # Samples per prompt
actor_rollout_ref.rollout.temperature: 0.7    # Sampling temperature
actor_rollout_ref.rollout.top_p: 0.95         # Nucleus sampling

# Training parameters
actor_rollout_ref.actor.lr: 1e-6              # Learning rate
actor_rollout_ref.actor.ppo_mini_batch_size: 64
actor_rollout_ref.actor.clip_ratio: 0.2       # PPO clip range

# KL control
actor_rollout_ref.actor.use_kl_loss: true
actor_rollout_ref.actor.kl_loss_coef: 0.001
algorithm.kl_ctrl.target_kl: 0.1              # For adaptive KL control
```
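`target_kl` drives an adaptive schedule for the KL coefficient in the spirit of PPO's original adaptive penalty: grow the coefficient when the measured KL overshoots the target, shrink it otherwise. A sketch of that scheme (not verl's exact update rule):

```python
class AdaptiveKLController:
    """Nudge the KL penalty coefficient toward a target KL divergence."""

    def __init__(self, init_kl_coef=0.001, target_kl=0.1, horizon=10000):
        self.kl_coef = init_kl_coef
        self.target_kl = target_kl
        self.horizon = horizon

    def update(self, observed_kl, n_steps):
        # Proportional error, clipped for stability.
        error = max(-0.2, min(0.2, observed_kl / self.target_kl - 1.0))
        self.kl_coef *= 1.0 + error * n_steps / self.horizon
        return self.kl_coef

ctrl = AdaptiveKLController()
ctrl.update(observed_kl=0.5, n_steps=256)  # KL too high -> coefficient grows
```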
---

## Common Issues and Solutions
### Issue: OOM During Rollout

Symptoms: CUDA out of memory during the generation phase

Solutions:

```yaml
# Reduce batch size
actor_rollout_ref.rollout.log_prob_micro_batch_size: 4

# Enable gradient checkpointing
actor_rollout_ref.model.enable_gradient_checkpointing: true

# Use FSDP2 with CPU offloading
actor_rollout_ref.actor.strategy: fsdp2
actor_rollout_ref.actor.fsdp_config.offload_policy: true
```
### Issue: Training Instability
Symptoms: Loss spikes, reward collapse

Solutions:

```yaml
# Reduce learning rate
actor_rollout_ref.actor.lr: 5e-7

# Increase KL penalty
actor_rollout_ref.actor.kl_loss_coef: 0.01

# Enable gradient clipping
actor_rollout_ref.actor.max_grad_norm: 1.0
```
### Issue: Slow Weight Sync
Symptoms: Long pauses between rollout and training

Solutions:

```bash
# Use FSDP2 for faster resharding
actor_rollout_ref.actor.strategy=fsdp2

# Enable async weight transfer
trainer.async_weight_update=true
```
### Issue: vLLM Version Mismatch
Symptoms: Import errors or generation failures

Solution: Use compatible versions:

```bash
pip install "vllm>=0.8.5,<=0.12.0"  # Avoid vLLM 0.7.x (known bugs)
```
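A defensive guard at startup can fail fast on an incompatible install. This helper is hypothetical; it assumes a plain `X.Y.Z` version string and relies only on `vllm.__version__` existing:

```python
def version_in_range(version, lo=(0, 8, 5), hi=(0, 12, 0)):
    """True if an 'X.Y.Z' version string falls within [lo, hi]."""
    parts = tuple(int(p) for p in version.split(".")[:3])
    return lo <= parts <= hi

# At startup:
# import vllm
# assert version_in_range(vllm.__version__), "incompatible vLLM version"
```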
---
## Advanced Topics
### Multi-Turn Tool Calling

See `references/multi-turn.md` for agentic workflows with tool use.
### Vision-Language Models

```yaml
actor_rollout_ref:
  model:
    path: Qwen/Qwen2.5-VL-7B-Instruct
  rollout:
    name: vllm
    enable_vision: true
```
### LoRA Training
```yaml
actor_rollout_ref:
  actor:
    lora:
      enabled: true
      r: 16
      alpha: 32
      target_modules: ["q_proj", "v_proj"]
```
## Resources
- Documentation: https://verl.readthedocs.io/
- Paper: https://arxiv.org/abs/2409.19256
- GitHub: https://github.com/volcengine/verl
- Recipes: https://github.com/verl-project/verl-recipe (DAPO, GSPO, etc.)
- Community: Slack at verl-project