verl-rl-training


verl: Volcano Engine Reinforcement Learning for LLMs


verl is a flexible, efficient, and production-ready RL training library for large language models from ByteDance's Seed team. It implements the HybridFlow framework (EuroSys 2025) and powers models such as Doubao-1.5-pro, which achieves o1-level performance on math benchmarks.

When to Use verl


Choose verl when you need:
  • Production-ready RL training at scale (tested up to 671B parameters)
  • Flexibility to swap backends (FSDP ↔ Megatron-LM ↔ vLLM ↔ SGLang)
  • Support for multiple RL algorithms (PPO, GRPO, RLOO, REINFORCE++, DAPO)
  • Multi-turn rollout with tool calling for agentic workflows
  • Vision-language model RL training
Consider alternatives when:
  • You need Megatron-native training → use slime or miles
  • You want PyTorch-native abstractions with Monarch → use torchforge
  • You only need simple SFT/DPO → use TRL or Axolotl

Key Features


  • Training backends: FSDP, FSDP2, Megatron-LM
  • Rollout engines: vLLM, SGLang, HuggingFace Transformers
  • Algorithms: PPO, GRPO, DAPO, RLOO, ReMax, REINFORCE++, SPIN, SPPO
  • Models: Qwen-3, Llama-3.1, DeepSeek, Gemma-2 (0.5B to 671B)
  • Advanced: LoRA RL, sequence parallelism, expert parallelism, multi-turn tools

Installation



Option 1: pip install


pip install "verl[vllm]"  # or "verl[sglang]" for the SGLang backend

Option 2: Docker (recommended for production)


docker pull verlai/verl:vllm011.latest

Option 3: From source


git clone https://github.com/volcengine/verl.git && cd verl && pip install -e ".[vllm,math]"

Quick Start: GRPO Training


bash
python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=~/data/gsm8k/train.parquet \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-7B \
    actor_rollout_ref.rollout.n=8 \
    actor_rollout_ref.actor.use_kl_loss=True \
    trainer.n_gpus_per_node=8

Core Architecture


verl uses a HybridFlow programming model separating control flow from computation:
┌─────────────────────────────────────────────────────────┐
│ Single-Process Controller (Ray)                         │
│ - Orchestrates: rollout → reward → train → sync        │
└─────────────────────┬───────────────────────────────────┘
┌─────────────────────▼───────────────────────────────────┐
│ Multi-Process Workers                                   │
│ ├── ActorRolloutRefWorker (policy + generation)        │
│ ├── CriticWorker (value estimation, PPO only)          │
│ └── RewardManager (model-based or rule-based rewards)  │
└─────────────────────────────────────────────────────────┘

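The orchestration above can be sketched as a plain-Python control loop. This is a conceptual sketch only: in verl the controller dispatches each step to a Ray worker group, and the function names below are illustrative, not verl's actual API.

```python
# Conceptual sketch of the HybridFlow control loop. The single-process
# controller owns the algorithm logic; in verl each step below is
# dispatched to a distributed worker group via Ray.

def rl_training_step(prompts, rollout_worker, reward_manager, trainer_worker):
    # 1. Rollout: generate responses for each prompt with the current policy
    responses = rollout_worker(prompts)
    # 2. Reward: score responses (rule-based or model-based)
    rewards = reward_manager(prompts, responses)
    # 3. Train: update the policy from (prompt, response, reward) triples;
    #    the trainer then syncs updated weights back to the rollout engine
    metrics = trainer_worker(prompts, responses, rewards)
    return metrics

# Toy stand-ins for the worker groups (illustrative only)
step_metrics = rl_training_step(
    prompts=["What is 15 + 27?"],
    rollout_worker=lambda ps: [p + " -> 42" for p in ps],
    reward_manager=lambda ps, rs: [1.0 for _ in rs],
    trainer_worker=lambda ps, rs, ws: {"mean_reward": sum(ws) / len(ws)},
)
print(step_metrics)  # {'mean_reward': 1.0}
```

Because the control flow lives in one process, swapping the advantage estimator or reward source changes only the controller logic, not the distributed workers.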

Workflow 1: Math Reasoning with GRPO


Use this workflow for training reasoning models on math tasks like GSM8K or MATH.

Prerequisites Checklist


  • GPU cluster with 8+ GPUs (H100 recommended)
  • Dataset in parquet format with prompt and reward_model columns
  • Base model from HuggingFace Hub

Step 1: Prepare Dataset


python
import pandas as pd

data = [
    {
        "prompt": [{"role": "user", "content": "What is 15 + 27?"}],
        "reward_model": {"ground_truth": "42"}
    },
    # ... more examples
]
df = pd.DataFrame(data)
df.to_parquet("train.parquet")
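Before writing the parquet file, it can help to sanity-check the record schema. This is a minimal hand-rolled check, not a verl utility; the exact required columns can vary with the reward manager you configure.

```python
def validate_record(record):
    # prompt must be a non-empty chat-style list of {"role", "content"} messages
    assert isinstance(record["prompt"], list) and record["prompt"]
    for msg in record["prompt"]:
        assert {"role", "content"} <= set(msg)
    # reward_model must carry the ground truth for rule-based scoring
    assert "ground_truth" in record["reward_model"]

data = [
    {
        "prompt": [{"role": "user", "content": "What is 15 + 27?"}],
        "reward_model": {"ground_truth": "42"},
    },
]
for record in data:
    validate_record(record)
print(f"{len(data)} records OK")
```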

Step 2: Define Reward Function



reward_function.py


import re

def compute_reward(responses, ground_truths):
    rewards = []
    for response, gt in zip(responses, ground_truths):
        # Extract the final \boxed{...} answer from the response.
        # Note the escaped backslash: r'\boxed' would be a word-boundary
        # assertion, not a literal backslash, and would never match.
        match = re.search(r'\\boxed\{([^}]+)\}', response)
        if match and match.group(1).strip() == gt.strip():
            rewards.append(1.0)
        else:
            rewards.append(0.0)
    return rewards

# e.g. compute_reward([r"... so the answer is \boxed{42}"], ["42"]) -> [1.0]

Step 3: Create Training Config



config/grpo_math.yaml


algorithm:
  adv_estimator: grpo
  gamma: 1.0
  lam: 1.0

data:
  train_files: /path/to/train.parquet
  val_files: /path/to/val.parquet
  train_batch_size: 256
  max_prompt_length: 512
  max_response_length: 2048

actor_rollout_ref:
  model:
    path: Qwen/Qwen2.5-7B-Instruct
  actor:
    use_kl_loss: true
    kl_loss_coef: 0.001
    ppo_mini_batch_size: 64
  rollout:
    name: vllm
    n: 8  # samples per prompt
    temperature: 0.7
    top_p: 0.95

trainer:
  total_epochs: 3
  n_gpus_per_node: 8
  save_freq: 100

Step 4: Launch Training


bash
python3 -m verl.trainer.main_ppo \
    --config-path config \
    --config-name grpo_math \
    trainer.experiment_name=grpo_math_qwen7b

Step 5: Monitor and Validate


  • Check WandB/TensorBoard for loss curves
  • Verify reward is increasing over steps
  • Run evaluation on held-out test set


Workflow 2: PPO with Critic Model


Use this workflow when you need value-based advantage estimation (GAE).

Key Differences from GRPO


  • Requires separate critic model
  • Uses Generalized Advantage Estimation (GAE)
  • Better for tasks with dense rewards
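For reference, GAE can be sketched in a few lines of plain Python. This is the standard textbook formulation; verl computes the same quantity over batched token-level tensors.

```python
def compute_gae(rewards, values, gamma=0.99, lam=0.95, last_value=0.0):
    """Generalized Advantage Estimation over one trajectory.

    rewards[t] is the reward at step t; values[t] is the critic's
    estimate V(s_t); last_value bootstraps V(s_T) at the horizon.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of future TD residuals
        gae = delta + gamma * lam * gae
        advantages[t] = gae
        next_value = values[t]
    return advantages

# With gamma=1 and lam=1 the advantage reduces to (return - value)
adv = compute_gae([0.0, 0.0, 1.0], [0.5, 0.5, 0.5], gamma=1.0, lam=1.0)
print(adv)  # [0.5, 0.5, 0.5]
```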

Configuration


yaml
algorithm:
  adv_estimator: gae  # Use GAE instead of GRPO
  gamma: 0.99
  lam: 0.95

critic:
  model:
    path: Qwen/Qwen2.5-7B-Instruct  # Can be same or different from actor
  ppo_mini_batch_size: 64

actor_rollout_ref:
  actor:
    use_kl_loss: true
    kl_loss_coef: 0.02
    clip_ratio: 0.2  # PPO clipping
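The clip_ratio above enters PPO's clipped surrogate objective. A scalar-per-token sketch in plain Python (real implementations vectorize this over token tensors):

```python
import math

def ppo_clip_objective(logp, old_logp, advantage, clip_ratio=0.2):
    # Importance ratio between the new and the behavior (old) policy
    ratio = math.exp(logp - old_logp)
    # Clamp the ratio into [1 - clip_ratio, 1 + clip_ratio]
    clipped = max(min(ratio, 1.0 + clip_ratio), 1.0 - clip_ratio)
    # Pessimistic (min) of the unclipped and clipped surrogate; the
    # training loss is the negative of this quantity
    return min(ratio * advantage, clipped * advantage)

# A ratio of 1.5 with positive advantage is clipped at 1 + 0.2
print(ppo_clip_objective(math.log(1.5), 0.0, advantage=1.0))  # 1.2
```

The clipping removes the incentive to push the policy more than clip_ratio away from the rollout policy in a single update.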

Launch with Critic


bash
python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=gae \
    critic.model.path=Qwen/Qwen2.5-7B-Instruct \
    trainer.n_gpus_per_node=8


Workflow 3: Large-Scale Training with Megatron


Use this workflow for models >70B parameters or when you need expert parallelism.

Prerequisites


  • Install the Megatron-LM bridge: pip install mbridge
  • Convert the model to Megatron format
  • Multi-node cluster with NVLink/InfiniBand

Configuration for 70B+ Models


yaml
actor_rollout_ref:
  model:
    path: /path/to/megatron/checkpoint
    backend: megatron
  actor:
    strategy: megatron
    tensor_model_parallel_size: 8
    pipeline_model_parallel_size: 2
  rollout:
    name: vllm
    tensor_parallel_size: 8

Launch Multi-Node



On head node


ray start --head --port=6379

On worker nodes


ray start --address='head_ip:6379'

Launch training


python3 -m verl.trainer.main_ppo \
    trainer.nnodes=4 \
    trainer.n_gpus_per_node=8

---

Configuration Reference


Algorithm Selection


| Algorithm   | adv_estimator       | Use Case                        |
|-------------|---------------------|---------------------------------|
| GRPO        | grpo                | Critic-free, math/reasoning     |
| PPO/GAE     | gae                 | Dense rewards, value estimation |
| REINFORCE++ | reinforce_plus_plus | Variance reduction              |
| RLOO        | rloo                | Leave-one-out baseline          |
| ReMax       | remax               | Maximum reward baseline         |
| OPO         | opo                 | Optimal policy optimization     |
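For the critic-free estimators, the group-relative idea behind GRPO is simple enough to show inline. This is a sketch of the standard formulation: each of the n rollouts of a prompt is scored against the statistics of its own group, so no value model is needed.

```python
def grpo_advantages(group_rewards, eps=1e-6):
    """Group-relative advantages: normalize each sample's reward by the
    mean and std of its prompt's rollout group (no critic needed)."""
    n = len(group_rewards)
    mean = sum(group_rewards) / n
    var = sum((r - mean) ** 2 for r in group_rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in group_rewards]

# n=8 rollouts of one prompt: correct answers stand out from the group
advs = grpo_advantages([1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0])
print([round(a, 2) for a in advs])
```

This is why actor_rollout_ref.rollout.n matters for GRPO: with n=1 every group is degenerate and all advantages collapse to zero.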

Key Parameters



Rollout parameters


actor_rollout_ref.rollout.n: 8              # Samples per prompt
actor_rollout_ref.rollout.temperature: 0.7  # Sampling temperature
actor_rollout_ref.rollout.top_p: 0.95       # Nucleus sampling

Training parameters


actor_rollout_ref.actor.lr: 1e-6                 # Learning rate
actor_rollout_ref.actor.ppo_mini_batch_size: 64
actor_rollout_ref.actor.clip_ratio: 0.2          # PPO clip range

KL control


actor_rollout_ref.actor.use_kl_loss: true
actor_rollout_ref.actor.kl_loss_coef: 0.001
algorithm.kl_ctrl.target_kl: 0.1  # For adaptive KL control
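use_kl_loss adds a per-token penalty that keeps the policy near the reference model. A plain-Python sketch of the commonly used low-variance "k3" KL estimator (which exact estimator applies depends on the configured KL penalty type, so treat this as illustrative):

```python
import math

def k3_kl(logp, ref_logp):
    """Low-variance per-token estimate of KL(pi || pi_ref):
    k3 = exp(ref - logp) - (ref - logp) - 1, which is always >= 0."""
    log_ratio = ref_logp - logp
    return math.exp(log_ratio) - log_ratio - 1.0

def kl_loss(logps, ref_logps, kl_loss_coef=0.001):
    # Mean per-token penalty, scaled by the configured coefficient
    per_token = [k3_kl(lp, rlp) for lp, rlp in zip(logps, ref_logps)]
    return kl_loss_coef * sum(per_token) / len(per_token)

# Identical policy and reference log-probs => zero penalty
print(kl_loss([-1.0, -2.0], [-1.0, -2.0]))  # 0.0
```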

---

Common Issues and Solutions


Issue: OOM During Rollout


Symptoms: CUDA out of memory during generation phase
Solutions:

Reduce batch size


actor_rollout_ref.rollout.log_prob_micro_batch_size: 4

Enable gradient checkpointing


actor_rollout_ref.model.enable_gradient_checkpointing: true

Use FSDP2 with CPU offloading


actor_rollout_ref.actor.strategy: fsdp2
actor_rollout_ref.actor.fsdp_config.offload_policy: true

Issue: Training Instability


Symptoms: Loss spikes, reward collapse
Solutions:

Reduce learning rate


actor_rollout_ref.actor.lr: 5e-7

Increase KL penalty


actor_rollout_ref.actor.kl_loss_coef: 0.01

Enable gradient clipping


actor_rollout_ref.actor.max_grad_norm: 1.0

Issue: Slow Weight Sync


Symptoms: Long pauses between rollout and training
Solutions:

Use FSDP2 for faster resharding


actor_rollout_ref.actor.strategy=fsdp2

Enable async weight transfer


trainer.async_weight_update=true

Issue: vLLM Version Mismatch


Symptoms: Import errors or generation failures
Solution: Use compatible versions:
bash
pip install "vllm>=0.8.5,<=0.12.0"

Avoid vLLM 0.7.x (known bugs)



---


Advanced Topics


Multi-Turn Tool Calling


See references/multi-turn.md for agentic workflows with tool use.

Vision-Language Models


yaml
actor_rollout_ref:
  model:
    path: Qwen/Qwen2.5-VL-7B-Instruct
  rollout:
    name: vllm
    enable_vision: true

LoRA Training


yaml
actor_rollout_ref:
  actor:
    lora:
      enabled: true
      r: 16
      alpha: 32
      target_modules: ["q_proj", "v_proj"]

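The rank r controls the adapter size. As a rough illustration of why LoRA RL is cheap, the following counts trainable parameters for the config above, assuming Qwen2.5-7B-style shapes (hidden size 3584, GQA value projections of width 512, 28 layers). These dimensions are illustrative assumptions; check your model's config.

```python
def lora_param_count(in_dim, out_dim, r):
    # A LoRA adapter factors the weight update as B @ A,
    # where A is (r, in_dim) and B is (out_dim, r)
    return r * in_dim + out_dim * r

# Assumed Qwen2.5-7B-style shapes (illustrative): 28 layers,
# q_proj 3584 -> 3584, v_proj 3584 -> 512 (grouped-query attention)
r = 16
per_layer = lora_param_count(3584, 3584, r) + lora_param_count(3584, 512, r)
total = 28 * per_layer
print(f"trainable LoRA params: {total:,}")  # ~5M, versus billions of base weights
```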

Resources
