torchforge: PyTorch-Native Agentic RL Library
torchforge is Meta's PyTorch-native RL library that separates infrastructure concerns from algorithm concerns. It enables rapid RL research by letting you focus on algorithms while handling distributed training, inference, and weight sync automatically.
When to Use torchforge
Choose torchforge when you need:
- Clean separation between RL algorithms and infrastructure
- PyTorch-native abstractions (no Ray dependency)
- Easy algorithm experimentation (GRPO, DAPO, SAPO in ~100 lines)
- Scalable training with Monarch actor system
- Integration with TorchTitan for model parallelism
Consider alternatives when:
- You need production-ready stability → use miles or verl
- You want Megatron-native training → use slime

Note that torchforge is experimental and its APIs may change.
Key Features
- Algorithm isolation: Implement RL algorithms without touching infrastructure
- Scalability: From a single GPU to thousands via Monarch
- Modern stack: TorchTitan (training), vLLM (inference), TorchStore (weight sync)
- Built-in loss functions: GRPO, DAPO, CISPO, GSPO, SAPO
Architecture Overview
```
┌─────────────────────────────────────────────────────────┐
│              Application Layer (Your Code)              │
│  - Define reward models, loss functions, sampling       │
└─────────────────────┬───────────────────────────────────┘
                      │
┌─────────────────────▼───────────────────────────────────┐
│                    Forge API Layer                      │
│  - Episode, Group dataclasses                           │
│  - Service interfaces (async/await)                     │
└─────────────────────┬───────────────────────────────────┘
                      │
┌─────────────────────▼───────────────────────────────────┐
│              Distributed Services (Monarch)             │
│  ├── Trainer (TorchTitan FSDP)                          │
│  ├── Generator (vLLM inference)                         │
│  ├── Reference Model (frozen KL baseline)               │
│  └── Reward Actors (compute rewards)                    │
└─────────────────────────────────────────────────────────┘
```
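The data flow between these layers in one GRPO step can be sketched with stub services. This is an illustrative stand-in, not torchforge's actual API: the service classes and the method names `generate`, `forward`, and `train_step` are assumptions made for the example.

```python
import asyncio


class StubService:
    """Illustrative stand-in for a Monarch-managed remote service."""

    async def generate(self, prompt, n):
        # Stands in for vLLM rollout: n sampled responses per prompt
        return [f"{prompt} -> response {i}" for i in range(n)]

    async def forward(self, completions):
        # Stands in for the frozen reference model's logprob pass
        return [0.0] * len(completions)

    async def train_step(self, batch):
        # Stands in for a TorchTitan FSDP optimizer step; returns the loss
        return 0.5


async def grpo_step(generator, ref_model, trainer, prompt):
    completions = await generator.generate(prompt, n=4)   # rollout
    ref_logprobs = await ref_model.forward(completions)   # KL baseline
    return await trainer.train_step(                      # policy update
        {"completions": completions, "ref_logprobs": ref_logprobs}
    )


loss = asyncio.run(grpo_step(StubService(), StubService(), StubService(), "2+2?"))
print(loss)  # 0.5
```

The point of the layering is that `grpo_step` reads the same whether the services are local stubs or thousands of GPUs behind Monarch.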
Installation

```bash
conda create -n forge python=3.12
conda activate forge
```

Install (handles PyTorch nightly + dependencies), then verify:

```bash
python -c "import torch, forge, vllm; print('OK')"
```
ROCm Installation

```bash
./scripts/install_rocm.sh
```
SFT Training (2+ GPUs)

```bash
python -m apps.sft.main --config apps/sft/llama3_8b.yaml
```
GRPO Training (3+ GPUs)

```bash
python -m apps.grpo.main --config apps/grpo/qwen3_1_7b.yaml
```
Workflow 1: GRPO Training for Math Reasoning
Use this workflow for training reasoning models with group-relative advantages.
Prerequisites Checklist
Step 1: Create Configuration

```yaml
# config/grpo_math.yaml
model: "Qwen/Qwen2.5-7B-Instruct"

dataset:
  path: "openai/gsm8k"
  split: "train"
  streaming: true

training:
  batch_size: 4
  learning_rate: 1e-6
  seq_len: 4096
  dtype: bfloat16
  gradient_accumulation_steps: 4

grpo:
  n_samples: 8      # Responses per prompt
  clip_low: 0.2
  clip_high: 0.28
  beta: 0.1         # KL penalty coefficient
  temperature: 0.7

services:
  generator:
    procs: 1
    num_replicas: 1
    with_gpus: true
  trainer:
    procs: 1
    num_replicas: 1
    with_gpus: true
  ref_model:
    procs: 1
    num_replicas: 1
    with_gpus: true
```
Step 2: Define Reward Function
Built-in reward functions are in `forge.data.rewards`:

```python
from forge.data.rewards import MathReward, ThinkingReward
```

Or define your own reward function:

```python
import re

class CustomMathReward:
    def __call__(self, prompt: str, response: str, target: str) -> float:
        # Extract the \boxed{...} answer from the response
        match = re.search(r'\\boxed\{([^}]+)\}', response)
        if not match:
            return 0.0
        answer = match.group(1).strip()
        return 1.0 if answer == target else 0.0
```
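It pays to sanity-check a boxed-answer reward standalone before wiring it into training. This self-contained sketch mirrors the logic above as a plain function (the name `boxed_answer_reward` is ours, not torchforge's):

```python
import re


def boxed_answer_reward(response: str, target: str) -> float:
    """Return 1.0 when the \\boxed{...} answer matches the target exactly."""
    match = re.search(r'\\boxed\{([^}]+)\}', response)
    if not match:
        return 0.0
    return 1.0 if match.group(1).strip() == target else 0.0


print(boxed_answer_reward(r"So the result is \boxed{42}.", "42"))  # 1.0
print(boxed_answer_reward("no boxed answer here", "42"))           # 0.0
```

Note the doubled backslash: `r'\\boxed'` matches a literal `\boxed`, whereas `r'\boxed'` would be a word boundary followed by `oxed`.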
Step 3: Launch Training

```bash
python -m apps.grpo.main --config config/grpo_math.yaml
```
Step 4: Monitor Progress
Workflow 2: Custom Loss Function
Use this workflow to implement new RL algorithms.
Step 1: Create Loss Class

```python
# src/forge/losses/custom_loss.py
import torch
import torch.nn as nn

class CustomLoss(nn.Module):
    def __init__(self, clip_range: float = 0.2, beta: float = 0.1):
        super().__init__()
        self.clip_range = clip_range
        self.beta = beta

    def forward(
        self,
        logprobs: torch.Tensor,
        ref_logprobs: torch.Tensor,
        advantages: torch.Tensor,
        padding_mask: torch.Tensor,
    ) -> torch.Tensor:
        # Importance ratio between current policy and reference
        ratio = torch.exp(logprobs - ref_logprobs)
        # Clipped policy-gradient term
        clipped_ratio = torch.clamp(
            ratio,
            1 - self.clip_range,
            1 + self.clip_range,
        )
        pg_loss = -torch.min(ratio * advantages, clipped_ratio * advantages)
        # KL penalty toward the reference policy
        kl = ref_logprobs - logprobs
        # Mask out padding and average over valid tokens
        masked_loss = (pg_loss + self.beta * kl) * padding_mask
        loss = masked_loss.sum() / padding_mask.sum()
        return loss
```
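The same loss can be smoke-tested on dummy tensors before a full training run. A standalone functional sketch of the computation above (not part of torchforge):

```python
import torch


def clipped_pg_loss(logprobs, ref_logprobs, advantages, padding_mask,
                    clip_range=0.2, beta=0.1):
    """Functional form of the masked, clipped policy-gradient + KL loss."""
    ratio = torch.exp(logprobs - ref_logprobs)
    clipped = torch.clamp(ratio, 1 - clip_range, 1 + clip_range)
    pg = -torch.min(ratio * advantages, clipped * advantages)
    kl = ref_logprobs - logprobs
    masked = (pg + beta * kl) * padding_mask
    return masked.sum() / padding_mask.sum()


# Smoke test on dummy (batch, seq_len) tensors: expect a finite scalar.
torch.manual_seed(0)
logprobs = torch.randn(2, 5)
loss = clipped_pg_loss(
    logprobs,
    logprobs.clone(),        # pretend the policy equals the reference
    torch.randn(2, 5),       # dummy advantages
    torch.ones(2, 5),        # no padding
)
```

With `logprobs == ref_logprobs` the ratio is exactly 1 and the KL term vanishes, so the loss reduces to minus the mean advantage — a quick consistency check for any new loss class.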
Step 2: Integrate into Application
```python
# apps/custom/main.py
from forge.losses.custom_loss import CustomLoss

loss_fn = CustomLoss(clip_range=0.2, beta=0.1)

loss = loss_fn(
    logprobs=logprobs,
    ref_logprobs=ref_logprobs,
    advantages=advantages,
    padding_mask=padding_mask,
)
```
Workflow 3: Multi-GPU Distributed Training
Use this workflow for scaling to multiple GPUs or nodes.
Configuration for Distributed

```yaml
# config/distributed.yaml
model: "meta-llama/Meta-Llama-3.1-8B-Instruct"

parallelism:
  tensor_parallel_degree: 2     # Split model across GPUs
  pipeline_parallel_degree: 1
  data_parallel_shard_degree: 2

services:
  generator:
    procs: 2      # 2 processes for TP=2
    num_replicas: 1
    with_gpus: true
  trainer:
    procs: 2
    num_replicas: 1
    with_gpus: true
```
Launch with SLURM

```bash
sbatch --nodes=2 --gpus-per-node=8 run_grpo.sh
```
Launch Locally (Multi-GPU)

```bash
python -m apps.grpo.main \
  --config config/distributed.yaml \
  --trainer.procs 4 \
  --generator.procs 4
```
Core API Reference
Training Batch Format
torchforge uses dictionary-based batches for training:

```python
# inputs: list of dicts with torch.Tensor values
inputs = [{"tokens": torch.Tensor}]
```
```python
# targets: list of dicts with training signals
targets = [{
    "response": torch.Tensor,
    "ref_logprobs": torch.Tensor,
    "advantages": torch.Tensor,
    "padding_mask": torch.Tensor,
}]
```
```python
# train_step returns the loss as a float
loss = trainer.train_step(inputs, targets)
```
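In GRPO, the `advantages` entries are group-relative: each prompt's `n_samples` rewards are normalized by their own group mean and standard deviation. A hedged sketch of that computation (our helper, not torchforge's exact one):

```python
import torch


def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, n_samples). Normalize each prompt's group of
    rewards by its own mean and standard deviation."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)


# Two correct and two incorrect responses to one prompt:
adv = group_relative_advantages(torch.tensor([[1.0, 0.0, 1.0, 0.0]]))
# correct responses get positive advantages, incorrect ones negative
```

Because advantages are relative within a group, a prompt where every sample succeeds (or fails) contributes zero learning signal — which is why `n_samples` matters for GRPO.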
Generated output from vLLM:

```python
@dataclass
class Completion:
    text: str              # Generated text
    token_ids: list[int]   # Token IDs
    logprobs: list[float]  # Log probabilities
    metadata: dict         # Custom metadata
```
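For example, the per-token `logprobs` can be summed to score a whole completion. The dataclass is redeclared below so the example is runnable on its own; `sequence_logprob` is our illustrative helper:

```python
from dataclasses import dataclass, field


@dataclass
class Completion:  # mirrors the structure above, for a self-contained example
    text: str
    token_ids: list[int]
    logprobs: list[float]
    metadata: dict = field(default_factory=dict)


def sequence_logprob(c: Completion) -> float:
    """Total log-probability of the sampled sequence (sum of per-token logprobs)."""
    return sum(c.logprobs)


c = Completion(text="42", token_ids=[19, 17], logprobs=[-0.5, -1.5])
print(sequence_logprob(c))  # -2.0
```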
Built-in Loss Functions
Loss functions are in the `forge.losses` module:

```python
from forge.losses import SimpleGRPOLoss, ReinforceLoss

# SimpleGRPOLoss for GRPO training
loss_fn = SimpleGRPOLoss(beta=0.1)

loss = loss_fn(
    logprobs=logprobs,
    ref_logprobs=ref_logprobs,
    advantages=advantages,
    padding_mask=padding_mask,
)
```
ReinforceLoss

```python
from forge.losses.reinforce_loss import ReinforceLoss

# With optional importance-ratio clipping
loss_fn = ReinforceLoss(clip_ratio=0.2)
```
Common Issues and Solutions
Issue: Not Enough GPUs
Symptoms: "Insufficient GPU resources" error
Solutions:

Reduce service requirements:

```yaml
services:
  generator:
    procs: 1
    with_gpus: true
  trainer:
    procs: 1
    with_gpus: true
```

Remove ref_model (uses generator weights), or run the reference model on CPU:

```yaml
ref_model:
  with_gpus: false
```
Issue: OOM During Generation
Symptoms: CUDA OOM in vLLM
Solutions:

Reduce samples per prompt:

```yaml
grpo:
  n_samples: 4   # Reduce from 8
```

Or reduce the sequence length.
Issue: Slow Weight Sync
Symptoms: Long pauses between training and generation
Solutions:

Enable RDMA (if available):

```bash
export TORCHSTORE_USE_RDMA=1
```

Or reduce the sync frequency:

```yaml
training:
  sync_interval: 10   # Sync every 10 steps
```
Issue: Policy Collapse
Symptoms: Entropy drops to zero, reward stops improving
Solutions:

Increase the KL penalty:

```yaml
grpo:
  beta: 0.2   # Increase from 0.1
```

Or add an entropy bonus:

```yaml
training:
  entropy_coef: 0.01
```
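Entropy collapse is easiest to catch early by logging mean token entropy during training. A monitoring sketch (our helper, assuming `(batch, seq, vocab)` logits and a `(batch, seq)` padding mask):

```python
import torch


def mean_token_entropy(logits: torch.Tensor, padding_mask: torch.Tensor) -> torch.Tensor:
    """Mean per-token entropy over non-padded positions; values near zero
    indicate the policy is collapsing onto single tokens."""
    logp = torch.log_softmax(logits, dim=-1)
    entropy = -(logp.exp() * logp).sum(dim=-1)        # (batch, seq)
    return (entropy * padding_mask).sum() / padding_mask.sum()


# Uniform logits over an 8-token vocab give entropy log(8) ≈ 2.079
ent = mean_token_entropy(torch.zeros(1, 3, 8), torch.ones(1, 3))
```

Tracking this alongside reward makes it obvious whether a reward plateau is genuine convergence or the policy has simply stopped exploring.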