# OpenClaw-RL Training
Skill by ara.so — Daily 2026 Skills collection.
OpenClaw-RL is a fully asynchronous reinforcement learning framework that converts live multi-turn conversations into training signals for personalized AI agents. It wraps a self-hosted model as an OpenAI-compatible API via OpenClaw, intercepts conversations, and continuously optimizes the policy in the background without interrupting usage. It also supports scalable RL for terminal, GUI, SWE, and tool-call agents.
## Architecture Overview

The framework runs four independent asynchronous loops that never block each other.
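The loop boundaries are not spelled out here, but the decoupling pattern can be sketched with `asyncio` queues. The stage names below (`serve`, `capture`, `score`, `train`) are illustrative placeholders, not the framework's actual module names:

```python
import asyncio

async def stage(name: str, inbox: asyncio.Queue, outbox: asyncio.Queue):
    """One decoupled loop: consume from its own queue, process, pass on.
    A slow stage backs up only its own queue; upstream stages keep running."""
    while True:
        item = await inbox.get()
        if item is None:               # shutdown sentinel
            await outbox.put(None)
            return
        await outbox.put(f"{name}({item})")

async def main() -> str:
    # serve -> capture -> score -> train, each pair decoupled by a queue
    queues = [asyncio.Queue() for _ in range(5)]
    names = ["serve", "capture", "score", "train"]
    tasks = [asyncio.create_task(stage(n, queues[i], queues[i + 1]))
             for i, n in enumerate(names)]
    await queues[0].put("turn0")
    await queues[0].put(None)          # signal shutdown through the chain
    await asyncio.gather(*tasks)
    return await queues[-1].get()

result = asyncio.run(main())           # "train(score(capture(serve(turn0))))"
```

Because each stage communicates only through its queues, live serving is never blocked on reward scoring or policy updates.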
## Installation

```bash
git clone https://github.com/Gen-Verse/OpenClaw-RL
cd OpenClaw-RL

# Install core dependencies
pip install -r requirements.txt

# Install slime (training backend)
cd slime && pip install -e . && cd ..

# Optional: install SGLang for fast inference
pip install sglang
```
## Project Structure

```
OpenClaw-RL/
├── openclaw-rl/       # Binary RL (GRPO) method
├── openclaw-opd/      # On-Policy Distillation method
├── openclaw-combine/  # Combined Binary RL + OPD
├── openclaw-test/     # Evaluation utilities
├── terminal-rl/       # Track 2: Terminal agent RL
├── gui-rl/            # Track 2: GUI agent RL
├── swe-rl/            # Track 2: SWE agent RL
├── toolcall-rl/       # Track 2: Tool-call agent RL
├── slime/             # Core training framework
└── openclaw/          # Runtime / API server
```

## Three Learning Paradigms
### 1. Binary RL (GRPO)

A Process Reward Model (PRM) scores each turn from next-state feedback. Advantages are estimated with GRPO and optimized with a PPO-style clipped surrogate loss.
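As a rough sketch of the GRPO advantage step (inferred from the description above, not the repository's exact code): each rollout's scalar PRM reward is normalized against the other rollouts sampled for the same prompt.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantage: normalize each rollout's scalar reward by
    the mean/std of all rollouts sampled for the same prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Four rollouts of one prompt, scored 0/1 by the PRM
rewards = torch.tensor([1.0, 0.0, 1.0, 1.0])
adv = grpo_advantages(rewards)   # the failed rollout gets a negative advantage
```

Every token in a rollout then shares that rollout's scalar advantage inside the clipped surrogate loss.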
### 2. On-Policy Distillation (OPD)

When the next state reveals useful hindsight, a judge model extracts a textual hint that augments the prompt, creating an enhanced teacher. The token-level log-probability gap between teacher and student becomes a directional advantage signal.
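The directional signal can be sketched as a per-token log-probability gap; `opd_advantages` below is a hypothetical helper for illustration, not the framework's API:

```python
import torch

def opd_advantages(teacher_logprobs: torch.Tensor,
                   student_logprobs: torch.Tensor) -> torch.Tensor:
    """Per-token directional signal: positive where the hint-augmented
    teacher assigns more probability to the sampled token than the student."""
    return teacher_logprobs - student_logprobs

teacher = torch.tensor([-0.1, -0.5, -2.0])   # log p(token | prompt + hint)
student = torch.tensor([-0.3, -0.4, -3.0])   # log p(token | prompt)
adv = opd_advantages(teacher, student)       # positive where the hint helped
```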
### 3. Combination Method (Recommended)

Merges the Binary RL scalar supervision with the OPD token-level directional signal for the strongest and most robust optimization.
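One plausible way to merge the two signals is to broadcast the scalar advantage over tokens and add the weighted directional term; the exact combination used by openclaw-combine is not documented here, and `opd_weight` is an assumed knob:

```python
import torch

def combine_advantages(binary_adv: torch.Tensor,
                       opd_adv: torch.Tensor,
                       opd_weight: float = 0.5) -> torch.Tensor:
    """Broadcast the per-rollout scalar (GRPO) advantage across tokens and
    add the weighted per-token (OPD) signal. binary_adv: (B,), opd_adv: (B, T)."""
    return binary_adv.unsqueeze(-1) + opd_weight * opd_adv

binary_adv = torch.tensor([0.5, -1.5])               # per rollout
opd_adv = torch.tensor([[0.2, 1.0], [0.0, -0.4]])    # per token
combined = combine_advantages(binary_adv, opd_adv)
```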
## Quick Start — Personal Agent (Track 1)

### Binary RL Launch Script

```bash
# openclaw-rl/run_qwen3_7b_openclaw_rl.sh
export MODEL_PATH=/path/to/qwen3-7b
export DATA_PATH=/path/to/conversation/data
export CKPT_SAVE_DIR=/path/to/checkpoints
bash openclaw-rl/run_qwen3_7b_openclaw_rl.sh
```
### OPD Launch Script

```bash
export MODEL_PATH=/path/to/qwen3-7b
export JUDGE_MODEL_PATH=/path/to/judge-model
export DATA_PATH=/path/to/conversation/data
bash openclaw-opd/run_qwen3_7b_openclaw_opd.sh
```

### Combination Method (One Line)

```bash
# Launch with combined Binary RL + OPD
bash openclaw-combine/run_qwen3_7b_openclaw_combine.sh
```
## Configuration — Key Environment Variables

```bash
# Model configuration
export MODEL_PATH=/path/to/base/model
export JUDGE_MODEL_PATH=/path/to/judge/model   # For OPD
export PRM_MODEL_PATH=/path/to/prm/model       # For Binary RL

# Training configuration
export CKPT_SAVE_DIR=./checkpoints
export CKPT_ARGS="--save-interval 100 --save-dir $CKPT_SAVE_DIR"

# Rollout configuration
export ROLLOUT_ARGS="--rollout-batch-size 64 --num-rollouts-per-prompt 4"

# Optimizer configuration
export OPTIMIZER_ARGS="--lr 1e-6 --weight-decay 0.01 --adam-beta1 0.9 --adam-beta2 0.999"

# GPU partitioning (e.g., 8 GPUs: 4 for training, 4 for rollout)
export TRAIN_GPUS="0,1,2,3"
export ROLLOUT_GPUS="4,5,6,7"

# LoRA (optional, reduces GPU memory)
export LORA_ARGS="--lora-rank 64 --lora-alpha 128 --lora-dropout 0.05"
```
## LoRA Training

```bash
# Add LoRA args to any launch script
export LORA_ARGS="--use-lora --lora-rank 64 --lora-alpha 128"

# Example: LoRA Binary RL
bash openclaw-rl/run_qwen3_7b_lora_openclaw_rl.sh
```
## Custom Loss / Rollout Functions (Plugin API)

The slime framework exposes extension points without modifying core code:

```bash
# Custom loss function
--custom-loss-function-path ./my_method/custom_loss.py

# Custom rollout function
--rollout-function-path ./my_method/custom_rollout.py

# Custom generation function
--custom-generate-function-path ./my_method/custom_generate.py

# Custom reward model
--custom-rm-path ./my_method/custom_rm.py
```
## Example Custom Loss

```python
# my_method/custom_loss.py
import torch
from typing import Dict, Any


def compute_loss(
    policy_logits: torch.Tensor,
    reference_logits: torch.Tensor,
    rewards: torch.Tensor,
    advantages: torch.Tensor,
    config: Dict[str, Any],
) -> torch.Tensor:
    """Custom GRPO-style loss with clipped surrogate objective."""
    # Log-ratio between policy and reference
    log_ratio = policy_logits - reference_logits
    ratio = torch.exp(log_ratio)
    clip_range = config.get("clip_range", 0.2)

    # PPO-style clipped objective
    clipped = torch.clamp(ratio, 1 - clip_range, 1 + clip_range)
    loss = -torch.min(ratio * advantages, clipped * advantages).mean()

    # KL penalty
    kl_coeff = config.get("kl_coeff", 0.01)
    kl_penalty = kl_coeff * log_ratio.mean()
    return loss + kl_penalty
```

## Example Custom Reward Model
```python
# my_method/custom_rm.py
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer


class CustomPRM:
    def __init__(self, model_path: str):
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.model = AutoModelForSequenceClassification.from_pretrained(
            model_path, torch_dtype=torch.bfloat16
        )
        self.model.eval()

    def score(self, prompt: str, response: str, next_state: str) -> float:
        """Score a turn given prompt, response, and next-state feedback."""
        combined = f"Prompt: {prompt}\nResponse: {response}\nOutcome: {next_state}"
        inputs = self.tokenizer(combined, return_tensors="pt", truncation=True, max_length=2048)
        with torch.no_grad():
            logits = self.model(**inputs).logits
        # Binary reward: positive-class probability
        return torch.softmax(logits, dim=-1)[0, 1].item()


def get_reward_model(config):
    return CustomPRM(config["prm_model_path"])
```

## Deploying on Tinker (Cloud)
```bash
# One-line cloud deployment — Hybrid RL, OPD, and Binary RL are all supported
export TINKER_API_KEY=$TINKER_API_KEY
export TINKER_ENDPOINT=$TINKER_ENDPOINT

# Submit job via Ray
ray job submit --address $TINKER_ENDPOINT \
    --working-dir . \
    -- bash openclaw-combine/run_qwen3_7b_openclaw_combine.sh
```
## Track 2 — General Agentic RL

### Terminal Agent RL

```bash
export ENV_TYPE=terminal
export MAX_STEPS=20
export PARALLEL_ENVS=32   # Number of parallel environment instances
bash terminal-rl/run_terminal_rl.sh
```

### GUI Agent RL

```bash
export ENV_TYPE=gui
export SCREENSHOT_BACKEND=playwright   # or selenium
export PARALLEL_ENVS=16
bash gui-rl/run_gui_rl.sh
```

### Tool-Call Agent RL

```bash
export ENV_TYPE=toolcall
export TOOLS_CONFIG=./toolcall-rl/tools_config.json
export PARALLEL_ENVS=64
bash toolcall-rl/run_toolcall_rl.sh
```

### SWE Agent RL

```bash
export ENV_TYPE=swe
export SWE_BENCH_PATH=/path/to/swe-bench
export PARALLEL_ENVS=8   # SWE environments are heavier
bash swe-rl/run_swe_rl.sh
```

## Data Format — Conversation Trajectories
OpenClaw-RL automatically classifies API messages. Manual format for custom data:

```json
{
  "session_id": "user_session_abc123",
  "turns": [
    {
      "type": "main",
      "prompt": "Help me refactor this function to use async/await",
      "response": "Here's the refactored version: ...",
      "next_state": "User accepted the change and said 'perfect, thanks!'",
      "trainable": true
    },
    {
      "type": "side",
      "prompt": "What is 2+2?",
      "response": "4",
      "trainable": false
    }
  ]
}
```

- `main` turns: multi-turn interactions that form training trajectories
- `side` turns: non-trainable system/utility turns excluded from training
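A minimal loader for this format might filter a session down to the turns that actually carry training signal; `trainable_turns` is an illustrative helper, not part of the OpenClaw-RL API:

```python
import json

def trainable_turns(session: dict) -> list:
    """Keep only turns that contribute training signal: type 'main',
    marked trainable, and carrying next-state feedback."""
    return [
        turn for turn in session.get("turns", [])
        if turn.get("type") == "main"
        and turn.get("trainable")
        and turn.get("next_state")
    ]

session = json.loads("""
{
  "session_id": "user_session_abc123",
  "turns": [
    {"type": "main", "prompt": "Refactor this function", "response": "...",
     "next_state": "User accepted the change", "trainable": true},
    {"type": "side", "prompt": "What is 2+2?", "response": "4",
     "trainable": false}
  ]
}
""")
picked = trainable_turns(session)   # only the 'main' turn survives
```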
## OpenClaw API Server Setup

```bash
# Start an OpenClaw-compatible API server wrapping your model
export BASE_MODEL_PATH=/path/to/your/model
export OPENCLAW_PORT=8000
export OPENCLAW_HOST=0.0.0.0

# Using the SGLang backend (recommended for speed).
# --enable-rl-intercept enables conversation capture for RL;
# --rl-buffer-dir is where captured trajectories are stored.
python -m openclaw.server \
    --model-path $BASE_MODEL_PATH \
    --port $OPENCLAW_PORT \
    --backend sglang \
    --enable-rl-intercept \
    --rl-buffer-dir ./rl_buffer
```

```typescript
// Using the server as an OpenAI-compatible API in TypeScript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:8000/v1",
  apiKey: process.env.OPENCLAW_API_KEY ?? "local",
});

const response = await client.chat.completions.create({
  model: "your-model-name",
  messages: [
    { role: "user", content: "Help me write a sorting algorithm" }
  ],
  stream: true,
});

for await (const chunk of response) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}
```

## Majority Voting for Robust PRM Scoring
```bash
# Enable majority voting for more robust reward estimation
export MAJORITY_VOTE_N=5           # Number of judge calls per turn
export MAJORITY_VOTE_THRESHOLD=0.6

# Add to your launch script args:
--majority-vote-n $MAJORITY_VOTE_N \
--majority-vote-threshold $MAJORITY_VOTE_THRESHOLD
```
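The aggregation itself is simple to sketch; `majority_vote` below illustrates the mechanism and is not the framework's implementation:

```python
def majority_vote(scores: list, threshold: float = 0.6) -> float:
    """Collapse N independent judge scores into one binary reward:
    1.0 if the fraction of positive (>= 0.5) votes meets the threshold."""
    votes = [1 if s >= 0.5 else 0 for s in scores]
    return 1.0 if sum(votes) / len(votes) >= threshold else 0.0

# Five judge calls for one turn (MAJORITY_VOTE_N=5, threshold 0.6)
reward = majority_vote([0.9, 0.7, 0.3, 0.8, 0.6])   # 4/5 positive -> 1.0
```

Repeated sampling smooths out individual noisy judge calls, at the cost of N times the judge throughput per turn.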
## Adding a New Method (Contribution Pattern)

```bash
# 1. Create a new top-level folder
mkdir my-new-method
cd my-new-method

# 2. Required files
touch README.md                    # Document the what, the how, and env vars
touch run_qwen3_7b_my_method.sh    # Launch script
touch custom_loss.py               # If a custom loss is needed
touch custom_rollout.py            # If a custom rollout is needed
```

`run_qwen3_7b_my_method.sh` should follow the existing conventions:

```bash
#!/bin/bash
set -e

MODEL_SIZE="7b"
MODEL_PATH=${MODEL_PATH:-/path/to/qwen3-7b}
CKPT_SAVE_DIR=${CKPT_SAVE_DIR:-./checkpoints/my-method}

CKPT_ARGS="--save-interval 50 --save-dir $CKPT_SAVE_DIR"
ROLLOUT_ARGS="--rollout-batch-size 32 --num-rollouts-per-prompt 4"
OPTIMIZER_ARGS="--lr 1e-6 --weight-decay 0.01"

ray job submit --working-dir .. -- \
    python slime/train.py \
    --model-path $MODEL_PATH \
    --custom-loss-function-path my-new-method/custom_loss.py \
    $CKPT_ARGS $ROLLOUT_ARGS $OPTIMIZER_ARGS
```
## Common Patterns

### Monitor Training Progress

```bash
# View the Ray dashboard
ray dashboard   # Opens at http://localhost:8265

# Watch checkpoint saves
watch -n 10 ls -la $CKPT_SAVE_DIR

# Stream training logs
tail -f ./logs/training.log
```
### Resume from Checkpoint

```bash
export RESUME_CKPT=$CKPT_SAVE_DIR/checkpoint-500

# Add to the launch script:
--resume-from-checkpoint $RESUME_CKPT
```
### Evaluate Trained Checkpoints

```bash
bash openclaw-test/run_eval.sh \
    --model-path $CKPT_SAVE_DIR/checkpoint-latest \
    --eval-tasks "conversation,coding,tool-use"
```

## Troubleshooting
**Out of GPU memory during rollout + training:**

```bash
# Use LoRA to reduce the memory footprint
export LORA_ARGS="--use-lora --lora-rank 32"

# Or reduce the number of parallel environments
export PARALLEL_ENVS=8

# Or offload optimizer state
--offload-optimizer-state
```

**Async loop falling behind (buffer overflow):**

```bash
# Reduce the rollout batch size or increase judge throughput
export ROLLOUT_ARGS="--rollout-batch-size 16"

# Or add more judge workers
--num-judge-workers 4
```

**PRM scores all near 0.5 (reward collapse):**

- Verify that `next_state` fields contain meaningful feedback signals
- Check that the judge model's prompt template matches the expected format
- Try increasing the majority-vote N: `--majority-vote-n 7`

**SGLang server not starting:**

```bash
# Check SGLang version compatibility
pip install sglang==0.4.x   # See slime/requirements.txt for the pinned version

# Or fall back to the vLLM backend
--backend vllm
```

**Ray job submission fails:**

```bash
# Start the Ray cluster first
ray start --head --num-gpus=$(nvidia-smi -L | wc -l)

# Then submit the job
ray job submit --address auto -- bash run.sh
```
## Key References

- Technical Report (arXiv)
- OpenClaw Plugin
- Slime Training Framework
- Tinker Cloud Platform
- SDFT Paper — integrated in openclaw-opd
- SDPO Paper — integrated in openclaw-opd