OpenClaw-RL Training


Skill by ara.so — Daily 2026 Skills collection.
OpenClaw-RL is a fully asynchronous reinforcement learning framework that converts live multi-turn conversations into training signals for personalized AI agents. It wraps a self-hosted model as an OpenAI-compatible API via OpenClaw, intercepts conversations, and continuously optimizes the policy in the background without interrupting usage. It also supports scalable RL for terminal, GUI, SWE, and tool-call agents.

Architecture Overview


Four independent async loops that never block each other:
  1. Agent Serving — OpenClaw-compatible API serving rollouts
  2. Rollout Collection — Captures multi-turn conversations as training trajectories
  3. PRM/Judge Evaluation — Scores turns using next-state feedback (majority voting optional)
  4. Policy Training — GRPO/OPD/Combine training via slime or Tinker
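The four-loop decoupling can be sketched with `asyncio` queues; the loop bodies and queue sizes below are illustrative stand-ins, not OpenClaw-RL's actual internals:

```python
import asyncio

# Bounded queues decouple the stages: a slow stage backs up its own
# queue instead of blocking the loops upstream or downstream of it.
rollouts: asyncio.Queue = asyncio.Queue(maxsize=64)   # collection -> evaluation
scored: asyncio.Queue = asyncio.Queue(maxsize=64)     # evaluation -> training

async def collect(n: int) -> None:
    # Stand-in for rollout collection: capture n conversation turns.
    for turn in range(n):
        await rollouts.put({"turn": turn})

async def evaluate(n: int) -> None:
    # Stand-in for PRM/judge evaluation: score each captured turn.
    for _ in range(n):
        traj = await rollouts.get()
        traj["score"] = 1.0
        await scored.put(traj)

async def train(n: int) -> list:
    # Stand-in for policy training: consume scored trajectories.
    consumed = []
    for _ in range(n):
        consumed.append(await scored.get())
    return consumed

async def pipeline(n: int = 3) -> list:
    results = await asyncio.gather(collect(n), evaluate(n), train(n))
    return results[2]

trained_on = asyncio.run(pipeline())
```

Because each stage only blocks on its own queue, serving stays responsive even while evaluation or training lags behind.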

Installation


```bash
git clone https://github.com/Gen-Verse/OpenClaw-RL
cd OpenClaw-RL
```

Install core dependencies

```bash
pip install -r requirements.txt
```

Install slime (training backend)

```bash
cd slime && pip install -e . && cd ..
```

Optional: install SGLang for fast inference

```bash
pip install sglang
```

Project Structure


```
OpenClaw-RL/
├── openclaw-rl/          # Binary RL (GRPO) method
├── openclaw-opd/         # On-Policy Distillation method
├── openclaw-combine/     # Combined Binary RL + OPD
├── openclaw-test/        # Evaluation utilities
├── terminal-rl/          # Track 2: Terminal agent RL
├── gui-rl/               # Track 2: GUI agent RL
├── swe-rl/               # Track 2: SWE agent RL
├── toolcall-rl/          # Track 2: Tool-call agent RL
├── slime/                # Core training framework
└── openclaw/             # Runtime / API server
```

Three Learning Paradigms


1. Binary RL (GRPO)


A Process Reward Model scores each turn from next-state feedback. Uses GRPO advantage estimation with PPO-style clipped surrogate loss.
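A minimal sketch of the group-relative advantage at the heart of GRPO: rewards for rollouts of the same prompt are normalized against the group mean and standard deviation (the `eps` stabilizer is illustrative, not a framework constant):

```python
import statistics

def grpo_advantages(group_rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each rollout's reward against its prompt group."""
    mean = statistics.fmean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]

# Four rollouts of one prompt, scored 1/0 by the PRM:
advs = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```

Rollouts scored above the group mean get positive advantages and are reinforced under the clipped surrogate loss; those below are suppressed.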

2. On-Policy Distillation (OPD)


When the next state reveals useful hindsight, a judge extracts a textual hint and appends it to the prompt, creating an enhanced teacher. The token-level log-probability gap between teacher and student then serves as a directional advantage signal.
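A sketch of that directional signal, under the assumption that it is the per-token difference between teacher and student log-probabilities of the sampled tokens (the function name and values are hypothetical):

```python
def opd_token_advantages(student_logps: list[float],
                         teacher_logps: list[float]) -> list[float]:
    """Positive where the hint-augmented teacher assigns the sampled
    token more probability than the student; negative where less."""
    return [t - s for s, t in zip(student_logps, teacher_logps)]

# Three sampled tokens; the teacher saw the judge's hindsight hint.
advs = opd_token_advantages(
    student_logps=[-2.0, -0.5, -1.0],
    teacher_logps=[-0.5, -0.5, -2.0],
)
# advs[0] pushes toward its token, advs[2] pushes away, advs[1] is neutral
```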

3. Combination Method (Recommended)


Merges Binary RL's scalar supervision with OPD's token-level directional signal, yielding the strongest and most robust optimization.
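One plausible way to merge the two signals, shown as a sketch (the broadcast-and-mix scheme and the `alpha` weight are assumptions for illustration, not the framework's documented combination rule):

```python
def combine_advantages(scalar_adv: float,
                       opd_token_advs: list[float],
                       alpha: float = 0.5) -> list[float]:
    """Broadcast the turn-level Binary-RL advantage across tokens and
    mix it with the token-level OPD signal."""
    return [alpha * scalar_adv + (1 - alpha) * a for a in opd_token_advs]

# Turn-level advantage +1.0, token-level OPD signal [1.5, 0.0, -1.0]:
combined = combine_advantages(1.0, [1.5, 0.0, -1.0])
```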

Quick Start — Personal Agent (Track 1)


Binary RL Launch Script


```bash
# openclaw-rl/run_qwen3_7b_openclaw_rl.sh
export MODEL_PATH=/path/to/qwen3-7b
export DATA_PATH=/path/to/conversation/data
export CKPT_SAVE_DIR=/path/to/checkpoints

bash openclaw-rl/run_qwen3_7b_openclaw_rl.sh
```

OPD Launch Script


```bash
export MODEL_PATH=/path/to/qwen3-7b
export JUDGE_MODEL_PATH=/path/to/judge-model
export DATA_PATH=/path/to/conversation/data

bash openclaw-opd/run_qwen3_7b_openclaw_opd.sh
```

Combination Method (One Line)


```bash
# Launch with combined Binary RL + OPD
bash openclaw-combine/run_qwen3_7b_openclaw_combine.sh
```

Configuration — Key Environment Variables



Model configuration


```bash
export MODEL_PATH=/path/to/base/model
export JUDGE_MODEL_PATH=/path/to/judge/model   # For OPD
export PRM_MODEL_PATH=/path/to/prm/model       # For Binary RL
```

Training configuration


```bash
export CKPT_SAVE_DIR=./checkpoints
export CKPT_ARGS="--save-interval 100 --save-dir $CKPT_SAVE_DIR"
```

Rollout configuration


```bash
export ROLLOUT_ARGS="--rollout-batch-size 64 --num-rollouts-per-prompt 4"
```

Optimizer configuration


```bash
export OPTIMIZER_ARGS="--lr 1e-6 --weight-decay 0.01 --adam-beta1 0.9 --adam-beta2 0.999"
```

GPU partitioning (e.g., 8 GPUs: 4 for training, 4 for rollout)


```bash
export TRAIN_GPUS="0,1,2,3"
export ROLLOUT_GPUS="4,5,6,7"
```

LoRA (optional, reduces GPU memory)


```bash
export LORA_ARGS="--lora-rank 64 --lora-alpha 128 --lora-dropout 0.05"
```

LoRA Training



Add LoRA args to any launch script


```bash
export LORA_ARGS="--use-lora --lora-rank 64 --lora-alpha 128"
```

Example: LoRA Binary RL


```bash
bash openclaw-rl/run_qwen3_7b_lora_openclaw_rl.sh
```

Custom Loss / Rollout Functions (Plugin API)


The slime framework exposes extension points without modifying core code:

Custom loss function


```bash
--custom-loss-function-path ./my_method/custom_loss.py
```

Custom rollout function


```bash
--rollout-function-path ./my_method/custom_rollout.py
```

Custom generation function


```bash
--custom-generate-function-path ./my_method/custom_generate.py
```

Custom reward model


```bash
--custom-rm-path ./my_method/custom_rm.py
```

Example Custom Loss (Python)


```python
# my_method/custom_loss.py
import torch
from typing import Dict, Any


def compute_loss(
    policy_logits: torch.Tensor,
    reference_logits: torch.Tensor,
    rewards: torch.Tensor,
    advantages: torch.Tensor,
    config: Dict[str, Any],
) -> torch.Tensor:
    """Custom GRPO-style loss with clipped surrogate objective."""
    # Log-ratio between policy and reference
    log_ratio = policy_logits - reference_logits
    ratio = torch.exp(log_ratio)

    clip_range = config.get("clip_range", 0.2)

    # PPO-style clipped objective
    clipped = torch.clamp(ratio, 1 - clip_range, 1 + clip_range)
    loss = -torch.min(ratio * advantages, clipped * advantages).mean()

    # KL penalty
    kl_coeff = config.get("kl_coeff", 0.01)
    kl_penalty = kl_coeff * log_ratio.mean()

    return loss + kl_penalty
```

Example Custom Reward Model


```python
# my_method/custom_rm.py
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer


class CustomPRM:
    def __init__(self, model_path: str):
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.model = AutoModelForSequenceClassification.from_pretrained(
            model_path, torch_dtype=torch.bfloat16
        )
        self.model.eval()

    def score(self, prompt: str, response: str, next_state: str) -> float:
        """Score a turn given prompt, response, and next-state feedback."""
        combined = f"Prompt: {prompt}\nResponse: {response}\nOutcome: {next_state}"
        inputs = self.tokenizer(combined, return_tensors="pt", truncation=True, max_length=2048)

        with torch.no_grad():
            logits = self.model(**inputs).logits

        # Binary reward: positive class probability
        return torch.softmax(logits, dim=-1)[0, 1].item()


def get_reward_model(config):
    return CustomPRM(config["prm_model_path"])
```

Deploying on Tinker (Cloud)



One-line cloud deployment — Hybrid RL, OPD, Binary RL all supported


```bash
export TINKER_API_KEY=$TINKER_API_KEY
export TINKER_ENDPOINT=$TINKER_ENDPOINT
```

Submit job via Ray


```bash
ray job submit --address $TINKER_ENDPOINT \
  --working-dir . \
  -- bash openclaw-combine/run_qwen3_7b_openclaw_combine.sh
```

Track 2 — General Agentic RL


Terminal Agent RL


```bash
export ENV_TYPE=terminal
export MAX_STEPS=20
export PARALLEL_ENVS=32   # Number of parallel environment instances

bash terminal-rl/run_terminal_rl.sh
```

GUI Agent RL


```bash
export ENV_TYPE=gui
export SCREENSHOT_BACKEND=playwright   # or selenium
export PARALLEL_ENVS=16

bash gui-rl/run_gui_rl.sh
```

Tool-Call Agent RL


```bash
export ENV_TYPE=toolcall
export TOOLS_CONFIG=./toolcall-rl/tools_config.json
export PARALLEL_ENVS=64

bash toolcall-rl/run_toolcall_rl.sh
```

SWE Agent RL


```bash
export ENV_TYPE=swe
export SWE_BENCH_PATH=/path/to/swe-bench
export PARALLEL_ENVS=8   # SWE environments are heavier

bash swe-rl/run_swe_rl.sh
```

Data Format — Conversation Trajectories


OpenClaw-RL automatically classifies API messages. Manual format for custom data:
```json
{
  "session_id": "user_session_abc123",
  "turns": [
    {
      "type": "main",
      "prompt": "Help me refactor this function to use async/await",
      "response": "Here's the refactored version: ...",
      "next_state": "User accepted the change and said 'perfect, thanks!'",
      "trainable": true
    },
    {
      "type": "side",
      "prompt": "What is 2+2?",
      "response": "4",
      "trainable": false
    }
  ]
}
```
  • `main` turns: multi-turn interactions that form training trajectories
  • `side` turns: non-trainable system/utility turns excluded from training
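The `main`/`side` split can be applied with a few lines of Python; the filter below mirrors the format above:

```python
import json

session = json.loads("""
{
  "session_id": "user_session_abc123",
  "turns": [
    {"type": "main", "prompt": "Refactor this function",
     "response": "Here's the refactored version: ...",
     "next_state": "User accepted the change", "trainable": true},
    {"type": "side", "prompt": "What is 2+2?",
     "response": "4", "trainable": false}
  ]
}
""")

# Keep only trainable main turns as training trajectories.
trajectories = [t for t in session["turns"]
                if t["type"] == "main" and t.get("trainable", False)]
```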

OpenClaw API Server Setup



Start OpenClaw-compatible API server wrapping your model


```bash
export BASE_MODEL_PATH=/path/to/your/model
export OPENCLAW_PORT=8000
export OPENCLAW_HOST=0.0.0.0
```

Using SGLang backend (recommended for speed)


```bash
# --enable-rl-intercept: enable conversation capture for RL
# --rl-buffer-dir: where to store captured trajectories
python -m openclaw.server \
  --model-path $BASE_MODEL_PATH \
  --port $OPENCLAW_PORT \
  --backend sglang \
  --enable-rl-intercept \
  --rl-buffer-dir ./rl_buffer
```

```typescript
// Using the server as an OpenAI-compatible API in TypeScript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:8000/v1",
  apiKey: process.env.OPENCLAW_API_KEY ?? "local",
});

const response = await client.chat.completions.create({
  model: "your-model-name",
  messages: [
    { role: "user", content: "Help me write a sorting algorithm" }
  ],
  stream: true,
});

for await (const chunk of response) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}
```

Majority Voting for Robust PRM Scoring



Enable majority voting for more robust reward estimation


```bash
export MAJORITY_VOTE_N=5             # Number of judge calls per turn
export MAJORITY_VOTE_THRESHOLD=0.6
```

Add to your launch script args:


```bash
--majority-vote-n $MAJORITY_VOTE_N \
--majority-vote-threshold $MAJORITY_VOTE_THRESHOLD
```
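A sketch of how the two variables could interact, assuming each judge call yields a score in [0, 1] and a turn counts as positive when at least the threshold fraction of the N votes are positive (this aggregation rule is an assumption for illustration, not the documented behavior):

```python
def majority_vote(judge_scores: list[float], threshold: float = 0.6) -> float:
    """Binary reward: 1.0 if the fraction of positive votes
    (score >= 0.5) reaches the threshold, else 0.0."""
    positive = sum(1 for s in judge_scores if s >= 0.5)
    return 1.0 if positive / len(judge_scores) >= threshold else 0.0

# MAJORITY_VOTE_N=5 judge calls on one turn, MAJORITY_VOTE_THRESHOLD=0.6:
reward = majority_vote([0.9, 0.8, 0.2, 0.7, 0.4], threshold=0.6)
```

Repeated judge calls average out single-call noise, which is why raising N helps when rewards look unstable.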

Adding a New Method (Contribution Pattern)



1. Create a new top-level folder


```bash
mkdir my-new-method
cd my-new-method
```

2. Required files


```bash
touch README.md                   # Document what, how, env vars
touch run_qwen3_7b_my_method.sh   # Launch script
touch custom_loss.py              # If custom loss needed
touch custom_rollout.py           # If custom rollout needed
```

run_qwen3_7b_my_method.sh — follow existing conventions


```bash
#!/bin/bash
set -e

MODEL_SIZE="7b"
MODEL_PATH=${MODEL_PATH:-/path/to/qwen3-7b}
CKPT_SAVE_DIR=${CKPT_SAVE_DIR:-./checkpoints/my-method}

CKPT_ARGS="--save-interval 50 --save-dir $CKPT_SAVE_DIR"
ROLLOUT_ARGS="--rollout-batch-size 32 --num-rollouts-per-prompt 4"
OPTIMIZER_ARGS="--lr 1e-6 --weight-decay 0.01"

ray job submit --working-dir .. -- \
  python slime/train.py \
    --model-path $MODEL_PATH \
    --custom-loss-function-path my-new-method/custom_loss.py \
    $CKPT_ARGS $ROLLOUT_ARGS $OPTIMIZER_ARGS
```

Common Patterns


Monitor Training Progress



View Ray dashboard


```bash
ray dashboard   # Opens at http://localhost:8265
```

Watch checkpoint saves


```bash
watch -n 10 ls -la $CKPT_SAVE_DIR
```

Stream training logs


```bash
tail -f ./logs/training.log
```

Resume from Checkpoint


```bash
export RESUME_CKPT=$CKPT_SAVE_DIR/checkpoint-500
```

Add to launch script:


```bash
--resume-from-checkpoint $RESUME_CKPT
```

Evaluate Trained Checkpoints


```bash
bash openclaw-test/run_eval.sh \
  --model-path $CKPT_SAVE_DIR/checkpoint-latest \
  --eval-tasks "conversation,coding,tool-use"
```

Troubleshooting


**Out of GPU memory during rollout + training:**

Use LoRA to reduce memory footprint


```bash
export LORA_ARGS="--use-lora --lora-rank 32"
```

Or reduce parallel environments


```bash
export PARALLEL_ENVS=8
```

Or use offloading


```bash
--offload-optimizer-state
```

**Async loop falling behind (buffer overflow):**

Reduce rollout batch size or increase judge throughput


```bash
export ROLLOUT_ARGS="--rollout-batch-size 16"
```

Or add more judge workers


```bash
--num-judge-workers 4
```

**PRM scores all near 0.5 (reward collapse):**
- Verify that `next_state` fields contain meaningful feedback signals
- Check that the judge model's prompt template matches the expected format
- Try increasing the majority vote count: `--majority-vote-n 7`

**SGLang server not starting:**

Check SGLang version compatibility


```bash
pip install sglang==0.4.x   # Check slime/requirements.txt for the pinned version
```

Fallback to vLLM backend


```bash
--backend vllm
```

**Ray job submission fails:**

Start Ray cluster first


```bash
ray start --head --num-gpus=$(nvidia-smi -L | wc -l)
```

Then submit job


```bash
ray job submit --address auto -- bash run.sh
```

Key References
