fine-tuning-openvla-oft


OpenVLA-OFT

Fine-tuning and evaluation workflows for OpenVLA-OFT and OpenVLA-OFT+ from the official `openvla-oft` codebase. Covers bare-metal setup plus LoRA-based adaptation of OpenVLA for robot action generation with continuous action prediction heads.

Quick start


Clone the public repo, follow the official setup, then evaluate a pretrained LIBERO checkpoint:

```bash
git clone https://github.com/moojink/openvla-oft.git
cd openvla-oft
python experiments/robot/libero/run_libero_eval.py \
  --pretrained_checkpoint moojink/openvla-7b-oft-finetuned-libero-spatial \
  --task_suite_name libero_spatial \
  --center_crop True \
  --num_trials_per_task 50 \
  --seed 7
```

Core concepts


What OpenVLA-OFT changes: Standard OpenVLA tokenizes continuous actions into discrete bins, losing precision. OFT replaces this with dedicated continuous action heads (L1 regression or diffusion) while keeping the VLA backbone frozen and adapting via LoRA.

OFT vs OFT+ variants:

| Variant | FiLM | Images | Typical use |
|---------|------|--------|-------------|
| OFT | Off | 2 (front + wrist) | LIBERO simulation |
| OFT+ | On | 3 (high + left + right wrist) | ALOHA real-world |

Key architecture choices:
  • LoRA adaptation: Rank-32 LoRA on the VLA backbone (no full fine-tuning needed)
  • Continuous actions: L1 regression head (default) or diffusion head
  • FiLM conditioning: Feature-wise Linear Modulation for stronger language grounding in OFT+
  • Multi-image input: Configurable 2 or 3 camera streams via `num_images_in_input`
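The variant differences above boil down to a few launch flags. A minimal sketch, with keys mirroring the `finetune.py` CLI flags used in the workflows below (the dicts themselves are illustrative, not an official config API):

```python
# Illustrative flag sets for the two variants. Keys mirror the
# finetune.py CLI flags (--use_film, --num_images_in_input, ...)
# shown later in this document; this is a sketch, not a real config API.
OFT = {"use_film": False, "num_images_in_input": 2,
       "use_l1_regression": True, "lora_rank": 32}
OFT_PLUS = {"use_film": True, "num_images_in_input": 3,
            "use_l1_regression": True, "lora_rank": 32}

def variant_name(cfg: dict) -> str:
    """Classify a config as OFT or OFT+ by its FiLM flag."""
    return "OFT+" if cfg["use_film"] else "OFT"

print(variant_name(OFT), variant_name(OFT_PLUS))  # OFT OFT+
```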

Compute requirements

| Task | GPU | VRAM | Notes |
|------|-----|------|-------|
| LIBERO evaluation | 1x A100/A40 | ~16 GB | Single GPU |
| ALOHA evaluation | 1x A100/A40 | ~18 GB | Single GPU |
| LIBERO fine-tuning | 8x A100 | ~27 GB/GPU | Paper default |
| ALOHA fine-tuning (OFT+) | 8x A100 | ~35 GB/GPU | FiLM + 3 images |
| LoRA merge | 1x any GPU | ~16 GB | One-time step |

Expected performance benchmarks

Official results (paper setup, seed=7, 50 trials per task):

| Task suite | Task-specific | Combined policy | Notes |
|------------|---------------|-----------------|-------|
| LIBERO-Spatial | 97.2% | 96.8% | Easiest suite |
| LIBERO-Object | 97.4% | 97.0% | Object manipulation |
| LIBERO-Goal | 95.8% | 95.4% | May peak at 50k-100k steps |
| LIBERO-10 | 98.0% | 98.0% | Long-horizon tasks |
| Average | 97.1% | 96.8% | Near-equivalent |

Reproduction notes: results are tied to Python 3.10.14, PyTorch 2.2.0, NVIDIA A100 GPUs, and a custom Transformers fork.
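The reported averages follow directly from the per-suite numbers; a quick arithmetic check using the table values above:

```python
# Per-suite success rates from the benchmark table (percent),
# ordered Spatial, Object, Goal, 10.
task_specific = [97.2, 97.4, 95.8, 98.0]
combined      = [96.8, 97.0, 95.4, 98.0]

avg_specific = sum(task_specific) / len(task_specific)
avg_combined = sum(combined) / len(combined)
print(f"{avg_specific:.1f} {avg_combined:.1f}")  # 97.1 96.8
```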

When to use vs alternatives

Use OpenVLA-OFT when:
  • The target task is robot action generation with visual and language conditioning
  • LoRA-based adaptation of `openvla/openvla-7b` is preferred
  • You need the official LIBERO or ALOHA workflows from the OpenVLA-OFT paper
  • You want continuous action heads (L1 regression or diffusion) instead of tokenized actions

Use alternatives when:
  • You need a different VLA architecture (use `fine-tuning-serving-openpi` for pi0/pi0.5 models)
  • You need the NVIDIA Cosmos Policy stack (use `evaluating-cosmos-policy`)
  • You need general LLM fine-tuning without robot action heads

Workflow 1: Set up environment


Copy this checklist and track progress:

```text
Setup Progress:
- [ ] Step 1: Create conda env and install PyTorch
- [ ] Step 2: Install openvla-oft package in editable mode
- [ ] Step 3: Install FlashAttention2
- [ ] Step 4: Verify critical versions
```

**Step 1: Create conda env and clone repo**

```bash
conda create -n openvla-oft python=3.10 -y
conda activate openvla-oft
git clone https://github.com/moojink/openvla-oft.git
cd openvla-oft
pip3 install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0
pip3 install robosuite==1.4.0
```

**Step 2: Install package**

```bash
pip install -e .
```

**Step 3: Install FlashAttention2**

```bash
pip install packaging ninja
pip install "flash-attn==2.5.5" --no-build-isolation
```

**Step 4: Verify versions**

```python
import torch, transformers, peft
print(f"PyTorch: {torch.__version__}")         # Expected: 2.2.0
print(f"Transformers: {transformers.__version__}")
print(f"PEFT: {peft.__version__}")             # Expected: 0.11.1
```

Workflow 2: Evaluate pretrained checkpoints on LIBERO


```text
LIBERO Eval Progress:
- [ ] Step 1: Install LIBERO dependencies
- [ ] Step 2: Choose checkpoint and task suite
- [ ] Step 3: Run evaluation
- [ ] Step 4: Parse and validate results
```

**Step 1: Install LIBERO**

```bash
git clone https://github.com/Lifelong-Robot-Learning/LIBERO.git
pip install -e LIBERO
pip install -r experiments/robot/libero/libero_requirements.txt
```

**Step 2: Choose checkpoint**

| Checkpoint | Task suite |
|------------|------------|
| moojink/openvla-7b-oft-finetuned-libero-spatial | libero_spatial |
| moojink/openvla-7b-oft-finetuned-libero-object | libero_object |
| moojink/openvla-7b-oft-finetuned-libero-goal | libero_goal |
| moojink/openvla-7b-oft-finetuned-libero-10 | libero_10 |
| moojink/openvla-7b-oft-finetuned-libero-spatial-object-goal-10 | Combined |

**Step 3: Run evaluation**

```bash
python experiments/robot/libero/run_libero_eval.py \
  --pretrained_checkpoint moojink/openvla-7b-oft-finetuned-libero-spatial \
  --task_suite_name libero_spatial \
  --center_crop True \
  --num_trials_per_task 50 \
  --seed 7
```

**Step 4: Parse results**

```python
import re

def parse_libero_log(log_path):
    """Extract per-task success rates from a LIBERO eval log."""
    with open(log_path) as f:
        content = f.read()
    matches = re.findall(r"Task (.+?): (\d+)/(\d+) successes", content)
    for task, successes, trials in matches:
        rate = int(successes) / int(trials)
        print(f"  {task}: {rate:.0%} ({successes}/{trials})")

parse_libero_log("experiments/logs/latest.log")
```
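To compare a finished run against the benchmark table, the same regex can be aggregated into one overall rate. A sketch; the `Task X: a/b successes` line format is assumed to match the parser above:

```python
import re

def overall_success_rate(log_text: str) -> float:
    """Aggregate 'Task X: a/b successes' lines into one overall rate."""
    matches = re.findall(r"Task (.+?): (\d+)/(\d+) successes", log_text)
    total_successes = sum(int(s) for _, s, _ in matches)
    total_trials = sum(int(t) for _, _, t in matches)
    return total_successes / total_trials

# Synthetic example log covering two hypothetical tasks:
sample = "Task pick_cup: 49/50 successes\nTask open_drawer: 48/50 successes\n"
print(f"{overall_success_rate(sample):.1%}")  # 97.0%
```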

Workflow 3: Fine-tune on LIBERO


Detailed reference: See references/libero-workflow.md for the full LIBERO setup, checkpoint selection strategy, and LoRA merge instructions.

```text
LIBERO Fine-Tune Progress:
- [ ] Step 1: Prepare RLDS dataset
- [ ] Step 2: Launch torchrun with OFT defaults
- [ ] Step 3: Evaluate intermediate and final checkpoints
- [ ] Step 4: Merge LoRA for deployment if needed
```

**Step 1: Dataset**

Use these RLDS datasets:
  • libero_spatial_no_noops
  • libero_object_no_noops
  • libero_goal_no_noops
  • libero_10_no_noops

**Step 2: Launch training**

```bash
torchrun --standalone --nnodes 1 --nproc-per-node 8 vla-scripts/finetune.py \
  --vla_path openvla/openvla-7b \
  --data_root_dir /PATH/TO/RLDS/DATASETS/ \
  --dataset_name libero_spatial_no_noops \
  --run_root_dir /YOUR/CHECKPOINTS/ \
  --use_l1_regression True \
  --use_diffusion False \
  --use_film False \
  --num_images_in_input 2 \
  --use_proprio True \
  --batch_size 8 \
  --learning_rate 5e-4 \
  --num_steps_before_decay 100000 \
  --max_steps 150005 \
  --save_freq 10000 \
  --save_latest_checkpoint_only False \
  --image_aug True \
  --lora_rank 32 \
  --wandb_entity YOUR_WANDB_ENTITY \
  --wandb_project YOUR_WANDB_PROJECT
```

**Step 3: Evaluate checkpoints**

Evaluate the 50k, 100k, and 150k checkpoints; LIBERO-Goal may peak earlier than the other suites. Keep the best checkpoint per suite by actual task success rate, not training loss alone.

**Step 4: Merge LoRA**

```bash
python vla-scripts/merge_lora_weights_and_save.py \
  --base_checkpoint openvla/openvla-7b \
  --lora_finetuned_checkpoint_dir /PATH/TO/CHECKPOINT_DIR
```
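Checkpoint selection in Step 3 can be sketched as keeping the step with the highest measured success rate per suite. The `results` dict below is hypothetical; fill it from your own eval logs:

```python
# Hypothetical per-checkpoint eval results: success rate per saved step.
results = {
    "libero_goal":    {50_000: 0.958, 100_000: 0.951, 150_000: 0.940},
    "libero_spatial": {50_000: 0.955, 100_000: 0.968, 150_000: 0.972},
}

def best_checkpoint(per_step: dict) -> tuple:
    """Return (step, success_rate) with the highest success rate."""
    step = max(per_step, key=per_step.get)
    return step, per_step[step]

for suite, per_step in results.items():
    step, rate = best_checkpoint(per_step)
    print(f"{suite}: keep step {step} ({rate:.1%})")
```

Note how the illustrative numbers have LIBERO-Goal peaking at 50k while LIBERO-Spatial keeps improving to 150k, matching the advice above.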


Workflow 4: Train and evaluate OpenVLA-OFT+ on ALOHA


Detailed reference: See references/aloha-workflow.md for the full ALOHA server-client setup, data preprocessing, dataset registration, and troubleshooting.

```text
ALOHA Progress:
- [ ] Step 1: Preprocess raw ALOHA demonstrations
- [ ] Step 2: Convert to RLDS and register dataset configs
- [ ] Step 3: Fine-tune OFT+ with FiLM and 3 images
- [ ] Step 4: Start VLA server on GPU machine
- [ ] Step 5: Run client-side robot evaluation
```

**Step 1: Preprocess raw data**

```bash
python experiments/robot/aloha/preprocess_split_aloha_data.py \
  --dataset_path /path/to/aloha_raw/task_name/ \
  --out_base_dir /path/to/aloha_preprocessed/ \
  --percent_val 0.05
```

**Step 2: Register RLDS dataset**

Add entries in:
  • prismatic/vla/datasets/rlds/oxe/configs.py
  • prismatic/vla/datasets/rlds/oxe/transforms.py
  • prismatic/vla/datasets/rlds/oxe/mixtures.py

Set the ALOHA constants in prismatic/vla/constants.py (expected defaults below).

Expected defaults for ALOHA:

```python
NUM_ACTIONS_CHUNK = 25  # Match control frequency (25 Hz)
ACTION_DIM = 14         # 7 joints x 2 arms
PROPRIO_DIM = 14
ACTION_PROPRIO_NORMALIZATION_TYPE = "BOUNDS"  # Absolute joint angles
```
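These constants are coupled: 25 actions at the 25 Hz control rate means each open-loop chunk spans one second, and the 14-dim action and proprio vectors correspond to two 7-joint arms. A small consistency sketch (values copied from the constants above; `CONTROL_FREQ_HZ` is an assumed name for the 25 Hz rate):

```python
# Sanity-check the relationships between the ALOHA constants above.
NUM_ACTIONS_CHUNK = 25
CONTROL_FREQ_HZ = 25       # assumed ALOHA control frequency (see comment above)
ACTION_DIM = 14
PROPRIO_DIM = 14
JOINTS_PER_ARM, NUM_ARMS = 7, 2

chunk_seconds = NUM_ACTIONS_CHUNK / CONTROL_FREQ_HZ
assert chunk_seconds == 1.0, "one open-loop chunk should span 1 s of control"
assert ACTION_DIM == JOINTS_PER_ARM * NUM_ARMS == PROPRIO_DIM
print(f"each open-loop chunk covers {chunk_seconds:.1f} s")
```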

**Step 3: Fine-tune OFT+**

```bash
torchrun --standalone --nnodes 1 --nproc-per-node 8 vla-scripts/finetune.py \
  --vla_path openvla/openvla-7b \
  --data_root_dir /PATH/TO/RLDS/DATASETS/ \
  --dataset_name aloha_task_name \
  --run_root_dir /YOUR/CHECKPOINTS/ \
  --use_l1_regression True \
  --use_diffusion False \
  --use_film True \
  --num_images_in_input 3 \
  --use_proprio True \
  --batch_size 4 \
  --learning_rate 5e-4 \
  --num_steps_before_decay 50000 \
  --max_steps 100005 \
  --use_val_set True \
  --val_freq 10000 \
  --save_freq 10000 \
  --lora_rank 32
```

**Step 4: Start VLA server (GPU machine)**

```bash
python vla-scripts/deploy.py \
  --pretrained_checkpoint /PATH/TO/FINETUNED/CHECKPOINT/ \
  --use_l1_regression True \
  --use_film True \
  --num_images_in_input 3 \
  --use_proprio True \
  --center_crop True \
  --unnorm_key aloha_task_name
```

The server listens on http://<server-ip>:8777/act.

**Step 5: Run client evaluation**

```bash
python experiments/robot/aloha/run_aloha_eval.py \
  --center_crop True \
  --num_open_loop_steps 25 \
  --use_vla_server True \
  --vla_server_url http://<SERVER_IP>:8777 \
  --num_rollouts_planned 50 \
  --max_steps 1500
```


Critical invariants


These flags must be consistent between training and inference; mismatches cause silent failures:

| Area | Required consistency | Failure if mismatched |
|------|----------------------|-----------------------|
| Action head | `use_l1_regression` vs `use_diffusion` | Wrong head loading, invalid actions |
| FiLM | `use_film` across train/eval/deploy | Reduced language grounding |
| Image streams | `num_images_in_input` parity | Shape mismatch or performance drop |
| Proprio | `use_proprio` parity | State conditioning mismatch |
| LoRA rank | `lora_rank` parity | Adapter loading errors |
| Crop | `image_aug=True` in train → `center_crop=True` in eval | Significant success-rate drop |
| Action chunk | `num_open_loop_steps` = `NUM_ACTIONS_CHUNK` | Latency/success tradeoff shifts |
| Unnorm key | `unnorm_key` present in checkpoint stats | Bad action scale |

Quick validation:

```python
# Verify config parity before long eval runs
train_flags = {"use_film": False, "num_images": 2, "use_proprio": True, "lora_rank": 32}
eval_flags = {"use_film": False, "num_images": 2, "use_proprio": True, "lora_rank": 32}
for k in train_flags:
    assert train_flags[k] == eval_flags[k], f"Mismatch: {k}: {train_flags[k]} vs {eval_flags[k]}"
print("All flags consistent")
```


---

Common issues


**Issue: Action quality drops after moving checkpoints across GPU types**
Fix: re-merge the LoRA adapter on the downstream device:

```bash
python vla-scripts/merge_lora_weights_and_save.py \
  --base_checkpoint openvla/openvla-7b \
  --lora_finetuned_checkpoint_dir /PATH/TO/CHECKPOINT_DIR
```

**Issue: Wrong action scale or failed un-normalization**
Fix: check that `--unnorm_key` matches the dataset statistics in the checkpoint:

```python
import torch
ckpt = torch.load("checkpoint/model.pt", map_location="cpu")
print("Available norm keys:", list(ckpt.get("norm_stats", {}).keys()))
```

**Issue: Eval success unexpectedly low**
Fix: verify all invariants in the table above. The most common culprit is missing `center_crop=True` at eval after training with `image_aug=True`.

**Issue: LIBERO eval crashes with `EOFError` asking for a dataset path**
Fix: set `LIBERO_CONFIG_PATH` and write a non-interactive config before headless eval.

**Issue: ALOHA client ROS import fails with `libffi` symbol errors**
Fix: `conda install -c conda-forge libffi`

**Issue: `flash-attn` install fails**
Fix: export `TMPDIR` and `PIP_CACHE_DIR` to the same filesystem, then retry with `--no-cache-dir`.

**Issue: EGL teardown logs show `EGL_NOT_INITIALIZED`**
Fix: treat this as teardown noise unless the exit code is non-zero. Set the EGL env vars:

```bash
export MUJOCO_GL=egl PYOPENGL_PLATFORM=egl
export CUDA_VISIBLE_DEVICES=0 MUJOCO_EGL_DEVICE_ID=0
```
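For the LIBERO `EOFError` fix above, the non-interactive config can be written once before headless runs. A sketch, assuming LIBERO resolves `LIBERO_CONFIG_PATH` as the directory holding `config.yaml`; the YAML keys are assumptions modeled on LIBERO's default config, and the `/path/to/LIBERO` placeholders must point at your clone:

```shell
# Pre-create a non-interactive LIBERO config so headless eval never
# prompts for dataset paths. Keys and layout are assumptions; check
# them against the config your LIBERO clone writes on first run.
export LIBERO_CONFIG_PATH="${HOME}/.libero"
mkdir -p "$LIBERO_CONFIG_PATH"
cat > "$LIBERO_CONFIG_PATH/config.yaml" <<'EOF'
benchmark_root: /path/to/LIBERO/libero/libero
bddl_files: /path/to/LIBERO/libero/libero/bddl_files
init_states: /path/to/LIBERO/libero/libero/init_files
datasets: /path/to/LIBERO/datasets
assets: /path/to/LIBERO/libero/libero/assets
EOF
```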


For HPC/cluster users


On Slurm clusters, route caches to scratch to avoid filling your /home quota:

```bash
export HF_HOME=/scratch/$USER/.cache/huggingface
export XDG_CACHE_HOME=/scratch/$USER/.cache
export PIP_CACHE_DIR=/scratch/$USER/.cache/pip
export TMPDIR=/scratch/$USER/tmp
```

Avoid stacking the cluster's Python modules on top of conda. Typically `module load cuda` is sufficient.


Advanced topics


  • Paper summary and checkpoints: references/paper-and-checkpoints.md
  • Detailed LIBERO workflow: references/libero-workflow.md
  • Detailed ALOHA workflow: references/aloha-workflow.md
  • Config map and troubleshooting matrix: references/config-troubleshooting.md

Resources
