Per-Group Error, Not Total MSE: Fine-Tuning Vision-Language-Action Models for 11-DoF Mobile Manipulation

📅 2026-05-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limitation of aggregate mean squared error (MSE) as a checkpoint-selection signal when fine-tuning vision-language-action (VLA) models for mobile manipulators with heterogeneous joint spaces, such as the 11-DoF Toyota HSR. Instead of relying on aggregate MSE, the authors propose a per-joint-group error analysis, separating errors by arm, gripper, head, and base, to guide model selection. They fine-tune SmolVLA and compare it with π₀.₅, including an expert-only fine-tuned variant, across 60 real-robot trials (20 per model). The π₀.₅ 80k baseline scores significantly higher than both fine-tuned variants (4.0/4 vs. 3.75/4 for expert-only 3k and 3.5/4 for HSR-SmolVLA; Mann-Whitney p ≤ 0.010), even though expert-only 3k has the lowest total MSE. This ranking is most consistent with offline arm-group error rather than total MSE or base-group error, which supports per-group error as the more reliable signal in heterogeneous action spaces.

📝 Abstract

Fine-tuning Vision-Language-Action (VLA) models for mobile manipulators with heterogeneous joint spaces can produce a counterintuitive result: the checkpoint with the lowest aggregate MSE is not the one that performs best on the real robot. We argue this is a predictable consequence of collapsing heterogeneous joint groups (arm, gripper, head, wheeled base) into a single metric, where easy-to-predict joints can mask joints that still fail. We fine-tune SmolVLA (450M, action-expert only) on the 11-DoF Toyota HSR and compare it against $π_{0.5}$ (3.3B), a stronger pretrained baseline. Per-group analysis exposes two patterns: in SmolVLA, the mobile base converges slowest and limits overall performance. In expert-only fine-tuning of $π_{0.5}$ (training only the action head, backbone frozen), total MSE drops below the baseline but arm accuracy degrades. On 60 real-robot trials (20 per model), $π_{0.5}$ 80k (4.0/4) significantly outperforms both fine-tuned variants (expert-only 3k: 3.75/4; HSR-SmolVLA: 3.5/4; Mann-Whitney $p \leq 0.010$), despite expert-only 3k having the lowest total MSE. This separation is most consistent with the offline arm-group error, not total MSE or base-group error. We conclude that per-group error is a more reliable signal than total MSE for checkpoint selection on robots with heterogeneous action spaces. Code: https://github.com/paumontagut/per-group-mse-vla
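
To make the contrast between total MSE and per-group error concrete, the following is a minimal sketch of how the two could be computed side by side. It is not taken from the paper's repository: the index layout in `JOINT_GROUPS` (5 arm, 1 gripper, 2 head, 3 base) and the helper names `per_group_mse` and `select_checkpoint` are assumptions made for illustration only.

```python
from statistics import fmean

# ASSUMPTION: hypothetical index layout for an 11-DoF action vector.
# The real HSR joint ordering used by the authors may differ.
JOINT_GROUPS = {
    "arm": [0, 1, 2, 3, 4],
    "gripper": [5],
    "head": [6, 7],
    "base": [8, 9, 10],
}


def per_group_mse(pred, target, groups=JOINT_GROUPS):
    """Return total MSE plus one MSE per joint group.

    pred and target are equally sized lists of action vectors
    (one list of floats per timestep).
    """
    if not pred or len(pred) != len(target):
        raise ValueError("pred and target must be non-empty and the same length")

    # Squared error for every joint at every timestep.
    sq_err = [
        [(p - t) ** 2 for p, t in zip(p_row, t_row)]
        for p_row, t_row in zip(pred, target)
    ]

    # Aggregate metric: mean over all joints and timesteps.
    report = {"total": fmean(e for row in sq_err for e in row)}

    # Group metric: mean restricted to the joints of each group.
    for name, idx in groups.items():
        report[name] = fmean(row[i] for row in sq_err for i in idx)
    return report


def select_checkpoint(reports, group="arm"):
    """Pick the checkpoint name with the lowest error on the chosen group.

    reports maps checkpoint name -> dict returned by per_group_mse.
    Ranking by the arm group is an illustrative default, not a rule
    stated by the authors.
    """
    return min(reports, key=lambda name: reports[name][group])
```

Calling `select_checkpoint(reports, group="total")` and `select_checkpoint(reports, group="arm")` on the same set of reports shows whether the two criteria disagree, which is the situation the abstract describes: a checkpoint can win on total MSE because many easy joints are predicted well while its arm error is worse.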

Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action models

mobile manipulation

heterogeneous joint spaces

checkpoint selection

per-group error

Innovation

Methods, ideas, or system contributions that make the work stand out.

Per-Group Error

Vision-Language-Action Models

Mobile Manipulation