🤖 AI Summary
This work addresses the limitation of using aggregate mean squared error (MSE) as a performance predictor when fine-tuning vision-language-action (VLA) models on mobile manipulators with heterogeneous joint spaces, such as the 11-DoF Toyota HSR. Instead of relying on aggregate MSE, the authors propose a per-joint-group error analysis (separating errors by arm, gripper, head, and base) to guide checkpoint selection. They compare a fine-tuned SmolVLA, an expert-only fine-tuned π₀.₅ (3k), and π₀.₅ 80k across 60 real-robot trials (20 per model). The π₀.₅ 80k variant scores significantly higher than both fine-tuned variants (4.0/4 vs. 3.75/4 and 3.5/4, Mann-Whitney p ≤ 0.010), even though the expert-only 3k checkpoint has the lowest total MSE. This ranking is most consistent with offline arm-group error rather than total MSE or base-group error, which supports per-group error as a more reliable selection signal in heterogeneous action spaces.
📝 Abstract
Fine-tuning Vision-Language-Action (VLA) models for mobile manipulators with heterogeneous joint spaces can produce a counterintuitive result: the checkpoint with the lowest aggregate MSE is not the one that performs best on the real robot. We argue this is a predictable consequence of collapsing heterogeneous joint groups (arm, gripper, head, wheeled base) into a single metric, where easy-to-predict joints can mask joints that still fail.
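The masking effect can be made explicit with a short identity. This is a sketch under one assumption not stated above: that total MSE is the unweighted mean of squared error over all $D$ action dimensions. The symbols $e_j$ (mean squared error of joint $j$ over the evaluation samples), $d_g$ (number of joints in group $g$), and $\mathrm{MSE}_g$ are introduced here for illustration only.

$$
\mathrm{MSE}_{\text{total}} = \frac{1}{D}\sum_{j=1}^{D} e_j = \sum_{g} \frac{d_g}{D}\,\mathrm{MSE}_g,
\qquad
\mathrm{MSE}_g = \frac{1}{d_g}\sum_{j \in g} e_j .
$$

Under that assumption, total MSE is a size-weighted average of the group errors, so a large reduction in one group can lower the total even while another group's error rises.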
We fine-tune SmolVLA (450M, action-expert only) on the 11-DoF Toyota HSR and compare it against $π_{0.5}$ (3.3B), a stronger pretrained baseline. Per-group analysis exposes two patterns. First, in SmolVLA, the mobile base converges slowest and limits overall performance. Second, in expert-only fine-tuning of $π_{0.5}$ (training only the action head, backbone frozen), total MSE drops below the baseline but arm accuracy degrades. On 60 real-robot trials (20 per model), $π_{0.5}$ 80k (4.0/4) significantly outperforms both fine-tuned variants (expert-only 3k: 3.75/4; HSR-SmolVLA: 3.5/4; Mann-Whitney $p \leq 0.010$), despite expert-only 3k having the lowest total MSE. This separation is most consistent with the offline arm-group error, not with total MSE or base-group error. We conclude that per-group error is a more reliable signal than total MSE for checkpoint selection on robots with heterogeneous action spaces. Code: https://github.com/paumontagut/per-group-mse-vla
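As a rough illustration of the per-group analysis described above, the following minimal sketch computes total and per-group MSE for two toy checkpoints. It is not taken from the linked repository. The index layout in `JOINT_GROUPS`, the function name `per_group_mse`, and all numeric values are hypothetical and chosen only to show how a checkpoint can have the lower total MSE while having the higher arm error.

```python
from statistics import fmean

# Hypothetical index layout for an 11-dimensional action vector.
# The real ordering on the robot is an assumption here, not a fact from the paper.
JOINT_GROUPS = {
    "arm": [0, 1, 2, 3, 4],
    "gripper": [5],
    "head": [6, 7],
    "base": [8, 9, 10],
}


def per_group_mse(pred, target, groups=JOINT_GROUPS):
    """Return (total_mse, {group: mse}) for two equal-length lists of action vectors."""
    if not pred or len(pred) != len(target):
        raise ValueError("pred and target must be non-empty and the same length")
    dim = len(pred[0])
    # Mean squared error of each joint, averaged over samples.
    per_joint = [
        fmean((p[j] - t[j]) ** 2 for p, t in zip(pred, target))
        for j in range(dim)
    ]
    total = fmean(per_joint)  # unweighted mean over all joints
    by_group = {
        name: fmean(per_joint[j] for j in idx) for name, idx in groups.items()
    }
    return total, by_group


def make_pred(arm_err, base_err, n_samples=2):
    """Toy predictions against an all-zero target: constant error per group."""
    row = [arm_err] * 5 + [0.0] * 3 + [base_err] * 3
    return [list(row) for _ in range(n_samples)]


target = [[0.0] * 11 for _ in range(2)]
# Toy checkpoint A: accurate base, poor arm. Toy checkpoint B: the reverse.
total_a, groups_a = per_group_mse(make_pred(arm_err=0.3, base_err=0.0), target)
total_b, groups_b = per_group_mse(make_pred(arm_err=0.1, base_err=0.4), target)

print(f"A: total={total_a:.4f} arm={groups_a['arm']:.4f} base={groups_a['base']:.4f}")
print(f"B: total={total_b:.4f} arm={groups_b['arm']:.4f} base={groups_b['base']:.4f}")
```

In this toy setup, checkpoint A would be picked by total MSE even though its arm error is the larger of the two, which is the kind of mismatch the per-group breakdown is meant to surface.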