🤖 AI Summary
This work addresses a limitation of existing vision-language-action (VLA) models in continuous action prediction: they rely on conventional regression objectives such as mean squared error (MSE) and struggle to model the distribution of action errors, particularly under data imbalance, noise, or few-shot conditions. To overcome this, the study introduces the minimum error entropy (MEE) principle from information theory into modern VLA architectures for the first time, proposing a trajectory-level MEE objective along with weighted variants. These are jointly optimized with MSE to reshape the action error distribution without adding inference overhead. The approach improves task success rates across multiple simulated benchmarks, including LIBERO and SimplerEnv, as well as real-world robotic tasks, with notably enhanced robustness and generalization in challenging scenarios.
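To make the idea concrete: a standard instantiation of the MEE principle minimizes Rényi's quadratic entropy of the prediction errors, estimated with a Parzen (Gaussian-kernel) window over pairwise error differences. The sketch below is a minimal illustration of this classical formulation combined with MSE, not the paper's exact trajectory-level objective; the function names, the kernel bandwidth `sigma`, and the mixing weight `lam` are hypothetical choices for exposition.

```python
import numpy as np

def gaussian_kernel(x, sigma):
    """Gaussian (Parzen-window) kernel evaluated elementwise."""
    return np.exp(-x ** 2 / (2.0 * sigma ** 2))

def mee_loss(pred, target, sigma=1.0):
    """Classical MEE surrogate: negative log of the information potential.

    The information potential V = mean_{i,j} G_sigma(e_i - e_j) is large when
    the errors e_i are tightly clustered; minimizing -log(V) therefore
    concentrates the error distribution rather than penalizing each error
    pointwise, as MSE does.
    """
    e = (np.asarray(pred) - np.asarray(target)).ravel()
    diffs = e[:, None] - e[None, :]          # all pairwise error differences
    V = gaussian_kernel(diffs, sigma).mean()  # Parzen estimate of the potential
    return -np.log(V)

def combined_loss(pred, target, lam=0.1, sigma=1.0):
    """MSE anchors predictions pointwise; the MEE term shapes the error
    distribution. Errors with a constant offset have zero MEE loss, so the
    MSE term is needed to pin the errors near zero."""
    mse = np.mean((np.asarray(pred) - np.asarray(target)) ** 2)
    return mse + lam * mee_loss(pred, target, sigma)
```

Note the complementarity this illustrates: a constant prediction offset leaves the MEE term at zero (all errors identical, hence perfectly concentrated), which is exactly why the paper pairs the entropy-based objective with MSE rather than replacing it.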
📝 Abstract
In robotic manipulation, vision-language-action (VLA) models have emerged as a promising paradigm for learning generalizable and scalable robot policies. Most existing VLA frameworks rely on standard supervised objectives, typically cross-entropy for discrete actions and mean squared error (MSE) for continuous action regression, which impose strong pointwise constraints on individual predictions. In this work, we focus on continuous-action VLA models and move beyond conventional MSE-based regression by reshaping action error distributions during training. Drawing on information-theoretic principles, we introduce Minimum Error Entropy (MEE) into modern VLA architectures and propose a trajectory-level MEE objective, together with two weighted variants, combined with MSE for continuous-action VLA training. We evaluate our approaches across standard, few-shot, and noisy settings on multiple representative VLA architectures, using simulation benchmarks such as LIBERO and SimplerEnv as well as real-world robotic manipulation tasks. Experimental results demonstrate consistent improvements in success rates and robustness across these settings. Under imbalanced data regimes, the gains persist within a well-characterized operating range, while incurring negligible additional training cost and no impact on inference efficiency. We further provide theoretical analyses that explain why MEE-based supervision is effective and characterize its practical range. Project Page: https://cognition2actionlab.github.io/VLA-TMEE.github.io/