🤖 AI Summary
This work addresses a limitation of existing vision-language-action (VLA) models in continuous action prediction: they rely on conventional regression objectives such as mean squared error (MSE) and struggle to model the distribution of action errors, particularly under data imbalance, noise, or few-shot conditions. To overcome this, the study introduces the minimum error entropy (MEE) principle from information theory into modern VLA architectures for the first time, proposing a trajectory-level MEE objective along with weighted variants. These are jointly optimized with MSE to reshape the action error distribution without adding inference overhead. The approach improves task success rates across multiple simulated benchmarks, including LIBERO and SimplerEnv, as well as real-world robotic tasks, with notably enhanced robustness and generalization in challenging scenarios.
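To make the idea concrete: a standard instantiation of the MEE principle minimizes Rényi's quadratic entropy of the prediction errors, estimated with a Parzen (Gaussian-kernel) window over pairwise error differences. The sketch below is a minimal illustration of this classical formulation combined with MSE, not the paper's exact trajectory-level objective; the function names, the kernel bandwidth `sigma`, and the mixing weight `lam` are hypothetical choices for exposition.

```python
import numpy as np

def gaussian_kernel(x, sigma):
    """Gaussian (Parzen-window) kernel evaluated elementwise."""
    return np.exp(-x ** 2 / (2.0 * sigma ** 2))

def mee_loss(pred, target, sigma=1.0):
    """Classical MEE surrogate: negative log of the information potential.

    The information potential V = mean_{i,j} G_sigma(e_i - e_j) is large when
    the errors e_i are tightly clustered; minimizing -log(V) therefore
    concentrates the error distribution rather than penalizing each error
    pointwise, as MSE does.
    """
    e = (np.asarray(pred) - np.asarray(target)).ravel()
    diffs = e[:, None] - e[None, :]          # all pairwise error differences
    V = gaussian_kernel(diffs, sigma).mean()  # Parzen estimate of the potential
    return -np.log(V)

def combined_loss(pred, target, lam=0.1, sigma=1.0):
    """MSE anchors predictions pointwise; the MEE term shapes the error
    distribution. Errors with a constant offset have zero MEE loss, so the
    MSE term is needed to pin the errors near zero."""
    mse = np.mean((np.asarray(pred) - np.asarray(target)) ** 2)
    return mse + lam * mee_loss(pred, target, sigma)
```

Note the complementarity this illustrates: a constant prediction offset leaves the MEE term at zero (all errors identical, hence perfectly concentrated), which is exactly why the paper pairs the entropy-based objective with MSE rather than replacing it.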
📝 Abstract
In robotic manipulation, vision-language-action (VLA) models have emerged as a promising paradigm for learning generalizable and scalable robot policies. Most existing VLA frameworks rely on standard supervised objectives, typically cross-entropy for discrete actions and mean squared error (MSE) for continuous action regression, which impose strong pointwise constraints on individual predictions. In this work, we focus on continuous-action VLA models and move beyond conventional MSE-based regression by reshaping action error distributions during training. Drawing on information-theoretic principles, we introduce Minimum Error Entropy (MEE) into modern VLA architectures and propose a trajectory-level MEE objective, together with two weighted variants, combined with MSE for continuous-action VLA training. We evaluate our approaches across standard, few-shot, and noisy settings on multiple representative VLA architectures, using simulation benchmarks such as LIBERO and SimplerEnv as well as real-world robotic manipulation tasks. Experimental results demonstrate consistent improvements in success rates and robustness across these settings. Under imbalanced data regimes, the gains persist within a well-characterized operating range, while incurring negligible additional training cost and no impact on inference efficiency. We further provide theoretical analyses that explain why MEE-based supervision is effective and characterize its practical range. Project Page: https://cognition2actionlab.github.io/VLA-TMEE.github.io/