🤖 AI Summary
This work addresses the limited generalization of current vision-language-action (VLA) models to novel instructions or multi-task settings, which stems from a collapse in the conditional mutual information between language instructions and actions: instructions are effectively redundant in goal-driven training data, since actions can often be predicted from visual inputs alone. To tackle this "information collapse," the paper formally characterizes the problem and introduces a Bayesian decomposition–based dual-branch architecture. This framework employs learnable latent action queries to separately model a vision-driven prior and a language-conditioned posterior, while explicitly optimizing the pointwise mutual information (PMI) between actions and instructions to enforce instruction adherence. Evaluated on the SimplerEnv and RoboCasa benchmarks, the method achieves significant generalization gains without additional data, improving out-of-distribution accuracy by 11.3% on SimplerEnv.
📝 Abstract
Vision-Language-Action (VLA) models have shown promise in robot manipulation but often struggle to generalize to new instructions or complex multi-task scenarios. We identify a critical pathology in current training paradigms where goal-driven data collection creates a dataset bias. In such datasets, language instructions are highly predictable from visual observations alone, causing the conditional mutual information between instructions and actions to vanish, a phenomenon we term Information Collapse. Consequently, models degenerate into vision-only policies that ignore language constraints and fail in out-of-distribution (OOD) settings. To address this, we propose LangForce, a novel framework that enforces instruction following via Bayesian decomposition. By introducing learnable Latent Action Queries, we construct a dual-branch architecture to estimate both a vision-only prior $p(a \mid v)$ and a language-conditioned posterior $\pi(a \mid v, \ell)$. We then optimize the policy to maximize the conditional Pointwise Mutual Information (PMI) between actions and instructions. This objective effectively penalizes the vision shortcut and rewards actions that explicitly explain the language command. Without requiring new data, LangForce significantly improves generalization. Extensive experiments on SimplerEnv and RoboCasa demonstrate substantial gains, including an 11.3% improvement on the challenging OOD SimplerEnv benchmark, validating the ability of our approach to robustly ground language in action.
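The core of the PMI objective described above can be sketched numerically. The following is a minimal, hypothetical illustration (the function name `pmi_loss` and the toy probabilities are assumptions, not the paper's implementation, and the actual training loss likely combines this term with a standard action-prediction loss): the conditional PMI is the log-ratio of the language-conditioned posterior to the vision-only prior, and minimizing its negative rewards actions that the instruction explains better than vision alone.

```python
import math

def pmi_loss(log_posterior: float, log_prior: float) -> float:
    """Negative conditional PMI for one action.

    PMI(a; l | v) = log pi(a | v, l) - log p(a | v)

    Minimizing the negative PMI pushes the policy toward actions that
    the language instruction explains better than vision alone,
    penalizing the vision-only shortcut.
    """
    return -(log_posterior - log_prior)

# Toy example: the language-conditioned posterior assigns the action
# three times the probability the vision-only prior does.
log_post = math.log(0.6)   # log pi(a | v, l)
log_pri = math.log(0.2)    # log p(a | v)
loss = pmi_loss(log_post, log_pri)  # -(log 0.6 - log 0.2) = -log 3
```

When the posterior and prior agree (the instruction adds no information), the PMI is zero and this term contributes nothing; the loss only becomes negative when the action is genuinely better explained by the command.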