🤖 AI Summary
Existing vision-language-action (VLA) policies generalize poorly to out-of-distribution language instructions, primarily because language data is far less diverse than visual and action data, which leads models to rely on visual shortcuts at the expense of robust language understanding. To address this, this work proposes APT, a two-stage training framework: first, a language-agnostic vision-action prior is pretrained atop a frozen vision-language model; second, language conditioning is incorporated via a gated fusion mechanism that preserves the learned visuomotor prior. APT introduces, for the first time in continuous-action expert policies, a Bayesian-inspired pretraining paradigm that decomposes the policy into a language-agnostic prior and a language-conditioned likelihood, mitigating the language data imbalance. The approach improves generalization to unseen instructions and compositional tasks and is compatible with mainstream VLA architectures such as π and GR00T.
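One way to read this decomposition is through Bayes' rule. The notation below is an assumption for illustration ($a$ for the action, $v$ for the visual observation, $\ell$ for the language instruction) and may differ from the paper's own formulation:

$$
p(a \mid v, \ell) = \frac{p(\ell \mid v, a)\, p(a \mid v)}{p(\ell \mid v)} \propto \underbrace{p(a \mid v)}_{\text{language-agnostic prior}} \cdot \underbrace{p(\ell \mid v, a)}_{\text{language-conditioned likelihood}}
$$

Under this reading, the first stage would fit $p(a \mid v)$ without any language input, and the second stage would add the language-dependent term on top of it.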
📝 Abstract
Vision-Language-Action (VLA) models that couple pretrained Vision-Language Models (VLMs) with continuous action experts have achieved strong manipulation performance, yet generalization to out-of-distribution (OOD) language instructions remains poor. A known challenge is the structural imbalance in VLA data, where language is far less diverse than visual and action content, making policies prone to visual shortcuts. While discrete-action methods mitigate this through vision-language co-training, continuous action experts lack such protection: they start from random initialization and learn entirely from imbalanced data, producing noisy gradients that corrupt the VLM and fail to exploit its language capability. We address this from a Bayesian perspective, factorizing the policy into a language-agnostic Vision-Action (VA) prior and a language-conditioned VLA likelihood, and propose APT, a two-stage training method emphasizing Action expert PreTraining. In Stage 1, the action expert is pretrained as a VA prior on vision-action pairs atop a frozen VLM, bypassing the language imbalance. In Stage 2, language tokens are injected through a gated fusion mechanism that integrates VLM features while preserving the learned visuomotor prior. APT applies to mainstream VLA architectures, including $\pi$-style and GR00T-style architectures. Comprehensive experiments show that APT achieves consistent gains on unseen instructions and compositional tasks. Project Page: https://xukechun.github.io/papers/APT/
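To make the Stage 2 idea concrete, here is a minimal sketch of one possible gated fusion, not the method's actual implementation. The abstract does not specify the gate's form, so the scalar tanh gate, the function name `gated_fusion`, and the toy feature vectors are all assumptions made for illustration. The point it shows is that a gate starting at zero leaves the Stage 1 vision-action features unchanged, so language can be introduced gradually.

```python
import math


def gated_fusion(action_features, language_features, gate_param):
    """Add language features to action-expert features through a scalar gate.

    Hypothetical sketch: a tanh gate with one scalar parameter is assumed;
    the source does not describe the gate's actual form.
    """
    if len(action_features) != len(language_features):
        raise ValueError("feature vectors must have the same length")
    gate = math.tanh(gate_param)  # gate_param = 0.0 gives gate = 0.0
    return [a + gate * l for a, l in zip(action_features, language_features)]


# Toy stand-ins (assumed values, not real model outputs).
va_features = [0.5, -1.0, 2.0]    # Stage 1 vision-action features
lang_features = [1.0, 1.0, -1.0]  # projected VLM language features

# Gate parameter at zero: the output equals the vision-action features.
closed = gated_fusion(va_features, lang_features, gate_param=0.0)

# Nonzero gate parameter: language features shift the output.
opened = gated_fusion(va_features, lang_features, gate_param=1.0)
```

In a trainable version, `gate_param` would be a learned parameter initialized at zero, a common choice when adding a new conditioning pathway to a pretrained module; whether APT uses this initialization is not stated in the abstract.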