🤖 AI Summary
Current vision-language-action (VLA) models generalize poorly across environments, tasks, and robot embodiments, which hinders direct deployment in novel scenarios. To address this, we propose a lightweight one-shot adaptation framework: robot policies are modeled as linear combinations of learnable basis functions, enabling gradient-free skill inference from a single demonstration via L1-regularized convex optimization. A hybrid skill architecture is jointly pretrained on the multi-source Open X-Embodiment dataset to construct a structured, reusable skill space. Experiments show that our method achieves significantly lower action-prediction error than state-of-the-art VLA models across five unseen benchmarks. Moreover, it successfully executes tasks in both simulated and real-world robot settings where baseline VLA models fail entirely. This work advances practical robot policy adaptation by combining structured representation learning with efficient, optimization-based few-shot inference.
📄 Abstract
Vision-Language-Action (VLA) models trained on large robot datasets promise general-purpose, robust control across diverse domains and embodiments. However, existing approaches often fail out-of-the-box when deployed in novel environments, embodiments, or tasks. We introduce Mixture of Skills VLA (MoS-VLA), a framework that represents robot manipulation policies as linear combinations of a finite set of learned basis functions. During pretraining, MoS-VLA jointly learns these basis functions across datasets from the Open X-Embodiment project, producing a structured skill space. At test time, adapting to a new task requires only a single expert demonstration. The corresponding skill representation is then inferred via a lightweight convex optimization problem that minimizes the L1 action error, without requiring gradient updates. This gradient-free adaptation incurs minimal overhead while enabling rapid instantiation of new skills. Empirically, MoS-VLA achieves lower action-prediction error on five out of five unseen datasets and succeeds in both simulation and real-robot tasks where a pretrained VLA model fails outright. Project page: mos-vla.github.io/
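The test-time adaptation step described above, inferring a linear mixture of basis-function action predictions from one demonstration by minimizing L1 action error, can be sketched as a small linear program. This is a minimal illustration, not the paper's implementation: the function name `infer_skill_weights`, the array shapes, and the regularization weight `lam` are all assumptions made for the example.

```python
import numpy as np
from scipy.optimize import linprog

def infer_skill_weights(Phi, y, lam=1e-3):
    """Infer skill weights w minimizing ||Phi @ w - y||_1 + lam * ||w||_1.

    Cast as a linear program over x = [w, t, s]:
      minimize  sum(t) + lam * sum(s)
      s.t.     -t <= Phi @ w - y <= t   (t bounds the per-action residuals)
               -s <= w <= s             (s bounds |w|, the L1 penalty)

    Phi: (m, K) stacked action predictions of the K basis functions.
    y:   (m,)   flattened actions from the single expert demonstration.
    """
    m, K = Phi.shape
    I_m, I_K = np.eye(m), np.eye(K)
    Z_mK, Z_Km = np.zeros((m, K)), np.zeros((K, m))
    # Objective over the stacked variable x = [w (K), t (m), s (K)].
    c = np.concatenate([np.zeros(K), np.ones(m), lam * np.ones(K)])
    A_ub = np.block([
        [ Phi, -I_m,  Z_mK],   #  Phi w - t <= y
        [-Phi, -I_m,  Z_mK],   # -Phi w - t <= -y
        [ I_K,  Z_Km, -I_K],   #  w - s <= 0
        [-I_K,  Z_Km, -I_K],   # -w - s <= 0
    ])
    b_ub = np.concatenate([y, -y, np.zeros(2 * K)])
    bounds = [(None, None)] * K + [(0, None)] * (m + K)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:K]

# Synthetic check: 20 timesteps x 2 action dims, 5 basis skills.
rng = np.random.default_rng(0)
Phi = rng.standard_normal((40, 5))
w_true = np.array([1.5, 0.0, -0.7, 0.0, 0.0])  # sparse ground-truth mixture
y = Phi @ w_true                               # one noiseless demonstration
w_hat = infer_skill_weights(Phi, y)
```

Because the problem is a linear program, adaptation needs no gradient updates to the pretrained model; only the small weight vector `w` is solved for, which is what makes the per-task overhead minimal.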