🤖 AI Summary
Current vision-language-action (VLA) models generalize poorly across environments, tasks, and robot embodiments, which hinders direct deployment in novel scenarios. To address this, we propose a lightweight one-shot adaptation framework: robot policies are modeled as linear combinations of learnable basis functions, enabling gradient-free skill inference from a single demonstration via L1-regularized convex optimization. A hybrid skill architecture is jointly pretrained on the multi-source Open X-Embodiment dataset to construct a structured, reusable skill space. Experiments show that our method achieves significantly lower action-prediction error than state-of-the-art VLA models across five unseen benchmarks. Moreover, it successfully executes tasks in both simulated and real-world robot settings where baseline VLA models fail entirely. This work advances practical robot policy adaptation by combining structured representation learning with efficient, optimization-based few-shot inference.
📄 Abstract
Vision-Language-Action (VLA) models trained on large robot datasets promise general-purpose, robust control across diverse domains and embodiments. However, existing approaches often fail out-of-the-box when deployed in novel environments, embodiments, or tasks. We introduce Mixture of Skills VLA (MoS-VLA), a framework that represents robot manipulation policies as linear combinations of a finite set of learned basis functions. During pretraining, MoS-VLA jointly learns these basis functions across datasets from the Open X-Embodiment project, producing a structured skill space. At test time, adapting to a new task requires only a single expert demonstration. The corresponding skill representation is then inferred via a lightweight convex optimization problem that minimizes the L1 action error, without requiring gradient updates. This gradient-free adaptation incurs minimal overhead while enabling rapid instantiation of new skills. Empirically, MoS-VLA achieves lower action-prediction error on five out of five unseen datasets and succeeds in both simulation and real-robot tasks where a pretrained VLA model fails outright. Project page: mos-vla.github.io/
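The test-time adaptation step described above, inferring a linear mixture of basis-function action predictions from one demonstration by minimizing L1 action error, can be sketched as a small linear program. This is a minimal illustration, not the paper's implementation: the function name `infer_skill_weights`, the array shapes, and the regularization weight `lam` are all assumptions made for the example.

```python
import numpy as np
from scipy.optimize import linprog

def infer_skill_weights(Phi, y, lam=1e-3):
    """Infer skill weights w minimizing ||Phi @ w - y||_1 + lam * ||w||_1.

    Cast as a linear program over x = [w, t, s]:
      minimize  sum(t) + lam * sum(s)
      s.t.     -t <= Phi @ w - y <= t   (t bounds the per-action residuals)
               -s <= w <= s             (s bounds |w|, the L1 penalty)

    Phi: (m, K) stacked action predictions of the K basis functions.
    y:   (m,)   flattened actions from the single expert demonstration.
    """
    m, K = Phi.shape
    I_m, I_K = np.eye(m), np.eye(K)
    Z_mK, Z_Km = np.zeros((m, K)), np.zeros((K, m))
    # Objective over the stacked variable x = [w (K), t (m), s (K)].
    c = np.concatenate([np.zeros(K), np.ones(m), lam * np.ones(K)])
    A_ub = np.block([
        [ Phi, -I_m,  Z_mK],   #  Phi w - t <= y
        [-Phi, -I_m,  Z_mK],   # -Phi w - t <= -y
        [ I_K,  Z_Km, -I_K],   #  w - s <= 0
        [-I_K,  Z_Km, -I_K],   # -w - s <= 0
    ])
    b_ub = np.concatenate([y, -y, np.zeros(2 * K)])
    bounds = [(None, None)] * K + [(0, None)] * (m + K)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:K]

# Synthetic check: 20 timesteps x 2 action dims, 5 basis skills.
rng = np.random.default_rng(0)
Phi = rng.standard_normal((40, 5))
w_true = np.array([1.5, 0.0, -0.7, 0.0, 0.0])  # sparse ground-truth mixture
y = Phi @ w_true                               # one noiseless demonstration
w_hat = infer_skill_weights(Phi, y)
```

Because the problem is a linear program, adaptation needs no gradient updates to the pretrained model; only the small weight vector `w` is solved for, which is what makes the per-task overhead minimal.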