Two Bridges, One Pathway: From VLMs to Generalizable VLAs with Embodied Trajectory-Coupled Data

📅 2026-06-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the two-fold gap, in visual distribution and in training objective, that arises when transferring general-purpose vision-language models (VLMs) to robotic control. The authors propose a three-stage progressive training recipe that uses embodied trajectory-coupled (ETC) data as an intermediate bridge: Distribution Bridging, Objective Bridging, and Retentive Adaptation turn a VLM into a vision-language-action (VLA) policy step by step. By mixing task-relevant out-of-distribution ETC data with a small amount of action data, the method is reported to improve policy generalization and robustness in both simulation and real-robot experiments, extending to novel visual-language conditions without additional robot demonstrations.
📝 Abstract
Vision-language models (VLMs) are powerful general-purpose reasoners, yet converting them into robot control policies (VLAs) is surprisingly difficult. The root cause is a two-fold gap: VLMs are trained on internet-scale images with language-understanding objectives, while VLAs must perceive robot scenes and predict motor actions. Fine-tuning a VLM directly on robot action data forces the model to cross both gaps at once: the learning curve is steep, and the rich generalizations learned during pretraining tend to degrade rather than transfer. We argue that this gap can be bridged gradually with the right intermediate data. We introduce embodied trajectory-coupled (ETC) data: vision-language supervision derived from the same robot scenes and trajectories used for action learning. Because ETC data shares the visual context of robot operation while retaining familiar language-understanding objectives, it provides a natural stepping stone between VLM pretraining and VLA fine-tuning. Building on this, we design a three-stage training recipe. Distribution Bridging first adapts the VLM to embodied visual-language semantics. Objective Bridging then gradually shifts the model toward action prediction while preserving the acquired representations. Retentive Adaptation finally specializes the policy to the target deployment domain. We further show that mixing task-relevant out-of-distribution ETC data with a small amount of action data enables the model to generalize to novel visual-language conditions without requiring additional robot demonstrations. Simulation and real-robot experiments confirm that this gradual bridging strategy is the key to transferring VLM generalization into robust, deployable robot policies.
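The three-stage recipe described in the abstract can be pictured as a data-mixing curriculum. The sketch below is only an illustration under loudly stated assumptions: the stage lengths, the mixing fractions, the linear ramp, and all names (`Stage`, `etc_fraction`, `sample_source`, `run_schedule`) are hypothetical and do not come from the paper, which does not specify these details here. It shows one plausible way to start on ETC vision-language data only, shift gradually toward action data, and then hold a fixed mix during the final adaptation.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    steps: int
    etc_start: float  # share of ETC (vision-language) samples at the first step
    etc_end: float    # share of ETC samples at the last step


# Hypothetical schedule: all numbers are illustrative, not taken from the paper.
STAGES = [
    Stage("distribution_bridging", 1000, 1.0, 1.0),  # ETC data only
    Stage("objective_bridging", 1000, 1.0, 0.2),     # ramp toward action data
    Stage("retentive_adaptation", 500, 0.2, 0.2),    # fixed mix on target domain
]


def etc_fraction(stage: Stage, step: int) -> float:
    """Linearly interpolate the ETC share across the steps of one stage."""
    if stage.steps <= 1:
        return stage.etc_end
    t = step / (stage.steps - 1)
    return stage.etc_start + t * (stage.etc_end - stage.etc_start)


def sample_source(stage: Stage, step: int, rng: random.Random) -> str:
    """Pick which data source the next training sample is drawn from."""
    return "etc" if rng.random() < etc_fraction(stage, step) else "action"


def run_schedule(seed: int = 0) -> dict:
    """Count how many samples each stage draws from each source."""
    rng = random.Random(seed)
    counts = {}
    for stage in STAGES:
        per_stage = {"etc": 0, "action": 0}
        for step in range(stage.steps):
            per_stage[sample_source(stage, step, rng)] += 1
        counts[stage.name] = per_stage
    return counts


if __name__ == "__main__":
    for name, per_stage in run_schedule().items():
        print(name, per_stage)
```

In a real pipeline the returned source label would select between a vision-language loss on ETC samples and an action-prediction loss on demonstrations; the linear ramp is just one simple choice for a gradual shift.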
Problem

Research questions and friction points this paper is trying to address.

vision-language models
robot control policies
generalization gap
embodied perception
action prediction
Innovation

Methods, ideas, or system contributions that make the work stand out.

embodied trajectory-coupled data
vision-language models
vision-language-action policies
gradual bridging strategy
retentive adaptation