🤖 AI Summary
Fine-tuning vision-language-action (VLA) models for new robot embodiments and tasks is inefficient because the action distribution shifts between pre-training and adaptation data. This paper proposes Align-Then-stEer (ATE), a plug-and-play, data-efficient adaptation framework. ATE first builds a unified latent space to align heterogeneous action spaces: a variational autoencoder constrained by reverse KL divergence embeds adaptation actions into modes of the pre-training action latent distribution. A guidance mechanism then steers the generation process of diffusion- or flow-based VLAs during fine-tuning toward the target domain. Compared to direct fine-tuning of representative VLAs, ATE improves the average multi-task success rate by up to 9.8% in simulation and achieves a 32% success rate gain in a real-world cross-embodiment setting. The authors present it as a general, lightweight way to deploy VLA models on new robotic platforms and tasks.
📝 Abstract
Vision-Language-Action (VLA) models pre-trained on large, diverse datasets show remarkable potential for general-purpose robotic manipulation. However, a primary bottleneck remains in adapting these models to downstream tasks, especially when the robot's embodiment or the task itself differs from the pre-training data. This discrepancy leads to a significant mismatch in action distributions, demanding extensive data and compute for effective fine-tuning. To address this challenge, we introduce Align-Then-stEer (ATE), a novel, data-efficient, and plug-and-play adaptation framework. ATE first aligns disparate action spaces by constructing a unified latent space, where a variational autoencoder constrained by reverse KL divergence embeds adaptation actions into modes of the pre-training action latent distribution. Subsequently, it steers the diffusion- or flow-based VLA's generation process during fine-tuning via a guidance mechanism that pushes the model's output distribution towards the target domain. We conduct extensive experiments on cross-embodiment and cross-task manipulation in both simulation and the real world. Compared to direct fine-tuning of representative VLAs, our method improves the average multi-task success rate by up to 9.8% in simulation and achieves a striking 32% success rate gain in a real-world cross-embodiment setting. Our work presents a general and lightweight solution that greatly enhances the practicality of deploying VLA models to new robotic platforms and tasks.
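The abstract says the reverse KL constraint embeds adaptation actions "into modes" of the pre-training latent distribution. The toy sketch below illustrates only that general property of reverse KL and is not ATE's actual objective or architecture. Everything in it is an assumption made for illustration: a 1-D two-mode Gaussian mixture `p` stands in for a pre-training latent distribution, and two hand-picked Gaussians (`q_mode`, `q_cover`) stand in for candidate adaptation latents.

```python
import numpy as np

def log_gauss(x, mu, sigma):
    """Log-density of a 1-D Gaussian N(mu, sigma^2)."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))

def kl_on_grid(log_a, log_b, dx):
    """KL(a || b) = integral of a(x) * (log a(x) - log b(x)), as a Riemann sum."""
    return float(np.sum(np.exp(log_a) * (log_a - log_b)) * dx)

# Evaluation grid (hypothetical 1-D latent axis).
x = np.linspace(-15.0, 15.0, 30001)
dx = x[1] - x[0]

# Stand-in "pre-training" latent distribution: equal mixture of two modes.
log_p = np.log(0.5) + np.logaddexp(log_gauss(x, -3.0, 0.5),
                                   log_gauss(x, 3.0, 0.5))

# Candidate 1 sits on a single mode; candidate 2 is the moment-matched
# Gaussian (variance 0.5^2 + 3^2 = 9.25) that spreads over both modes.
log_q_mode = log_gauss(x, 3.0, 0.5)
log_q_cover = log_gauss(x, 0.0, np.sqrt(9.25))

# Reverse KL, KL(q || p): penalizes q for placing mass where p is small.
reverse_mode = kl_on_grid(log_q_mode, log_p, dx)
reverse_cover = kl_on_grid(log_q_cover, log_p, dx)

# Forward KL, KL(p || q): penalizes q for missing mass where p is large.
forward_mode = kl_on_grid(log_p, log_q_mode, dx)
forward_cover = kl_on_grid(log_p, log_q_cover, dx)

print(f"reverse KL: mode={reverse_mode:.3f}  cover={reverse_cover:.3f}")
print(f"forward KL: mode={forward_mode:.3f}  cover={forward_cover:.3f}")
```

In this toy setup the reverse KL is lower for the single-mode candidate (close to log 2, since `p` equals half of `q_mode` near that mode), while the forward KL is lower for the broad candidate. That asymmetry is the usual reason reverse KL is described as mode-seeking, which is consistent with how the abstract describes the embedding step.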