Align-Then-stEer: Adapting the Vision-Language Action Models through Unified Latent Guidance

📅 2025-09-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the inefficiency of fine-tuning vision-language-action (VLA) models across diverse robot embodiments and tasks, which stems from a shift in action distributions, this paper proposes a plug-and-play adaptation framework based on unified latent guidance. The method builds a shared latent space to align heterogeneous action distributions, using a variational autoencoder constrained by reverse KL divergence so that adaptation actions are embedded into modes of the pre-training action latent distribution. A guidance mechanism then steers the generation process of a diffusion- or flow-based VLA during fine-tuning toward the target domain. Compared to direct fine-tuning, the approach improves the average multi-task success rate by up to 9.8% in simulation and achieves a 32% success rate gain in a real-world cross-embodiment setting. It is presented as a data-efficient, lightweight way to adapt VLA models to new robotic platforms and tasks.

📝 Abstract

Vision-Language-Action (VLA) models pre-trained on large, diverse datasets show remarkable potential for general-purpose robotic manipulation. However, a primary bottleneck remains in adapting these models to downstream tasks, especially when the robot's embodiment or the task itself differs from the pre-training data. This discrepancy leads to a significant mismatch in action distributions, demanding extensive data and compute for effective fine-tuning. To address this challenge, we introduce Align-Then-stEer (ATE), a novel, data-efficient, and plug-and-play adaptation framework. ATE first aligns disparate action spaces by constructing a unified latent space, where a variational autoencoder constrained by reverse KL divergence embeds adaptation actions into modes of the pre-training action latent distribution. Subsequently, it steers the diffusion- or flow-based VLA's generation process during fine-tuning via a guidance mechanism that pushes the model's output distribution towards the target domain. We conduct extensive experiments on cross-embodiment and cross-task manipulation in both simulation and the real world. Compared to direct fine-tuning of representative VLAs, our method improves the average multi-task success rate by up to 9.8% in simulation and achieves a striking 32% success rate gain in a real-world cross-embodiment setting. Our work presents a general and lightweight solution that greatly enhances the practicality of deploying VLA models to new robotic platforms and tasks.
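
The abstract's choice of reverse KL divergence can be motivated with a toy example. The sketch below is not the paper's VAE objective; it is a 1-D illustration, under assumed values, of the general property that minimizing the reverse KL, KL(q || p), tends to lock q onto a single mode of p, whereas the forward KL, KL(p || q), tends to spread q across all modes. The bimodal "pre-training" density, the grid ranges, and all function names are hypothetical.

```python
import numpy as np

def log_normal(x, mu, sigma):
    """Log-density of a 1-D Gaussian N(mu, sigma^2)."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

# Integration grid and a hypothetical bimodal "pre-training" density p.
x = np.linspace(-20.0, 20.0, 4001)
dx = x[1] - x[0]
log_p = np.logaddexp(log_normal(x, -3.0, 0.5), log_normal(x, 3.0, 0.5)) + np.log(0.5)

def kl(log_a, log_b):
    """Numerical KL(a || b) on the grid, computed from log-densities."""
    return float(np.sum(np.exp(log_a) * (log_a - log_b)) * dx)

def best_fit(direction):
    """Grid-search the single Gaussian q that minimizes the chosen KL direction."""
    best = (np.inf, None, None)
    for mu in np.linspace(-4.0, 4.0, 81):
        for sigma in np.linspace(0.3, 4.0, 38):
            log_q = log_normal(x, mu, sigma)
            if direction == "reverse":
                value = kl(log_q, log_p)   # KL(q || p): mode-seeking
            else:
                value = kl(log_p, log_q)   # KL(p || q): mass-covering
            if value < best[0]:
                best = (value, float(mu), float(sigma))
    return best[1], best[2]

mu_rev, sigma_rev = best_fit("reverse")
mu_fwd, sigma_fwd = best_fit("forward")
print("reverse KL fit:", mu_rev, sigma_rev)
print("forward KL fit:", mu_fwd, sigma_fwd)
```

In this toy setting the reverse-KL fit settles on one of the two modes with a narrow width, while the forward-KL fit sits between the modes with a wide variance. That contrast is one way to read the abstract's phrase "embeds adaptation actions into modes of the pre-training action latent distribution".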

Problem

Research questions and friction points this paper is trying to address.

Adapting vision-language-action models to new robotic tasks efficiently

Addressing action distribution mismatches in cross-embodiment scenarios

Reducing data and computation requirements for model fine-tuning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified latent space alignment for action spaces

Reverse KL divergence constrained variational autoencoder

Guidance mechanism that steers the diffusion- or flow-based generation process during fine-tuning
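
The idea of steering a generative process with guidance can be sketched on a toy problem. The following is an assumption-laden illustration, not the ATE mechanism: the "model" is simply the score of a unit Gaussian centered on a hypothetical pre-training mode, and the guidance term is the score of a unit Gaussian centered on a hypothetical target-domain point. The function `steer`, its parameters, and the numeric values are invented for this example.

```python
import numpy as np

def steer(x0, pretrain_mode, target_mode, guidance_scale, step=0.1, n_steps=200):
    """Iteratively update x using a base score plus a scaled guidance score."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(n_steps):
        base = pretrain_mode - x      # score of N(pretrain_mode, I): the toy "model"
        guide = target_mode - x       # score of N(target_mode, I): the toy guidance
        x = x + step * (base + guidance_scale * guide)
    return x

pretrain_mode = np.array([0.0, 0.0])   # hypothetical pre-training action latent
target_mode = np.array([1.0, 2.0])     # hypothetical target-domain latent
x0 = np.array([-1.0, 0.5])             # arbitrary starting sample

unguided = steer(x0, pretrain_mode, target_mode, guidance_scale=0.0)
guided = steer(x0, pretrain_mode, target_mode, guidance_scale=4.0)
print("unguided:", unguided)
print("guided:  ", guided)
```

With the guidance scale set to zero the sample converges to the pre-training mode; with a positive scale it converges to a weighted average of the two points, (pretrain_mode + scale * target_mode) / (1 + scale), so larger scales pull the output further toward the target domain. A real diffusion- or flow-based VLA would replace the toy base score with the learned denoiser or velocity field.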
