🤖 AI Summary
This work addresses the challenge that fine-tuning general-purpose vision-language models (VLMs) for robotic control often degrades their semantic understanding and interferes with fine-grained motor learning. To mitigate this, the authors propose TwinBrainVLA, an architecture that freezes a pre-trained VLM as the “left brain” to preserve open-world semantic comprehension, while introducing a trainable twin specialized for embodied perception as the “right brain” to learn high-precision continuous actions. The two components are integrated via an Asymmetric Mixture-of-Transformers (AsyMoT) mechanism and complemented by a flow-matching action expert module, enabling effective fusion of high-level semantics and low-level control. Experiments demonstrate that the approach outperforms existing methods on the SimplerEnv and RoboCasa benchmarks, substantially alleviates catastrophic forgetting, and preserves the pre-trained VLM’s general visual understanding.
📝 Abstract
The fundamental premise of Vision-Language-Action (VLA) models is to harness the extensive general capabilities of pre-trained Vision-Language Models (VLMs) for generalized embodied intelligence. However, standard robotic fine-tuning inevitably disrupts the pre-trained feature space, leading to “catastrophic forgetting” that compromises the very general visual understanding we aim to leverage. To effectively utilize the uncorrupted general capabilities of VLMs for robotic tasks, we propose TwinBrainVLA, which coordinates two isomorphic VLM pathways: a frozen generalist (the “Left Brain”) and a trainable specialist (the “Right Brain”). Our architecture employs an Asymmetric Mixture-of-Transformers (AsyMoT) mechanism, enabling the Right Brain to dynamically query intact semantic knowledge from the Left Brain and fuse it with proprioceptive states. This fused representation conditions a flow-matching action expert for precise continuous control. Empirical results on the SimplerEnv and RoboCasa benchmarks demonstrate that, by explicitly retaining general capabilities, TwinBrainVLA achieves substantial performance gains over baseline models on complex manipulation tasks.
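The abstract describes the key mechanics: a trainable pathway that cross-attends into frozen generalist features, and a flow-matching head conditioned on the fused representation. The paper does not publish implementation details here, so the following is a minimal, hypothetical PyTorch sketch of that pattern; all class names (`AsyMoTBlock`, `FlowMatchingHead`, `flow_matching_loss`), dimensions, and layer choices are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class AsyMoTBlock(nn.Module):
    """Illustrative asymmetric fusion block: the trainable "Right Brain"
    stream self-attends, then cross-attends into frozen "Left Brain"
    features, which serve only as keys/values and are never updated."""
    def __init__(self, d=64, heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.n1, self.n2, self.n3 = nn.LayerNorm(d), nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, right, left):
        # right: trainable specialist tokens (B, T_r, d)
        # left:  frozen generalist tokens   (B, T_l, d), e.g. detached VLM features
        q = self.n1(right)
        right = right + self.self_attn(q, q, q)[0]
        right = right + self.cross_attn(self.n2(right), left, left)[0]  # query the Left Brain
        return right + self.mlp(self.n3(right))

class FlowMatchingHead(nn.Module):
    """Illustrative action expert: predicts a velocity field
    v(a_t, t | context) used to transport noise into actions."""
    def __init__(self, d=64, action_dim=7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(action_dim + 1 + d, 128), nn.GELU(),
            nn.Linear(128, action_dim))

    def forward(self, noisy_action, t, context):
        return self.net(torch.cat([noisy_action, t, context], dim=-1))

def flow_matching_loss(head, actions, context):
    """Standard conditional flow-matching objective on a linear path:
    a_t = (1 - t) * noise + t * action, with target velocity action - noise."""
    noise = torch.randn_like(actions)
    t = torch.rand(actions.shape[0], 1)
    a_t = (1 - t) * noise + t * actions
    target = actions - noise
    pred = head(a_t, t, context)
    return ((pred - target) ** 2).mean()
```

In this sketch the frozen pathway would be run under `torch.no_grad()`, so gradients only flow through the specialist and the action head, which is how the design could preserve the generalist's feature space while still letting control learning benefit from it.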