TwinBrainVLA: Unleashing the Potential of Generalist VLMs for Embodied Tasks via Asymmetric Mixture-of-Transformers

📅 2026-01-20
📈 Citations: 2
Influential: 0
🤖 AI Summary
This work addresses the challenge that fine-tuning general-purpose vision-language models (VLMs) for robotic control often leads to degradation of semantic understanding and interference with fine-grained motor learning. To mitigate this, the authors propose TwinBrainVLA, an architecture that freezes a pre-trained VLM as the “left brain” to preserve open-world semantic comprehension, while introducing a trainable, embodied perception-specific model as the “right brain” to learn high-precision continuous actions. The two components are integrated via an Asymmetric Mixture-of-Transformers (AsyMoT) mechanism and further enhanced by a Flow-Matching action expert module, enabling effective fusion of high-level semantics and low-level control. Experiments demonstrate that the approach outperforms existing methods on SimplerEnv and RoboCasa benchmarks, significantly alleviates catastrophic forgetting, and maintains the pretrained VLM’s general visual understanding capabilities.
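The flow-matching action expert mentioned above trains a network to regress a velocity field that transports noise samples to action samples, then integrates that field at inference time. The following is a minimal NumPy sketch of the idea, not the paper's implementation: the toy "expert" uses a closed-form velocity for a single fixed target action (a point-mass distribution), whereas the real module would be a learned network conditioned on the fused semantic representation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-DoF target action (illustrative, not from the paper).
target_action = np.array([0.5, -0.2, 0.8])

# For a point-mass target, the optimal velocity field has a closed form:
# v(a_t, t) = (target - a_t) / (1 - t), which equals (target - a0) along
# the linear interpolant. A learned expert would regress this quantity.
def velocity(a_t, t):
    return (target_action - a_t) / (1.0 - t)

# Training-target construction for one sample:
a0 = rng.normal(size=3)                  # noise sample
t = 0.3
a_t = (1 - t) * a0 + t * target_action   # linear interpolant between noise and action
v_star = target_action - a0              # regression target for the network

# Inference: integrate the ODE da/dt = v(a, t) from noise with Euler steps.
a = rng.normal(size=3)
steps = 100
for i in range(steps):
    a = a + (1 / steps) * velocity(a, i / steps)
# After integration, `a` has been transported onto the target action.
```

With the closed-form field, the Euler trajectory lands exactly on the target at the final step; a trained network would only approximate this, but the interpolant and regression target are constructed the same way.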

📝 Abstract
The fundamental premise of Vision-Language-Action (VLA) models is to harness the extensive general capabilities of pre-trained Vision-Language Models (VLMs) for generalized embodied intelligence. However, standard robotic fine-tuning inevitably disrupts the pre-trained feature space, leading to "catastrophic forgetting" that compromises the very general visual understanding we aim to leverage. To effectively utilize the uncorrupted general capabilities of VLMs for robotic tasks, we propose TwinBrainVLA, which coordinates two isomorphic VLM pathways: a frozen generalist (the "Left Brain") and a trainable specialist (the "Right Brain"). Our architecture utilizes an Asymmetric Mixture-of-Transformers (AsyMoT) mechanism, enabling the Right Brain to dynamically query intact semantic knowledge from the Left Brain and fuse it with proprioceptive states. This fused representation conditions a flow-matching action expert for precise continuous control. Empirical results on the SimplerEnv and RoboCasa benchmarks demonstrate that, by explicitly retaining general capabilities, TwinBrainVLA achieves substantial performance gains over baseline models on complex manipulation tasks.
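The abstract's core mechanism is the Right Brain querying semantic features from the frozen Left Brain. One plausible reading of such a query step is cross-attention from trainable-pathway tokens over frozen-pathway tokens, with gradients flowing only through the querying side. The toy single-head sketch below illustrates that pattern; the projection weights, shapes, and residual fusion are placeholders, not the paper's AsyMoT layers.

```python
import numpy as np

def cross_attention(queries, keys_values, d):
    """Single-head cross-attention: trainable 'Right Brain' tokens (queries)
    attend over frozen 'Left Brain' tokens (keys/values)."""
    # Hypothetical projection weights; in the paper these would live inside
    # the AsyMoT layers. Random placeholders here, for illustration only.
    rng = np.random.default_rng(0)
    Wq = rng.normal(size=(d, d)) / np.sqrt(d)
    Wk = rng.normal(size=(d, d)) / np.sqrt(d)
    Wv = rng.normal(size=(d, d)) / np.sqrt(d)

    Q = queries @ Wq          # (n_right, d)
    K = keys_values @ Wk      # (n_left, d)
    V = keys_values @ Wv      # (n_left, d)

    scores = Q @ K.T / np.sqrt(d)                    # (n_right, n_left)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return queries + weights @ V, weights            # residual fusion

d = 8
right = np.random.default_rng(1).normal(size=(4, d))  # trainable pathway tokens
left = np.random.default_rng(2).normal(size=(6, d))   # frozen pathway tokens
fused, w = cross_attention(right, left, d)
```

Because the frozen pathway only supplies keys and values, its weights receive no updates during robotic fine-tuning, which is how this design preserves the pre-trained feature space while the querying pathway specializes.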
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action
catastrophic forgetting
embodied tasks
semantic understanding
sensorimotor skills
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language-Action (VLA)
Asymmetric Mixture-of-Transformers
catastrophic forgetting
embodied perception
generalist VLM
Bin Yu (HIT, ZGCA)
Shijie Lian (ZGCA, HUST)
Xiaopeng Lin (ZGCA, HKUST(GZ))
Yuliang Wei (HIT, ZGCA)
Zhaolong Shen (ZGCA, BUAA)
Changti Wu (ZGCA, ECNU)
Yuzhuo Miao (HIT, ZGCA)
Xinming Wang (ZGCA, CASIA)
Bailing Wang (HIT)
Cong Huang (University of Science and Technology of China)
Kai Chen (Shanghai AI Laboratory)