TwinBrainVLA: Unleashing the Potential of Generalist VLMs for Embodied Tasks via Asymmetric Mixture-of-Transformers

📅 2026-01-20
📈 Citations: 2
Influential: 0
🤖 AI Summary
This work addresses the challenge that fine-tuning general-purpose vision-language models (VLMs) for robotic control often leads to degradation of semantic understanding and interference with fine-grained motor learning. To mitigate this, the authors propose TwinBrainVLA, an architecture that freezes a pre-trained VLM as the “left brain” to preserve open-world semantic comprehension, while introducing a trainable, embodied perception-specific model as the “right brain” to learn high-precision continuous actions. The two components are integrated via an Asymmetric Mixture-of-Transformers (AsyMoT) mechanism and further enhanced by a Flow-Matching action expert module, enabling effective fusion of high-level semantics and low-level control. Experiments demonstrate that the approach outperforms existing methods on SimplerEnv and RoboCasa benchmarks, significantly alleviates catastrophic forgetting, and maintains the pretrained VLM’s general visual understanding capabilities.
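The flow-matching action expert mentioned above trains a network to regress a velocity field that transports noise samples to action samples, then integrates that field at inference time. The following is a minimal NumPy sketch of the idea, not the paper's implementation: the toy "expert" uses a closed-form velocity for a single fixed target action (a point-mass distribution), whereas the real module would be a learned network conditioned on the fused semantic representation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-DoF target action (illustrative, not from the paper).
target_action = np.array([0.5, -0.2, 0.8])

# For a point-mass target, the optimal velocity field has a closed form:
# v(a_t, t) = (target - a_t) / (1 - t), which equals (target - a0) along
# the linear interpolant. A learned expert would regress this quantity.
def velocity(a_t, t):
    return (target_action - a_t) / (1.0 - t)

# Training-target construction for one sample:
a0 = rng.normal(size=3)                  # noise sample
t = 0.3
a_t = (1 - t) * a0 + t * target_action   # linear interpolant between noise and action
v_star = target_action - a0              # regression target for the network

# Inference: integrate the ODE da/dt = v(a, t) from noise with Euler steps.
a = rng.normal(size=3)
steps = 100
for i in range(steps):
    a = a + (1 / steps) * velocity(a, i / steps)
# After integration, `a` has been transported onto the target action.
```

With the closed-form field, the Euler trajectory lands exactly on the target at the final step; a trained network would only approximate this, but the interpolant and regression target are constructed the same way.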

📝 Abstract
The fundamental premise of Vision-Language-Action (VLA) models is to harness the extensive general capabilities of pre-trained Vision-Language Models (VLMs) for generalized embodied intelligence. However, standard robotic fine-tuning inevitably disrupts the pre-trained feature space, leading to "catastrophic forgetting" that compromises the very general visual understanding we aim to leverage. To effectively utilize the uncorrupted general capabilities of VLMs for robotic tasks, we propose TwinBrainVLA, which coordinates two isomorphic VLM pathways: a frozen generalist (the "Left Brain") and a trainable specialist (the "Right Brain"). Our architecture utilizes an Asymmetric Mixture-of-Transformers (AsyMoT) mechanism, enabling the Right Brain to dynamically query intact semantic knowledge from the Left Brain and fuse it with proprioceptive states. This fused representation conditions a flow-matching action expert for precise continuous control. Empirical results on the SimplerEnv and RoboCasa benchmarks demonstrate that, by explicitly retaining general capabilities, TwinBrainVLA achieves substantial performance gains over baseline models on complex manipulation tasks.
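The abstract's core mechanism is the Right Brain querying semantic features from the frozen Left Brain. One plausible reading of such a query step is cross-attention from trainable-pathway tokens over frozen-pathway tokens, with gradients flowing only through the querying side. The toy single-head sketch below illustrates that pattern; the projection weights, shapes, and residual fusion are placeholders, not the paper's AsyMoT layers.

```python
import numpy as np

def cross_attention(queries, keys_values, d):
    """Single-head cross-attention: trainable 'Right Brain' tokens (queries)
    attend over frozen 'Left Brain' tokens (keys/values)."""
    # Hypothetical projection weights; in the paper these would live inside
    # the AsyMoT layers. Random placeholders here, for illustration only.
    rng = np.random.default_rng(0)
    Wq = rng.normal(size=(d, d)) / np.sqrt(d)
    Wk = rng.normal(size=(d, d)) / np.sqrt(d)
    Wv = rng.normal(size=(d, d)) / np.sqrt(d)

    Q = queries @ Wq          # (n_right, d)
    K = keys_values @ Wk      # (n_left, d)
    V = keys_values @ Wv      # (n_left, d)

    scores = Q @ K.T / np.sqrt(d)                    # (n_right, n_left)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return queries + weights @ V, weights            # residual fusion

d = 8
right = np.random.default_rng(1).normal(size=(4, d))  # trainable pathway tokens
left = np.random.default_rng(2).normal(size=(6, d))   # frozen pathway tokens
fused, w = cross_attention(right, left, d)
```

Because the frozen pathway only supplies keys and values, its weights receive no updates during robotic fine-tuning, which is how this design preserves the pre-trained feature space while the querying pathway specializes.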
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action
catastrophic forgetting
embodied tasks
semantic understanding
sensorimotor skills
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language-Action (VLA)
Asymmetric Mixture-of-Transformers
catastrophic forgetting
embodied perception
generalist VLM
Bin Yu (HIT, ZGCA)
Shijie Lian (ZGCA, HUST)
Xiaopeng Lin (ZGCA, HKUST(GZ))
Yuliang Wei (HIT, ZGCA)
Zhaolong Shen (ZGCA, BUAA)
Changti Wu (ZGCA, ECNU)
Yuzhuo Miao (HIT, ZGCA)
Xinming Wang (ZGCA, CASIA)
Bailing Wang (HIT)
Cong Huang (University of Science and Technology of China)
Kai Chen (Shanghai AI Laboratory)