Expertise need not monopolize: Action-Specialized Mixture of Experts for Vision-Language-Action Learning

📅 2025-10-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address two key challenges in scaling Vision-Language-Action (VLA) models—inefficient reuse of pretrained weights and limited real-time control efficiency—this paper proposes AdaMoE, a sparse Mixture-of-Experts expansion of the action expert. The core innovation is decoupling expert selection from expert weighting: a learnable scale adapter works alongside the conventional router, enabling collaborative multi-expert decision-making and overcoming the limitations of “winner-takes-all” routing. The feed-forward layers of the action expert are replaced with sparsely activated expert layers governed by dynamic routing. On the LIBERO and RoboTwin benchmarks, the method achieves absolute improvements of 1.8% and 9.3%, respectively, and yields a 21.5% gain on real-robot manipulation tasks. The approach effectively balances representational capacity, computational efficiency, and cross-task transferability.

📝 Abstract
Vision-Language-Action (VLA) models are developing rapidly and demonstrating promising capabilities in robotic manipulation tasks. However, scaling up VLA models presents several critical challenges: (1) Training new VLA models from scratch demands substantial computational resources and extensive datasets; given the current scarcity of robot data, it is particularly valuable to fully leverage well-pretrained VLA model weights during scaling. (2) Real-time control requires carefully balancing model capacity with computational efficiency. To address these challenges, we propose AdaMoE, a Mixture-of-Experts (MoE) architecture that inherits pretrained weights from dense VLA models and scales up the action expert by replacing its feed-forward layers with sparsely activated MoE layers. AdaMoE decouples expert selection from expert weighting through an independent scale adapter working alongside the traditional router. This enables experts to be selected based on task relevance while contributing with independently controlled weights, allowing collaborative expert utilization rather than winner-takes-all dynamics. Our approach demonstrates that expertise need not monopolize: through collaborative expert utilization, we can achieve superior performance while maintaining computational efficiency. AdaMoE consistently outperforms the baseline model across key benchmarks, delivering performance gains of 1.8% on LIBERO and 9.3% on RoboTwin. Most importantly, a substantial 21.5% improvement in real-world experiments validates its practical effectiveness for robotic manipulation tasks.
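The decoupled routing described in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the class and variable names, shapes, and the choice of top-k selection over softmaxed adapter logits are all assumptions for illustration. The key idea shown is that the router decides *which* experts fire, while a separate scale adapter independently decides *how much* each selected expert contributes.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class DecoupledMoELayer:
    """Hypothetical sketch of AdaMoE-style routing: the router picks
    the top-k experts (selection), while an independent scale adapter
    assigns their mixing weights (weighting), instead of a single
    winner-takes-all softmax doing both jobs."""

    def __init__(self, d_model, n_experts, top_k, seed=0):
        rng = np.random.default_rng(seed)
        self.top_k = top_k
        # Router projection: produces task-relevance logits for selection.
        self.router = rng.normal(0, 0.02, (d_model, n_experts))
        # Scale adapter: separate projection that controls contribution weights.
        self.scale_adapter = rng.normal(0, 0.02, (d_model, n_experts))
        # Each expert stands in for a feed-forward block (here: one linear map).
        self.experts = [rng.normal(0, 0.02, (d_model, d_model))
                        for _ in range(n_experts)]

    def __call__(self, x):
        # Selection: pick top-k experts by router logits.
        sel_logits = x @ self.router
        chosen = np.argsort(sel_logits)[-self.top_k:]
        # Weighting: mixing coefficients come from the scale adapter,
        # normalized only over the chosen experts.
        weights = softmax((x @ self.scale_adapter)[chosen])
        # Combine the sparse set of expert outputs collaboratively.
        return sum(w * (x @ self.experts[i]) for w, i in zip(weights, chosen))

layer = DecoupledMoELayer(d_model=8, n_experts=4, top_k=2)
y = layer(np.ones(8))
```

Because selection and weighting use separate parameters, an expert chosen for task relevance can still contribute with a small weight, which is the "collaborative rather than winner-takes-all" behavior the abstract emphasizes.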
Problem

Research questions and friction points this paper is trying to address.

Scaling Vision-Language-Action models efficiently with limited robot data
Balancing model capacity with computational efficiency for real-time control
Leveraging pretrained VLA weights while avoiding winner-takes-all expert selection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses sparsely activated MoE layers for scaling
Decouples expert selection from expert weighting
Leverages pretrained VLA model weights efficiently
Weijie Shen
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University
Yitian Liu
School of Computer Science, Shanghai Jiao Tong University
Yuhao Wu
Tsinghua Shenzhen International Graduate School, Tsinghua University
Zhixuan Liang
University of Hong Kong
Sijia Gu
Tongji University
Dehui Wang
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University
Tian Nian
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University
Lei Xu
School of Computer Science, Shanghai Jiao Tong University
Yusen Qin
D-Robotics
Jiangmiao Pang
Shanghai AI Laboratory
Xinping Guan
Shanghai Jiao Tong University
Xiaokang Yang
School of Computer Science, Shanghai Jiao Tong University
Yao Mu
School of Computer Science, Shanghai Jiao Tong University