M3ET: Efficient Vision-Language Learning for Robotics based on Multimodal Mamba-Enhanced Transformer

📅 2025-09-22
🤖 AI Summary
In robot vision-language learning, key bottlenecks include weak unsupervised semantic extraction, severe modality loss, and high computational overhead. To address these, we propose a lightweight multimodal fusion model. Methodologically, we innovatively integrate Mamba’s efficient sequence modeling capability with a semantic-adaptive attention mechanism, constructing a Mamba-enhanced Transformer architecture that jointly optimizes feature fusion, cross-modal alignment, and modality reconstruction. Compared to state-of-the-art methods, our model reduces parameter count by 67%, accelerates pretraining inference by 2.3×, and achieves 0.74 accuracy on the VQA task. These improvements significantly enhance deployment feasibility on resource-constrained mobile robotic platforms while boosting robustness in semantic understanding.

📝 Abstract
In recent years, multimodal learning has become essential in robotic vision and information fusion, especially for understanding human behavior in complex environments. However, current methods struggle to fully leverage the textual modality, relying on supervised pretrained models, which limits semantic extraction in unsupervised robotic environments, particularly under significant modality loss. These methods also tend to be computationally intensive, leading to high resource consumption in real-world applications. To address these challenges, we propose the Multimodal Mamba-Enhanced Transformer (M3ET), a lightweight model designed for efficient multimodal learning, particularly on mobile platforms. By incorporating the Mamba module and a semantic-based adaptive attention mechanism, M3ET optimizes feature fusion, alignment, and modality reconstruction. Our experiments show that M3ET improves cross-task performance, with a 2.3× increase in pretraining inference speed. In particular, the core VQA task accuracy of M3ET remains at 0.74, while the model's parameter count is reduced by 67%. Although performance on the EQA task is limited, M3ET's lightweight design makes it well suited for deployment on resource-constrained robotic platforms.
Problem

Research questions and friction points this paper is trying to address.

Improving semantic extraction in unsupervised robotic environments with modality loss
Reducing computational intensity and resource consumption in multimodal learning
Enabling efficient vision-language learning for resource-constrained robotic platforms
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mamba module enhances multimodal feature fusion
Semantic-based adaptive attention optimizes alignment
Lightweight design reduces parameters by 67%
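The two core ideas above can be sketched in miniature: a Mamba-style selective state-space scan for efficient sequence modeling, and an attention-based cross-modal fusion step. The following NumPy toy is illustrative only; the function names, shapes, and parameterization are assumptions, not the authors' implementation.

```python
import numpy as np

def selective_scan(x, delta, A, B, C):
    """Simplified Mamba-style selective state-space scan (illustrative).

    x:     (T, D) input sequence
    delta: (T, D) input-dependent step sizes (the "selection" mechanism)
    A:     (D, N) state-transition parameters (negative for stable decay)
    B, C:  (T, N) input-dependent in/out projections
    Returns y: (T, D).
    """
    T, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))  # one N-dimensional state per channel
    y = np.empty((T, D))
    for t in range(T):
        # Discretize with input-dependent delta: the model learns per-step
        # how strongly to keep or forget past state.
        Abar = np.exp(delta[t][:, None] * A)        # (D, N)
        Bbar = delta[t][:, None] * B[t][None, :]    # (D, N)
        h = Abar * h + Bbar * x[t][:, None]         # recurrent state update
        y[t] = h @ C[t]                             # read out the state
    return y

def fuse_modalities(vision, text):
    """Toy cross-modal fusion: text tokens attend over vision tokens."""
    scores = text @ vision.T / np.sqrt(vision.shape[1])   # (Tt, Tv)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                    # softmax rows
    return w @ vision                                     # (Tt, D) fused
```

Unlike full self-attention, the scan above runs in linear time in sequence length, which is the property that motivates Mamba-style modules on resource-constrained platforms.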
Yanxin Zhang
School of Software, Northwestern Polytechnical University, Xi’an, China
Liang He
School of Software, Northwestern Polytechnical University, Xi’an, China
Zeyi Kang
School of Software, Northwestern Polytechnical University, Xi’an, China
Zuheng Ming
Institut Galilée, Université Sorbonne Paris Nord
Kaixing Zhao
School of Software, Yangtze River Delta Research Institute (Taicang), Northwestern Polytechnical University, Xi’an, China

Keywords: multimodal learning, computer vision, deep learning