🤖 AI Summary
Hybrid Mamba-Transformer vision backbones lack an efficient, unified pretraining paradigm. Method: This paper proposes Masked Autoregressive Pretraining (MAP), the first end-to-end co-optimization framework tailored to such hybrid architectures. MAP combines MAE-style local image reconstruction with Mamba's sequential modeling, employing block-wise image serialization, cross-modal positional encoding, and 2D/3D multi-scale reconstruction objectives to jointly strengthen representation learning in both the Transformer and Mamba modules. Results: Across diverse 2D and 3D vision benchmarks, including ImageNet-1K, ADE20K, COCO, ScanNet, and S3DIS, MAP consistently outperforms single-architecture baselines (e.g., MAE, SimMIM) and achieves state-of-the-art performance. The code and pretrained models are publicly released.
📝 Abstract
Hybrid Mamba-Transformer networks have recently garnered broad attention. These networks can leverage the scalability of Transformers while capitalizing on Mamba's strengths in long-context modeling and computational efficiency. However, how to effectively pretrain such hybrid networks remains an open question. Existing methods, such as Masked Autoencoders (MAE) or autoregressive (AR) pretraining, primarily target single-type network architectures, whereas a pretraining strategy for a hybrid architecture must benefit both its Mamba and Transformer components. To this end, we propose Masked Autoregressive Pretraining (MAP) to pretrain a hybrid Mamba-Transformer vision backbone network. This strategy combines the strengths of MAE and autoregressive pretraining, improving the performance of both the Mamba and Transformer modules within a unified paradigm. Experimental results show that the hybrid Mamba-Transformer vision backbone pretrained with MAP significantly outperforms other pretraining strategies, achieving state-of-the-art performance. We validate the method's effectiveness on both 2D and 3D datasets and provide detailed ablation studies to support the design choices for each component. The code and checkpoints are available at https://github.com/yunzeliu/MAP.
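To make the core idea concrete, the following is a minimal, hypothetical NumPy sketch of a masked-autoregressive objective of the kind the abstract describes: patches are randomly masked, predictions are scored only at the masked positions (as in MAE), and the masked indices are kept in scan order so a sequential model like Mamba can predict them conditioned on the visible prefix. All function names and shapes here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def split_mask(num_patches, mask_ratio=0.6):
    """Randomly choose which patches to mask; keep masked
    indices sorted (scan order) for autoregressive prediction."""
    n_mask = int(num_patches * mask_ratio)
    masked_idx = np.sort(rng.choice(num_patches, n_mask, replace=False))
    visible_idx = np.setdiff1d(np.arange(num_patches), masked_idx)
    return visible_idx, masked_idx

def masked_mse(pred, target, masked_idx):
    """MAE-style loss: mean squared error computed only
    over the masked patches, not the visible ones."""
    diff = pred[masked_idx] - target[masked_idx]
    return float(np.mean(diff ** 2))

# Toy data: 16 patch embeddings of dimension 8.
patches = rng.standard_normal((16, 8))
visible_idx, masked_idx = split_mask(16, mask_ratio=0.6)

# A perfect reconstruction gives zero loss on the masked set.
loss = masked_mse(patches, patches, masked_idx)
```

In the actual method, the reconstruction of each masked patch would be produced by the hybrid encoder-decoder rather than copied from the target; the sketch only shows how the mask and the masked-only loss interact.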