🤖 AI Summary
Hybrid Mamba-Transformer vision backbones lack an efficient, unified pretraining paradigm. Method: This paper proposes Masked Autoregressive Pretraining (MAP), the first end-to-end co-optimization framework tailored to such hybrid architectures. MAP combines MAE-style local image reconstruction with Mamba's sequential modeling, employing block-wise image serialization, cross-modal positional encoding, and 2D/3D multi-scale reconstruction objectives to jointly strengthen representation learning in both the Transformer and Mamba modules. Results: Across diverse 2D and 3D vision benchmarks, including ImageNet-1K, ADE20K, COCO, ScanNet, and S3DIS, MAP consistently outperforms single-architecture baselines (e.g., MAE, SimMIM) and achieves state-of-the-art performance. The code and pretrained models are publicly released.
📝 Abstract
Hybrid Mamba-Transformer networks have recently garnered broad attention. These networks can leverage the scalability of Transformers while capitalizing on Mamba's strengths in long-context modeling and computational efficiency. However, how to effectively pretrain such hybrid networks remains an open question. Existing methods, such as Masked Autoencoders (MAE) or autoregressive (AR) pretraining, primarily target single-type network architectures, whereas a pretraining strategy for a hybrid architecture must benefit both its Mamba and Transformer components. To this end, we propose Masked Autoregressive Pretraining (MAP) to pretrain a hybrid Mamba-Transformer vision backbone network. This strategy combines the strengths of MAE and autoregressive pretraining, improving the performance of both the Mamba and Transformer modules within a unified paradigm. Experimental results show that the hybrid Mamba-Transformer vision backbone pretrained with MAP significantly outperforms other pretraining strategies, achieving state-of-the-art performance. We validate the method's effectiveness on both 2D and 3D datasets and provide detailed ablation studies to support the design choices for each component. The code and checkpoints are available at https://github.com/yunzeliu/MAP.
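To make the core idea concrete, the following is a minimal, hypothetical NumPy sketch of a masked-autoregressive objective of the kind the abstract describes: patches are randomly masked, predictions are scored only at the masked positions (as in MAE), and the masked indices are kept in scan order so a sequential model like Mamba can predict them conditioned on the visible prefix. All function names and shapes here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def split_mask(num_patches, mask_ratio=0.6):
    """Randomly choose which patches to mask; keep masked
    indices sorted (scan order) for autoregressive prediction."""
    n_mask = int(num_patches * mask_ratio)
    masked_idx = np.sort(rng.choice(num_patches, n_mask, replace=False))
    visible_idx = np.setdiff1d(np.arange(num_patches), masked_idx)
    return visible_idx, masked_idx

def masked_mse(pred, target, masked_idx):
    """MAE-style loss: mean squared error computed only
    over the masked patches, not the visible ones."""
    diff = pred[masked_idx] - target[masked_idx]
    return float(np.mean(diff ** 2))

# Toy data: 16 patch embeddings of dimension 8.
patches = rng.standard_normal((16, 8))
visible_idx, masked_idx = split_mask(16, mask_ratio=0.6)

# A perfect reconstruction gives zero loss on the masked set.
loss = masked_mse(patches, patches, masked_idx)
```

In the actual method, the reconstruction of each masked patch would be produced by the hybrid encoder-decoder rather than copied from the target; the sketch only shows how the mask and the masked-only loss interact.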