🤖 AI Summary
This work addresses the limitations of existing Vision Mamba models, which rely on unidirectional scanning and thus struggle to capture non-causal dependencies among image patches while exhibiting poor computational efficiency on short sequences. To overcome these issues, we propose an efficient Vision Mamba architecture that enables bidirectional information exchange under unidirectional scanning through an auxiliary patch-swapping mechanism. Furthermore, we integrate batch folding and periodic state reset strategies to enhance GPU parallelism. Our approach maintains linear computational complexity while significantly improving modeling capacity. Extensive experiments demonstrate that the proposed method consistently outperforms state-of-the-art baselines across multiple vision tasks—including image classification, object detection, and instance and semantic segmentation—and achieves higher throughput across various model scales.
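The summary does not describe how batch folding and periodic state reset are actually implemented; as a hedged toy sketch (all names and details below are illustrative assumptions, with a plain running sum standing in for the Mamba recurrence), the general idea is to fold a batch of short patch sequences into one long sequence so the scan runs over more steps per kernel launch, while resetting the recurrent state at every image boundary so images do not leak into each other:

```python
import numpy as np

def folded_scan_with_reset(batch):
    # Illustrative assumption, not the paper's implementation: fold a
    # batch of B short sequences of length L into ONE sequence of B*L
    # steps, which keeps the GPU busier than B separate short scans.
    # The "scan" is a running sum standing in for a Mamba recurrence.
    B, L = batch.shape
    flat = batch.reshape(-1)
    out = np.empty_like(flat)
    state = 0.0
    for t, v in enumerate(flat):
        if t % L == 0:
            state = 0.0  # periodic state reset at each image boundary
        state = state + v
        out[t] = state
    return out.reshape(B, L)

batch = np.array([[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]])
folded = folded_scan_with_reset(batch)
# With the resets in place, the folded scan reproduces B independent
# per-image scans exactly.
assert np.allclose(folded, np.cumsum(batch, axis=1))
```

The reset is what makes folding safe: without it, the state at the start of image b+1 would still contain image b's patches.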
📝 Abstract
Mamba for vision has advanced rapidly in recent years in pursuit of alternatives to Vision Transformers (ViTs), which suffer from quadratic complexity. While the recurrent scanning mechanism of Mamba offers computational efficiency, it inherently limits non-causal interactions between image patches. Prior works have attempted to address this limitation through various multi-scan strategies; however, these approaches suffer from inefficiencies due to suboptimal scan designs and frequent data rearrangement. Moreover, Mamba runs relatively slowly at the short token lengths common in visual tasks. In pursuit of a truly efficient vision encoder, we rethink the scan operation for vision and the computational efficiency of Mamba. To this end, we propose SF-Mamba, a novel visual Mamba with two key proposals: auxiliary patch swapping for encoding bidirectional information flow under a unidirectional scan, and batch folding with periodic state reset for improved GPU parallelism. Extensive experiments on image classification, object detection, and instance and semantic segmentation consistently demonstrate that our proposed SF-Mamba significantly outperforms state-of-the-art baselines while improving throughput across different model sizes. We will release the source code after publication.
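The abstract does not specify the exact patch-swapping design. As a hedged sketch of the general principle (the function names and the running-sum "scan" below are illustrative assumptions, not the paper's method), one way a single unidirectional scan can expose bidirectional context is to append a swapped (reversed) auxiliary copy of the patches and read each patch's output from its auxiliary position:

```python
import numpy as np

def unidirectional_scan(x):
    # Stand-in for a Mamba-style causal recurrence: the output at step t
    # depends only on patches 0..t (here, a running sum).
    return np.cumsum(x)

def scan_with_auxiliary_swapped_patches(x):
    # Illustrative assumption: append a reversed ("swapped") auxiliary
    # copy of the patches, then run ONE strictly unidirectional scan
    # over the doubled sequence.
    L = len(x)
    doubled = np.concatenate([x, x[::-1]])
    scanned = unidirectional_scan(doubled)
    # Read patch i's output from its auxiliary position 2L-1-i: by the
    # time the scan reaches that position, its state has absorbed every
    # patch, so each output carries context from both directions even
    # though the scan order itself stays causal.
    return scanned[2 * L - 1 - np.arange(L)]

patches = np.array([1.0, 2.0, 3.0, 4.0])
out = scan_with_auxiliary_swapped_patches(patches)
# Every position now depends on the full sequence, which a plain
# forward scan cannot achieve for early positions.
print(out)  # [20. 19. 17. 14.]
```

Note the contrast with multi-scan baselines: here there is a single scan direction and a single pass, at the cost of a longer (doubled) sequence, which is where the batch-folding efficiency argument comes in.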