🤖 AI Summary
This work addresses the limitations of existing salient object detection (SOD) methods, which are constrained either by the limited receptive fields of convolutional operators or by the quadratic computational cost of Transformers, and are typically tailored to a single task, hindering unified handling of multimodal and multispectral scenarios. To overcome these challenges, we propose Saliency Mamba (Samba), a pure Mamba-based architecture featuring spatial neighborhood scanning (SNS) and context-aware upsampling (CAU), and extend it to Samba+, a unified and efficient model trained jointly across tasks with hub-and-spoke graph attention (HGA) for cross-modal fusion and modality-anchored continual learning (MACL) against inter-modal conflicts and catastrophic forgetting. Samba+ supports six task types, including RGB, RGB-D, RGB-T, and video SOD, and achieves state-of-the-art performance across 22 benchmarks with a single model, outperforming specialized approaches while significantly reducing computational overhead.
📝 Abstract
Existing salient object detection (SOD) models are generally constrained by the limited receptive fields of convolutional neural networks (CNNs) and the quadratic computational complexity of Transformers. Recently, the emerging state-space model, namely Mamba, has shown great potential in balancing global receptive fields and computational efficiency. As a solution, we propose Saliency Mamba (Samba), a pure Mamba-based architecture that flexibly handles various distinct SOD tasks, including RGB/RGB-D/RGB-T SOD, video SOD (VSOD), RGB-D VSOD, and visible-depth-thermal SOD. Specifically, we rethink the scanning strategy of Mamba for SOD and introduce a saliency-guided Mamba block (SGMB) that features a spatial neighborhood scanning (SNS) algorithm to preserve the spatial continuity of salient regions. A context-aware upsampling (CAU) method is also proposed to promote hierarchical feature alignment and aggregation by modeling contextual dependencies. Going one step further, to avoid the "task-specific" problem of previous SOD solutions, we develop Samba+, which is empowered by training Samba in a multi-task joint manner, leading to a more unified and versatile model. We investigate two crucial components that collaboratively tackle the challenges of arbitrary-modality input and continual adaptation. Specifically, a hub-and-spoke graph attention (HGA) module facilitates adaptive cross-modal interactive fusion, and a modality-anchored continual learning (MACL) strategy alleviates inter-modal conflicts together with catastrophic forgetting. Extensive experiments demonstrate that Samba, trained individually per task, outperforms existing methods across six SOD tasks on 22 datasets at lower computational cost, whereas Samba+ achieves even better results on these tasks and datasets with a single versatile model. Additional results further demonstrate the potential of our Samba framework.
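The abstract does not detail the SNS algorithm, but the problem it targets is concrete: Mamba consumes 1-D token sequences, and a standard raster scan places vertically adjacent pixels `W` steps apart, breaking the spatial continuity of salient regions. The minimal PyTorch sketch below illustrates one possible locality-preserving scan order (a window-by-window traversal); `neighborhood_scan_order` and the window size are hypothetical illustrations of the general idea, not the paper's actual SNS algorithm.

```python
import torch

def neighborhood_scan_order(h: int, w: int, win: int = 2) -> torch.Tensor:
    """Locality-preserving order: traverse the grid window by window so
    that pixels within each win x win neighborhood stay contiguous in the
    1-D sequence. (Illustrative assumption; not the paper's SNS.)"""
    idx = torch.arange(h * w).view(h, w)
    order = []
    for i in range(0, h, win):
        for j in range(0, w, win):
            order.append(idx[i:i + win, j:j + win].reshape(-1))
    return torch.cat(order)

# Reorder flattened tokens before a selective-scan (Mamba) layer, then
# invert the permutation afterwards to restore the spatial layout.
B, C, H, W = 1, 64, 8, 8
x = torch.randn(B, C, H, W)
perm = neighborhood_scan_order(H, W)
inv = torch.argsort(perm)                  # inverse permutation
tokens = x.flatten(2).transpose(1, 2)      # (B, H*W, C), raster order
scanned = tokens[:, perm]                  # locality-preserving sequence
# ... `scanned` would be fed through a Mamba/SSM block here ...
restored = scanned[:, inv]                 # back to raster order
assert torch.equal(restored, tokens)
```

In a raster scan, the token below position (0, 0) sits `W` steps away in the sequence; in the windowed order above it is at most a few steps away, which is the kind of spatial continuity the SNS motivation describes.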
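Likewise, the hub-and-spoke structure of HGA can be read as routing all cross-modal interaction through a shared hub rather than wiring every modality pair directly. The sketch below is a hypothetical rendering of that idea using standard multi-head attention; `HubSpokeAttention`, its gather/scatter split, and the choice of 16 hub nodes are assumptions, since the abstract gives no implementation details.

```python
import torch
import torch.nn as nn

class HubSpokeAttention(nn.Module):
    """Illustrative hub-and-spoke fusion: learnable hub nodes gather
    context from each modality (spoke), then each modality queries the
    fused hub. An assumption-based sketch, not the paper's HGA module."""
    def __init__(self, dim: int, heads: int = 4, num_hub: int = 16):
        super().__init__()
        self.gather = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.scatter = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.hub = nn.Parameter(torch.randn(1, num_hub, dim))

    def forward(self, spokes: list[torch.Tensor]) -> list[torch.Tensor]:
        b = spokes[0].shape[0]
        hub = self.hub.expand(b, -1, -1)
        # Gather: hub nodes aggregate information from every modality.
        for tokens in spokes:
            hub = hub + self.gather(hub, tokens, tokens)[0]
        # Scatter: each modality retrieves fused cross-modal context.
        return [t + self.scatter(t, hub, hub)[0] for t in spokes]

rgb = torch.randn(2, 196, 64)    # e.g., 14x14 RGB tokens
depth = torch.randn(2, 196, 64)  # depth (or thermal / flow) tokens
fused_rgb, fused_depth = HubSpokeAttention(64)([rgb, depth])
```

One appeal of such a topology for a unified model is that the number of attention paths grows linearly with the number of input modalities, rather than quadratically as with pairwise cross-attention, which matches the abstract's goal of handling inputs of arbitrary modalities.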