🤖 AI Summary
To address the insufficient robustness of multimodal semantic segmentation under challenging conditions such as low illumination and overexposure, this work pioneers the integration of State Space Models (SSMs) into multimodal perception tasks. We propose the Siamese Mamba architecture and a novel Mamba-based cross-modal fusion mechanism, achieving linear computational complexity while preserving global receptive fields. The method comprises a Siamese encoder, Mamba-based sequential modeling, a channel-enhanced decoder, and a cross-modal feature selection module. Evaluated on RGB-Thermal and RGB-Depth semantic segmentation benchmarks, our approach significantly outperforms CNN- and ViT-based baselines. To the best of our knowledge, this is the first work to successfully apply SSMs to multimodal segmentation. Our framework establishes a new paradigm for robust perception by AI agents in complex real-world environments.
📝 Abstract
Multi-modal semantic segmentation significantly enhances AI agents' perception and scene understanding, especially under adverse conditions such as low-light or overexposed environments. Leveraging an additional modality (X-modality) such as thermal or depth alongside traditional RGB provides complementary information, enabling more robust and reliable predictions. In this work, we introduce Sigma, a Siamese Mamba network for multi-modal semantic segmentation built on the recently proposed Mamba state space model. Unlike conventional methods that rely on CNNs, with their limited local receptive fields, or on Vision Transformers (ViTs), which offer global receptive fields at the cost of quadratic complexity, our model achieves global receptive fields with linear complexity. By employing a Siamese encoder and introducing a Mamba-based fusion mechanism, we effectively select essential information from different modalities. A decoder is then developed to enhance the model's channel-wise modeling ability. Our proposed method is rigorously evaluated on both RGB-Thermal and RGB-Depth semantic segmentation tasks, demonstrating its superiority and marking the first successful application of State Space Models (SSMs) in multi-modal perception tasks. Code is available at https://github.com/zifuwan/Sigma.
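The "global receptive field with linear complexity" claim comes from the discrete state-space recurrence that underlies Mamba: a hidden state is updated once per sequence step, so a single O(L) scan lets every output depend on all earlier inputs. The following is a minimal NumPy sketch of that recurrence only (h_t = A h_{t-1} + B x_t, y_t = C h_t); the function name, shapes, and parameter values are illustrative and are not taken from the Sigma codebase, which additionally uses input-dependent (selective) parameters and cross-modal fusion.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Discrete state-space recurrence: h_t = A h_{t-1} + B x_t, y_t = C h_t.

    x: (L, d_in) input sequence; A: (N, N); B: (N, d_in); C: (d_out, N).
    One pass over the sequence -> O(L) time, yet each y_t depends on
    every x_s with s <= t (a global causal receptive field).
    """
    N = A.shape[0]
    h = np.zeros(N)
    ys = []
    for x_t in x:
        h = A @ h + B @ x_t   # state update carries history forward
        ys.append(C @ h)      # readout from the accumulated state
    return np.stack(ys)       # (L, d_out)

# Tiny example: 6-step sequence, 4-dim state, scalar input/output.
rng = np.random.default_rng(0)
L, N = 6, 4
A = 0.9 * np.eye(N)           # stable (contractive) state matrix
B = rng.normal(size=(N, 1))
C = rng.normal(size=(1, N))
x = rng.normal(size=(L, 1))
y = ssm_scan(x, A, B, C)
print(y.shape)  # (6, 1)
```

Perturbing the first input changes the last output, which is the sense in which the scan is "global" despite touching each step only once; Mamba makes A, B, C input-dependent so the model can select what to retain, and Sigma applies such scans to interleaved RGB/X-modality token sequences for fusion.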