Sigma: Siamese Mamba Network for Multi-Modal Semantic Segmentation

📅 2024-04-05
🏛️ IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)
📈 Citations: 56
Influential: 5
📄 PDF
🤖 AI Summary
To address the limited robustness of multi-modal semantic segmentation under challenging conditions such as low illumination and overexposure, this work integrates State Space Models (SSMs) into multi-modal perception. It proposes Sigma, a Siamese Mamba architecture with a Mamba-based cross-modal fusion mechanism, achieving linear computational complexity while preserving global receptive fields. The method comprises a Siamese encoder, Mamba-based sequence modeling, a cross-modal feature selection module, and a channel-aware decoder. Evaluated on RGB-Thermal and RGB-Depth semantic segmentation benchmarks, Sigma significantly outperforms CNN- and ViT-based baselines; to the authors' knowledge, it is the first successful application of SSMs to multi-modal segmentation, a step toward robust perception for AI agents in complex real-world environments.
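
A minimal sketch of how these four components might be wired together. All module names and interfaces here are hypothetical placeholders, not the authors' API; the real implementation (https://github.com/zifuwan/Sigma) builds on VMamba blocks:

```python
import torch
import torch.nn as nn

class SigmaSketch(nn.Module):
    """Hypothetical outline of the pipeline summarized above.

    encoder/fusion/decoder are injected stand-ins; none of these names
    come from the authors' code.
    """

    def __init__(self, encoder: nn.Module, fusion: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder  # one weight set, applied to both inputs (Siamese)
        self.fusion = fusion    # Mamba-based cross-modal feature selection
        self.decoder = decoder  # channel-aware segmentation head

    def forward(self, rgb: torch.Tensor, x_mod: torch.Tensor) -> torch.Tensor:
        # The shared encoder processes RGB and the X-modality (thermal or
        # depth) separately, producing multi-scale feature lists.
        feats_rgb = self.encoder(rgb)
        feats_x = self.encoder(x_mod)
        # Fuse the two streams at each scale, then decode to class logits.
        fused = [self.fusion(fr, fx) for fr, fx in zip(feats_rgb, feats_x)]
        return self.decoder(fused)
```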

📝 Abstract
Multi-modal semantic segmentation significantly enhances AI agents' perception and scene understanding, especially under adverse conditions like low-light or overexposed environments. Leveraging additional modalities (X-modality) like thermal and depth alongside traditional RGB provides complementary information, enabling more robust and reliable prediction. In this work, we introduce Sigma, a Siamese Mamba network for multi-modal semantic segmentation, utilizing the advanced selective state space model, Mamba. Unlike conventional methods that rely on CNNs, with their limited local receptive fields, or Vision Transformers (ViTs), which offer global receptive fields at the cost of quadratic complexity, our model achieves global receptive fields with linear complexity. By employing a Siamese encoder and innovating a Mamba-based fusion mechanism, we effectively select essential information from different modalities. A decoder is then developed to enhance the channel-wise modeling ability of the model. Our proposed method is rigorously evaluated on both RGB-Thermal and RGB-Depth semantic segmentation tasks, demonstrating its superiority and marking the first successful application of State Space Models (SSMs) in multi-modal perception tasks. Code is available at https://github.com/zifuwan/Sigma.
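
The linear-complexity claim rests on the state space recurrence: each token updates a fixed-size hidden state, so one sweep over L tokens costs O(L) while every output still depends on all earlier inputs. Below is a toy, non-selective scan for intuition only; Mamba additionally makes the recurrence matrices input-dependent and scans 2D features in multiple directions, none of which is shown here:

```python
import torch

def ssm_scan(x: torch.Tensor, A: torch.Tensor, B: torch.Tensor, C: torch.Tensor) -> torch.Tensor:
    """Toy linear-time scan: h_t = A h_{t-1} + B x_t,  y_t = C h_t.

    x: (L, D) token sequence; A: (N, N); B: (N, D); C: (D, N).
    One pass over L tokens is O(L), yet y_t depends on x_1..x_t: the
    global-receptive-field-at-linear-cost trade described in the abstract.
    """
    h = torch.zeros(A.shape[0])
    ys = []
    for x_t in x:
        h = A @ h + B @ x_t   # constant-size state update per token
        ys.append(C @ h)      # readout from the running state
    return torch.stack(ys)    # (L, D)

# Example: 16 tokens of dimension 4, hidden state of size 8.
y = ssm_scan(torch.randn(16, 4), 0.9 * torch.eye(8), torch.randn(8, 4), torch.randn(4, 8))
```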
Problem

Research questions and friction points this paper is trying to address.

How to keep multi-modal semantic segmentation reliable under adverse conditions such as low light and overexposure
How to obtain global receptive fields without the quadratic cost of ViT attention or the locality limits of CNNs
How to effectively select and fuse complementary information across modalities (RGB plus thermal or depth)
Innovation

Methods, ideas, or system contributions that make the work stand out.

Siamese Mamba (Sigma) network for multi-modal semantic segmentation
Global receptive fields at linear computational complexity
Mamba-based cross-modal fusion mechanism (a toy sketch follows below)
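
One toy way to realize cross-modal fusion with a linear-time sequence model: interleave tokens from the two modalities and scan them jointly, so each modality's running state is conditioned on the other. This is a hypothetical simplification with a GRU standing in for a Mamba block; the paper's actual cross-modal Mamba fusion differs:

```python
import torch
import torch.nn as nn

class InterleavedFusionSketch(nn.Module):
    """Hypothetical cross-modal fusion: alternate RGB and X-modality tokens
    in one sequence so information mixes during a single linear-time scan.
    A GRU stands in for the Mamba blocks used in the paper."""

    def __init__(self, dim: int):
        super().__init__()
        self.scan = nn.GRU(dim, dim, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, rgb_tok: torch.Tensor, x_tok: torch.Tensor) -> torch.Tensor:
        # rgb_tok, x_tok: (B, L, D) flattened spatial features per modality.
        b, l, d = rgb_tok.shape
        # Interleave as rgb_0, x_0, rgb_1, x_1, ... -> (B, 2L, D).
        inter = torch.stack((rgb_tok, x_tok), dim=2).reshape(b, 2 * l, d)
        out, _ = self.scan(inter)       # one O(L) pass over both modalities
        out = out.reshape(b, l, 2 * d)  # re-pair each token's two views
        return self.proj(out)           # fused (B, L, D) features
```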
👥 Authors
Zifu Wan · General Robotics · Computer Vision, Robotics
Yuhao Wang · School of Future Technology, Dalian University of Technology, China
Silong Yong · Ph.D. student at Carnegie Mellon University · 3D Vision, Embodied AI, Generative Model
Pingping Zhang · School of Future Technology, Dalian University of Technology, China
Simon Stepputtis · Virginia Tech · Artificial Intelligence, Natural Language Processing, Robotics, Human-Robot Interaction
Katia P. Sycara · Robotics Institute, Carnegie Mellon University, USA
Yaqi Xie · Robotics Institute, Carnegie Mellon University, USA