Mixture-of-Mamba: Enhancing Multi-Modal State-Space Models with Modality-Aware Sparsity

📅 2025-01-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the lack of modality-specific expressiveness in State Space Models (SSMs) for multi-modal sequence modeling, this paper proposes Mixture-of-Mamba, a modality-aware sparse SSM architecture. It is the first to introduce modality-aware sparsity into SSMs, decoupling parameters across modalities and sparsifying them jointly via modality-specific projection modules that improve both feature representation and computational efficiency. At the 1.4B parameter scale, the method substantially reduces computational overhead: in the Transfusion setting it matches image loss using only 34.76% of the training FLOPs; in the Chameleon setting it reaches comparable image and text loss with 42.50% and 65.40% of the FLOPs, respectively; and in a three-modality setting that adds speech, it matches baseline speech loss while consuming only 24.80% of the FLOPs.

📝 Abstract
State Space Models (SSMs) have emerged as efficient alternatives to Transformers for sequential modeling, but their inability to leverage modality-specific features limits their performance in multi-modal pretraining. Here, we propose Mixture-of-Mamba, a novel SSM architecture that introduces modality-aware sparsity through modality-specific parameterization of the Mamba block. Building on Mixture-of-Transformers (W. Liang et al. arXiv:2411.04996; 2024), we extend the benefits of modality-aware sparsity to SSMs while preserving their computational efficiency. We evaluate Mixture-of-Mamba across three multi-modal pretraining settings: Transfusion (interleaved text and continuous image tokens with diffusion loss), Chameleon (interleaved text and discrete image tokens), and an extended three-modality framework incorporating speech. Mixture-of-Mamba consistently reaches the same loss values at earlier training steps with significantly reduced computational costs. In the Transfusion setting, Mixture-of-Mamba achieves equivalent image loss using only 34.76% of the training FLOPs at the 1.4B scale. In the Chameleon setting, Mixture-of-Mamba reaches similar image loss with just 42.50% of the FLOPs at the 1.4B scale, and similar text loss with just 65.40% of the FLOPs. In the three-modality setting, MoM matches speech loss at 24.80% of the FLOPs at the 1.4B scale. Our ablation study highlights the synergistic effects of decoupling projection components, where joint decoupling yields greater gains than individual modifications. These results establish modality-aware sparsity as a versatile and effective design principle, extending its impact from Transformers to SSMs and setting new benchmarks in multi-modal pretraining. Our code can be accessed at https://github.com/Weixin-Liang/Mixture-of-Mamba
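The abstract describes modality-aware sparsity as modality-specific parameterization of the Mamba block's projection components, with each token routed through the parameters of its own modality. A minimal, illustrative sketch of this routing idea in plain Python follows; the class and modality names are assumptions for illustration, not the paper's actual implementation:

```python
# Hedged sketch of modality-aware sparse projection routing (illustrative only,
# not the Mixture-of-Mamba codebase). Each token is processed by the projection
# weights of its own modality, so parameters are decoupled across modalities
# while per-token compute stays the same as a shared dense projection.

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

class ModalityAwareProjection:
    """One projection matrix per modality; tokens route to their own matrix.

    `modality_weights` maps a modality name (e.g. 'text', 'image', 'speech')
    to a d_out x d_in weight matrix. The names are illustrative assumptions.
    """
    def __init__(self, modality_weights):
        self.weights = modality_weights

    def forward(self, tokens, modality_ids):
        # Route each token through its own modality's projection only:
        # the other modalities' parameters are untouched for this token,
        # which is the "modality-aware sparsity" in parameter space.
        return [matvec(self.weights[m], x) for x, m in zip(tokens, modality_ids)]

# Tiny usage example with 2-d toy weights (hypothetical values):
proj = ModalityAwareProjection({
    "text":  [[1.0, 0.0], [0.0, 1.0]],   # identity map for text tokens
    "image": [[2.0, 0.0], [0.0, 2.0]],   # scaling map for image tokens
})
out = proj.forward([[1.0, 2.0], [1.0, 2.0]], ["text", "image"])
# out[0] == [1.0, 2.0], out[1] == [2.0, 4.0]
```

In the paper this decoupling is applied jointly to the projection components inside each Mamba block (the ablation study reports that joint decoupling outperforms decoupling any single component), whereas the sketch above shows the routing principle on a single projection.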
Problem

Research questions and friction points this paper is trying to address.

State Space Models
Multi-modal Information
Performance Improvement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mamba Hybrid Model
Multimodal Information Processing
Efficient Computation