🤖 AI Summary
To address the quadratic computational complexity, heavy data dependence, and poor scalability of unified multimodal understanding and generation models, this paper proposes OmniMamba, the first multimodal generation framework built on the linear state space model Mamba-2. Methodologically, it (i) introduces decoupled vocabularies to guide modality-specific generation; (ii) designs task-specific LoRA adapters for parameter-efficient fine-tuning; and (iii) adopts a decoupled two-stage training strategy to mitigate image-text data imbalance. Experiments show that, trained on only 2 million image-text pairs (1,000 times fewer than Show-o), the framework surpasses Show-o and matches JanusFlow across benchmarks. It also achieves up to a 119.2 times speedup and a 63% reduction in GPU memory for long-sequence generation, substantially easing the efficiency bottlenecks of Transformer-based architectures.
📝 Abstract
Recent advancements in unified multimodal understanding and visual generation (or multimodal generation) models have been hindered by their quadratic computational complexity and dependence on large-scale training data. We present OmniMamba, the first linear-architecture-based multimodal generation model that generates both text and images through a unified next-token prediction paradigm. The model fully leverages Mamba-2's high computational and memory efficiency, extending its capabilities from text generation to multimodal generation. To address the data inefficiency of existing unified models, we propose two key innovations: (1) decoupled vocabularies to guide modality-specific generation, and (2) task-specific LoRA for parameter-efficient adaptation. Furthermore, we introduce a decoupled two-stage training strategy to mitigate the data imbalance between the two tasks. Equipped with these techniques, OmniMamba achieves performance competitive with JanusFlow while surpassing Show-o across benchmarks, despite being trained on merely 2M image-text pairs, 1,000 times fewer than Show-o. Notably, OmniMamba stands out with outstanding inference efficiency, achieving up to a 119.2 times speedup and a 63% GPU memory reduction for long-sequence generation compared to Transformer-based counterparts. Code and models are released at https://github.com/hustvl/OmniMamba.
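The two data-efficiency ideas named above, decoupled vocabularies and task-specific LoRA, can be illustrated with a minimal sketch. This is not the paper's implementation: all dimensions, task names, and the plain matrix stand-in for a Mamba-2 block are illustrative assumptions. The sketch shows (a) separate embedding tables so each modality is generated over its own vocabulary, and (b) a low-rank update W + BA applied per task while the shared backbone weight stays frozen.

```python
import numpy as np

# Hypothetical sketch of the abstract's two ideas; sizes and task names
# are made up for illustration, not taken from the paper.
rng = np.random.default_rng(0)

d_model, text_vocab, image_vocab, rank = 64, 1000, 512, 4

# (1) Decoupled vocabularies: one embedding table per modality, so the
# model only scores tokens of the modality it is currently generating.
text_embed = rng.standard_normal((text_vocab, d_model)) * 0.02
image_embed = rng.standard_normal((image_vocab, d_model)) * 0.02

# Frozen shared backbone projection (a stand-in for a Mamba-2 block).
W = rng.standard_normal((d_model, d_model)) * 0.02

# (2) Task-specific LoRA: one low-rank pair (A, B) per task. A starts at
# zero (standard LoRA init), so B @ A = 0 and the adapted weight equals
# W before any fine-tuning; only A and B would be trained.
lora = {
    task: (np.zeros((rank, d_model)),                       # A
           rng.standard_normal((d_model, rank)) * 0.02)     # B
    for task in ("understanding", "generation")
}

def forward(token_id: int, task: str, modality: str) -> np.ndarray:
    """Embed a token, apply the backbone with the task's low-rank
    update W + B @ A, and score only the active modality's vocabulary."""
    embed = text_embed if modality == "text" else image_embed
    x = embed[token_id]
    A, B = lora[task]
    h = x @ (W + B @ A).T           # parameter-efficient adaptation
    return h @ embed.T              # logits over the decoupled vocabulary

logits = forward(token_id=7, task="generation", modality="image")
print(logits.shape)  # logits cover only the image vocabulary: (512,)
```

In a unified next-token prediction loop, the active modality would switch between the two tables (e.g. at a mode-switch token), while the LoRA pair is selected by the task being performed; the frozen backbone is shared throughout.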