TransMamba: Flexibly Switching between Transformer and Mamba

📅 2025-03-31
📈 Citations: 0 (influential: 0)
🤖 AI Summary
Transformers suffer from quadratic computational complexity, which hinders efficient long-sequence processing, while the linear-complexity Mamba exhibits unstable in-context learning and weaker multi-task generalization. Method: We propose TransMamba, a unified parameterized architecture that (1) enables dynamic mechanism switching between Transformer and Mamba across layers and sequence lengths via shared QKV/CBx matrices, (2) introduces a Memory Converter module that seamlessly maps attention outputs to SSM states, and (3) employs a TransPoint scheduling strategy, together uncovering a deep paradigmatic consistency between the two architectures. Contribution/Results: Extensive experiments demonstrate that TransMamba significantly outperforms both pure Transformer and pure Mamba baselines on multiple long-sequence benchmarks, including language modeling, time-series forecasting, and genomic sequence analysis, while achieving higher training efficiency, stronger context stability, and better multi-task generalization.
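
The summary above describes the switching mechanism only in prose. Below is a minimal, hedged sketch of that idea in PyTorch; it is not the authors' implementation. The names (HybridLayer, trans_point) and the outer-product state summary are hypothetical, the SSM branch is a plain gated linear recurrence rather than Mamba's selective scan, and the "shared QKV/CBx" pairing is approximated by reusing the Q/K/V projections for the recurrent branch.

```python
# Minimal sketch of a layer that shares one projection between an attention
# branch and an SSM-style branch and switches between them at a TransPoint.
# NOT the authors' implementation; names and the state summary are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HybridLayer(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        # One shared projection: the same weights serve as Q/K/V for attention
        # and (per the paper's pairing) as C/B/x inputs for the SSM branch.
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.out = nn.Linear(d_model, d_model, bias=False)
        self.decay = nn.Parameter(torch.zeros(d_model))  # stand-in for the SSM decay A

    def forward(self, x: torch.Tensor, trans_point: int) -> torch.Tensor:
        # x: (batch, seq_len, d_model); assumes 0 < trans_point <= seq_len.
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        # Attention segment: tokens before the TransPoint use causal attention.
        prefix = F.scaled_dot_product_attention(
            q[:, :trans_point], k[:, :trans_point], v[:, :trans_point], is_causal=True
        )

        # "Memory converter" stand-in: summarize the attention prefix into a
        # recurrent state of shape (batch, d_model, d_model) that seeds the SSM branch.
        state = torch.einsum("bld,ble->bde", k[:, :trans_point], v[:, :trans_point])

        # SSM-style segment: tokens from the TransPoint onward are processed by
        # a linear recurrence that reuses the same shared projections.
        a = torch.sigmoid(self.decay)  # per-channel decay in (0, 1)
        suffix_out = []
        for t in range(trans_point, x.size(1)):
            state = a.unsqueeze(-1) * state + torch.einsum("bd,be->bde", k[:, t], v[:, t])
            suffix_out.append(torch.einsum("bd,bde->be", q[:, t], state))
        suffix = torch.stack(suffix_out, dim=1) if suffix_out else x[:, :0]

        return self.out(torch.cat([prefix, suffix], dim=1))


# Toy usage: the same layer can switch earlier or later in the sequence.
layer = HybridLayer(d_model=64)
tokens = torch.randn(2, 128, 64)
print(layer(tokens, trans_point=96).shape)  # torch.Size([2, 128, 64])
print(layer(tokens, trans_point=32).shape)  # torch.Size([2, 128, 64])
```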

📝 Abstract
Transformers are the cornerstone of modern large language models, but their quadratic computational complexity limits efficiency in long-sequence processing. Recent advancements in Mamba, a state space model (SSM) with linear complexity, offer promising efficiency gains but suffer from unstable contextual learning and multitask generalization. This paper proposes TransMamba, a novel framework that unifies Transformer and Mamba through shared parameter matrices (e.g., QKV and CBx), and thus can dynamically switch between attention and SSM mechanisms at different token lengths and layers. We design a Memory Converter to bridge Transformer and Mamba by converting attention outputs into SSM-compatible states, ensuring seamless information flow at TransPoints where the transformation happens. The TransPoint scheduling is also thoroughly explored for further improvements. We conducted extensive experiments demonstrating that TransMamba achieves superior training efficiency and performance compared to baselines, and validated the deeper consistency between Transformer and Mamba paradigms, offering a scalable solution for next-generation sequence modeling.
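
The abstract says TransPoint scheduling is explored but does not spell out a schedule here. The snippet below is only an illustration of what such a schedule could look like, under my own assumption (not the paper's) that deeper layers hand tokens over to the SSM branch earlier; the function name and the linear-decay rule are hypothetical.

```python
# Hedged illustration of a possible TransPoint schedule (assumption, not the
# paper's actual strategy): deeper layers switch to the SSM branch earlier,
# so long suffixes get linear-cost processing while early layers keep attention.
def transpoint_schedule(seq_len: int, num_layers: int, min_frac: float = 0.25) -> list[int]:
    """One TransPoint per layer, decaying linearly from seq_len to min_frac * seq_len."""
    points = []
    for layer_idx in range(num_layers):
        frac = 1.0 - (1.0 - min_frac) * layer_idx / max(num_layers - 1, 1)
        points.append(max(1, int(seq_len * frac)))
    return points


print(transpoint_schedule(seq_len=1024, num_layers=8))
# [1024, 914, 804, 694, 585, 475, 365, 256]
```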
Problem

Research questions and friction points this paper is trying to address.

Overcome quadratic complexity of Transformers in long sequences
Address Mamba's unstable in-context learning and weak multitask generalization
Unify Transformer and Mamba via dynamic mechanism switching
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unifies Transformer and Mamba via shared parameters
Dynamically switches between attention and SSM mechanisms
Memory converter bridges Transformer and Mamba seamlessly
👥 Authors
Yixing Li (The Chinese University of Hong Kong)
Ruobing Xie (Tencent)
Zhen Yang (Tencent Hunyuan)
Xingwu Sun (Tencent)
Shuaipeng Li (Tencent)
Weidong Han (Tencent Inc.; School of Data Science, Fudan University)
Zhanhui Kang (Tencent Hunyuan)
Yu Cheng (The Chinese University of Hong Kong)
Chengzhong Xu (University of Macau)
Di Wang (Tencent Hunyuan)
Jie Jiang (Tencent Hunyuan)