TransMamba: Flexibly Switching between Transformer and Mamba

📅 2025-03-31
📈 Citations: 0 (influential: 0)
🤖 AI Summary
Transformers suffer from quadratic computational complexity, which hinders efficient long-sequence processing, while the linear-complexity Mamba exhibits unstable in-context learning and weaker multi-task generalization. Method: We propose TransMamba, a unified parameterized architecture that (1) enables dynamic mechanism switching between Transformer and Mamba across layers and sequence lengths via shared QKV/CBx matrices, (2) introduces a Memory Converter module that seamlessly maps attention outputs to SSM states, and (3) employs a TransPoint scheduling strategy, together uncovering a deep paradigmatic consistency between the two architectures. Contribution/Results: Extensive experiments demonstrate that TransMamba significantly outperforms both pure Transformer and pure Mamba baselines on multiple long-sequence benchmarks, including language modeling, time-series forecasting, and genomic sequence analysis, while achieving higher training efficiency, stronger context stability, and better multi-task generalization.
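
The summary above describes the switching mechanism only in prose. Below is a minimal, hedged sketch of that idea in PyTorch; it is not the authors' implementation. The names (HybridLayer, trans_point) and the outer-product state summary are hypothetical, the SSM branch is a plain gated linear recurrence rather than Mamba's selective scan, and the "shared QKV/CBx" pairing is approximated by reusing the Q/K/V projections for the recurrent branch.

```python
# Minimal sketch of a layer that shares one projection between an attention
# branch and an SSM-style branch and switches between them at a TransPoint.
# NOT the authors' implementation; names and the state summary are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HybridLayer(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        # One shared projection: the same weights serve as Q/K/V for attention
        # and (per the paper's pairing) as C/B/x inputs for the SSM branch.
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.out = nn.Linear(d_model, d_model, bias=False)
        self.decay = nn.Parameter(torch.zeros(d_model))  # stand-in for the SSM decay A

    def forward(self, x: torch.Tensor, trans_point: int) -> torch.Tensor:
        # x: (batch, seq_len, d_model); assumes 0 < trans_point <= seq_len.
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        # Attention segment: tokens before the TransPoint use causal attention.
        prefix = F.scaled_dot_product_attention(
            q[:, :trans_point], k[:, :trans_point], v[:, :trans_point], is_causal=True
        )

        # "Memory converter" stand-in: summarize the attention prefix into a
        # recurrent state of shape (batch, d_model, d_model) that seeds the SSM branch.
        state = torch.einsum("bld,ble->bde", k[:, :trans_point], v[:, :trans_point])

        # SSM-style segment: tokens from the TransPoint onward are processed by
        # a linear recurrence that reuses the same shared projections.
        a = torch.sigmoid(self.decay)  # per-channel decay in (0, 1)
        suffix_out = []
        for t in range(trans_point, x.size(1)):
            state = a.unsqueeze(-1) * state + torch.einsum("bd,be->bde", k[:, t], v[:, t])
            suffix_out.append(torch.einsum("bd,bde->be", q[:, t], state))
        suffix = torch.stack(suffix_out, dim=1) if suffix_out else x[:, :0]

        return self.out(torch.cat([prefix, suffix], dim=1))


# Toy usage: the same layer can switch earlier or later in the sequence.
layer = HybridLayer(d_model=64)
tokens = torch.randn(2, 128, 64)
print(layer(tokens, trans_point=96).shape)  # torch.Size([2, 128, 64])
print(layer(tokens, trans_point=32).shape)  # torch.Size([2, 128, 64])
```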

📝 Abstract
Transformers are the cornerstone of modern large language models, but their quadratic computational complexity limits efficiency in long-sequence processing. Recent advancements in Mamba, a state space model (SSM) with linear complexity, offer promising efficiency gains but suffer from unstable contextual learning and multitask generalization. This paper proposes TransMamba, a novel framework that unifies Transformer and Mamba through shared parameter matrices (e.g., QKV and CBx), and thus can dynamically switch between attention and SSM mechanisms at different token lengths and layers. We design a Memory Converter to bridge Transformer and Mamba by converting attention outputs into SSM-compatible states, ensuring seamless information flow at TransPoints where the transformation happens. The TransPoint scheduling is also thoroughly explored for further improvements. We conducted extensive experiments demonstrating that TransMamba achieves superior training efficiency and performance compared to baselines, and validated the deeper consistency between Transformer and Mamba paradigms, offering a scalable solution for next-generation sequence modeling.
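
The abstract says TransPoint scheduling is explored but does not spell out a schedule here. The snippet below is only an illustration of what such a schedule could look like, under my own assumption (not the paper's) that deeper layers hand tokens over to the SSM branch earlier; the function name and the linear-decay rule are hypothetical.

```python
# Hedged illustration of a possible TransPoint schedule (assumption, not the
# paper's actual strategy): deeper layers switch to the SSM branch earlier,
# so long suffixes get linear-cost processing while early layers keep attention.
def transpoint_schedule(seq_len: int, num_layers: int, min_frac: float = 0.25) -> list[int]:
    """One TransPoint per layer, decaying linearly from seq_len to min_frac * seq_len."""
    points = []
    for layer_idx in range(num_layers):
        frac = 1.0 - (1.0 - min_frac) * layer_idx / max(num_layers - 1, 1)
        points.append(max(1, int(seq_len * frac)))
    return points


print(transpoint_schedule(seq_len=1024, num_layers=8))
# [1024, 914, 804, 694, 585, 475, 365, 256]
```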
Problem

Research questions and friction points this paper is trying to address.

Overcome quadratic complexity of Transformers in long sequences
Address Mamba's unstable in-context learning and weak multitask generalization
Unify Transformer and Mamba via dynamic mechanism switching
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unifies Transformer and Mamba via shared parameters
Dynamically switches between attention and SSM mechanisms
Memory converter bridges Transformer and Mamba seamlessly
👥 Authors
Yixing Li (The Chinese University of Hong Kong)
Ruobing Xie (Tencent)
Zhen Yang (Tencent Hunyuan)
Xingwu Sun (Tencent)
Shuaipeng Li (Tencent)
Weidong Han (Tencent Inc.; School of Data Science, Fudan University)
Zhanhui Kang (Tencent Hunyuan)
Yu Cheng (The Chinese University of Hong Kong)
Chengzhong Xu (University of Macau)
Di Wang (Tencent Hunyuan)
Jie Jiang (Tencent Hunyuan)