🤖 AI Summary
Neither pure architecture covers both global retrieval and sequence-aware importance: Transformers can attend to any position but have no built-in way to prioritize critical content, while state space models (SSMs) track what matters but cannot revisit earlier context. Existing hybrids such as Jamba (block level) and Hymba (head level) keep the two mechanisms in separate compartments. This work proposes SISA (SSM-Informed Softmax Attention), a “score-level fusion” approach that adds an SSM-derived importance term directly inside the attention score. Because the fusion is expressed as augmented query and key vectors passed to standard scaled dot-product attention (SDPA), it requires no recurrent state and no custom kernel. At 152M parameters, SISA reaches 17.3% accuracy on LAMBADA-greedy, ahead of Transformer (13.9%) and Mamba-3 (15.5%); at 369M, Mamba-3 leads on LAMBADA while SISA keeps perfect NIAH recall. On the NIAH task, SISA reaches 100% recall from training step 1K, 7x faster than the Transformer's retrieval convergence.
📝 Abstract
Combining attention's global retrieval with the sequential importance signal of state space models (SSMs) is the open challenge of hybrid language modeling. Transformers see everywhere but cannot prioritize; SSMs know what matters but cannot revisit. Existing hybrids -- Jamba (block level) and Hymba (head level) -- place the two in separate compartments, so neither informs the other during the attention computation itself. We propose SISA (SSM-Informed Softmax Attention), which adds an SSM-derived importance term directly inside the attention score and realizes the full operation as a single SDPA call on augmented query/key vectors -- no recurrent state, no custom kernel. At 152M parameters / 5B tokens, SISA reaches 17.3% on LAMBADA-greedy (vs. 13.9% for Transformer and 15.5% for Mamba-3) and attains 100% on NIAH from step 1K, 7x faster than Transformer's retrieval convergence; at 369M, Mamba-3 leads on LAMBADA while SISA preserves perfect NIAH and stock-SDPA execution. SISA thus defines a third design axis for SSM-attention hybrids -- score-level fusion -- beyond the block-level and head-level paradigms that have dominated the field.
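The abstract does not spell out the exact form of the importance term, so the following is only a minimal sketch of how an additive per-key term could be folded into a single SDPA call through augmented query/key vectors. It assumes the term is a scalar bias `importance[j]` per key position (a stand-in for whatever the SSM actually emits), and every function name here is hypothetical rather than taken from the paper. Appending `sqrt(d)` to each query and `importance[j]` to each key makes the ordinary scaled dot product equal `q·k / sqrt(d) + importance[j]`, so the fused version needs no change to the attention routine itself.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax_mix(logits, values):
    """Softmax over logits, then the weighted average of the value vectors."""
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[c] for w, v in zip(weights, values)) / total
            for c in range(dim)]

def sdpa(queries, keys, values, scale):
    """Plain (non-causal) scaled dot-product attention; knows nothing about SSMs."""
    return [softmax_mix([scale * dot(q, k) for k in keys], values)
            for q in queries]

def fused_attention(queries, keys, values, importance):
    """Hypothetical score-level fusion: one stock sdpa call on augmented q/k.

    Assumption: the importance signal is an additive scalar per key position.
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)          # scale of the ORIGINAL head dimension
    aug_q = [q + [math.sqrt(d)] for q in queries]       # extra coord cancels the scale
    aug_k = [k + [b] for k, b in zip(keys, importance)]  # extra coord carries the bias
    return sdpa(aug_q, aug_k, values, scale)

def biased_attention(queries, keys, values, importance):
    """Reference: the same scores computed with an explicit bias on the logits."""
    d = len(queries[0])
    return [softmax_mix([dot(q, k) / math.sqrt(d) + b
                         for k, b in zip(keys, importance)], values)
            for q in queries]

# Toy inputs (illustrative values only).
queries = [[1.0, 0.0], [0.5, -0.5]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
importance = [0.0, 3.0, -1.0]

fused = fused_attention(queries, keys, values, importance)
reference = biased_attention(queries, keys, values, importance)
```

In this sketch `fused` and `reference` agree up to floating-point error, and an all-zero `importance` reduces to ordinary attention. With a library SDPA routine the same trick would need the scale set from the original head dimension rather than the augmented one, since many implementations default to the width of the vectors they are given.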