On Subquadratic Architectures: From Applications to Principles

📅 2026-06-10
🤖 AI Summary
This study addresses the high computational cost of Transformer attention, which scales quadratically with sequence length, by systematically comparing three subquadratic sequence modeling architectures: xLSTM, Mamba-2, and Gated DeltaNet. The evaluation covers code-model pre-training, distillation of code models from large language models, and pre-training of time-series foundation models. Using a unified formulation, the work analyzes state tracking and memory dynamics and finds that xLSTM's gating scheme enables more flexible and stable memory correction. The authors attribute xLSTM's strongest overall performance on these complex-dependency tasks to this robust state tracking and accumulation, and corroborate the finding on controlled synthetic length-generalization tasks.
📝 Abstract
Transformers dominate modern sequence modeling, but their quadratic attention incurs substantial computational cost. Subquadratic architectures offer a scalable alternative. However, it remains unclear which designs yield the most effective sequence models. We compare three leading approaches: xLSTM, Mamba-2, and Gated DeltaNet. We evaluate these models on tasks with complex dependencies: (1) code-model pre-training, (2) distillation of code models from large language models, and (3) pre-training of time-series foundation models. Across these settings, xLSTM delivers the strongest overall performance. To explain xLSTM's advantage, we present a unified formulation and analyze the underlying architectural mechanisms, focusing on state tracking and memory dynamics. Our results show that xLSTM enables more flexible and stable memory correction via its gating scheme. We corroborate these findings on controlled synthetic length-generalization tasks. Overall, our findings indicate that xLSTM's gains on complex tasks stem from robust state tracking and accumulation.
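To make the notions of "memory dynamics" and "memory correction" concrete, the following is a minimal toy sketch, not the paper's unified formulation (which this listing does not reproduce). It assumes a fixed-size matrix memory S that is updated once per token, so the cost per step does not grow with sequence length. The function names, the scalar gates, and the two update rules (a gated additive write and a gated delta-rule write, loosely in the style of gated linear recurrences and DeltaNet-family models) are illustrative assumptions.

```python
# Toy sketch under stated assumptions: generic matrix-memory recurrences,
# NOT the formulation or the models evaluated in the paper.

def outer(v, k):
    """Outer product v k^T as a list of rows."""
    return [[vi * kj for kj in k] for vi in v]

def matvec(S, q):
    """Read from memory: S q."""
    return [sum(sij * qj for sij, qj in zip(row, q)) for row in S]

def gated_additive_step(S, k, v, forget, write):
    """S <- forget * S + write * v k^T (decay the memory, then add an association)."""
    vk = outer(v, k)
    return [[forget * s + write * u for s, u in zip(rs, ru)]
            for rs, ru in zip(S, vk)]

def gated_delta_step(S, k, v, forget, beta):
    """S <- forget * S + beta * (v - forget * S k) k^T (correct what is stored under k)."""
    decayed = [[forget * s for s in row] for row in S]
    error = [vi - pi for vi, pi in zip(v, matvec(decayed, k))]
    correction = outer(error, k)
    return [[d + beta * c for d, c in zip(rd, rc)]
            for rd, rc in zip(decayed, correction)]

zeros = [[0.0, 0.0], [0.0, 0.0]]
key = [1.0, 0.0]          # unit-norm key, written to twice
old_value = [1.0, 0.0]
new_value = [0.0, 1.0]

# Additive writes with no forgetting: both values are superimposed under the key.
S = gated_additive_step(zeros, key, old_value, forget=1.0, write=1.0)
S = gated_additive_step(S, key, new_value, forget=1.0, write=1.0)
read_additive = matvec(S, key)

# Additive write with the forget gate closed: the old content is erased first.
S = gated_additive_step(zeros, key, old_value, forget=1.0, write=1.0)
S = gated_additive_step(S, key, new_value, forget=0.0, write=1.0)
read_gated = matvec(S, key)

# Delta-rule write: only the content stored under this key is replaced.
S = gated_delta_step(zeros, key, old_value, forget=1.0, beta=1.0)
S = gated_delta_step(S, key, new_value, forget=1.0, beta=1.0)
read_delta = matvec(S, key)

print(read_additive, read_gated, read_delta)
```

In this toy setting the memory stays a 2x2 matrix regardless of how many tokens are processed, which is the sense in which such recurrences avoid the quadratic cost of attention. The sketch says nothing about which scheme performs better in practice; that comparison is the subject of the paper's experiments.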
Problem

Research questions and friction points this paper is trying to address.

subquadratic architectures
sequence modeling
computational cost
attention mechanism
model effectiveness
Innovation

Methods, ideas, or system contributions that make the work stand out.

subquadratic architectures
xLSTM
state tracking
memory dynamics
gating mechanism