🤖 AI Summary
To address the challenge of balancing expressive power and computational efficiency in state space models (SSMs) for long-sequence modeling, this paper proposes a hybrid architecture that combines spectral-domain state space layers with windowed attention. It introduces the Spectral Transform Unit (STU), an SSM variant that operates in the spectral domain, and interleaves it with sliding-window attention and segmented state caching to enable billion-parameter scalability at near-linear complexity. Efficient state updates are realized via Fourier- and Toeplitz-based spectral transforms, and the authors design a hardware-aware Flash-STU kernel optimized for modern accelerators. The approach breaks the traditional efficiency–capability trade-off inherent in both SSMs and Transformers. Extensive experiments demonstrate state-of-the-art performance across linear system identification, robotic control, and language modeling, outperforming S4, Mamba-2, and Transformer baselines.
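The spectral-domain state update behind the STU can be sketched as follows: in the spectral filtering literature, the filters are the top-k eigenvectors of a fixed Hankel matrix, and features are obtained by causally convolving the input with each filter via FFT. This is an illustrative numpy sketch under those assumptions, not the paper's implementation; the function names and the zero-padding scheme are my own.

```python
import numpy as np

def spectral_filters(seq_len: int, k: int) -> np.ndarray:
    # Hankel matrix used in spectral filtering: Z[i, j] = 2 / ((i+j)^3 - (i+j)),
    # with 1-based indices. Its top eigenvectors serve as fixed convolution filters.
    idx = np.arange(1, seq_len + 1)
    s = idx[:, None] + idx[None, :]
    Z = 2.0 / (s**3 - s)
    # eigh returns eigenvalues in ascending order; take the top-k eigenvectors.
    _, vecs = np.linalg.eigh(Z)
    return vecs[:, -k:]  # shape (seq_len, k)

def stu_features(x: np.ndarray, filters: np.ndarray) -> np.ndarray:
    # Causal convolution of input x (seq_len, d) with each filter, done via FFT
    # in O(L log L) rather than O(L^2).
    L, d = x.shape
    n = 2 * L  # zero-pad so the circular convolution matches linear convolution
    Xf = np.fft.rfft(x, n=n, axis=0)          # (n//2+1, d)
    Ff = np.fft.rfft(filters, n=n, axis=0)    # (n//2+1, k)
    conv = np.fft.irfft(Xf[:, None, :] * Ff[:, :, None], n=n, axis=0)[:L]
    return conv  # shape (L, k, d): one filtered stream per spectral filter
```

In the real architecture these filtered features would be mixed by learned projections; the sketch only shows the fixed-filter, FFT-based core that gives the near-linear complexity.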
📝 Abstract
Recent advances in state-space model architectures have shown great promise for efficient sequence modeling, but challenges remain in balancing computational efficiency with model expressiveness. We propose the Flash STU architecture, a hybrid model that interleaves spectral state space model layers with sliding window attention, enabling scalability to billions of parameters for language modeling while maintaining near-linear time complexity. We evaluate the Flash STU and its variants on diverse sequence prediction tasks, including linear dynamical systems, robotics control, and language modeling. We find that, given a fixed parameter budget, the Flash STU architecture consistently outperforms the Transformer as well as other leading state-space models such as S4 and Mamba-2.
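The interleaving of spectral layers with sliding window attention described above can be sketched roughly as follows. Everything here is an illustrative assumption: a plain causal FFT convolution stands in for the STU sublayer, the attention is single-head with no learned projections, and `hybrid_forward` with its residual wiring is a hypothetical name, not the paper's API.

```python
import numpy as np

def sliding_window_attention(x: np.ndarray, window: int) -> np.ndarray:
    # Single-head causal attention where each position attends only to the
    # last `window` positions, so cost grows linearly in sequence length.
    L, d = x.shape
    scores = (x @ x.T) / np.sqrt(d)
    i = np.arange(L)
    causal = i[None, :] <= i[:, None]
    in_window = i[:, None] - i[None, :] < window
    scores = np.where(causal & in_window, scores, -np.inf)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ x

def hybrid_forward(x: np.ndarray, conv_filters: list, window: int) -> np.ndarray:
    # Alternate a spectral-convolution sublayer with a sliding-window
    # attention sublayer, each wrapped in a residual connection.
    for filt in conv_filters:
        L = x.shape[0]
        # Causal FFT convolution standing in for the STU sublayer.
        Ff = np.fft.rfft(filt, n=2 * L)
        conv = np.fft.irfft(
            np.fft.rfft(x, n=2 * L, axis=0) * Ff[:, None], n=2 * L, axis=0
        )[:L]
        x = x + conv
        x = x + sliding_window_attention(x, window)
    return x
```

The point of the pattern is division of labor: the convolutional sublayer carries long-range dependencies cheaply, while the windowed attention handles precise local token interactions within a fixed-size window.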