Flash STU: Fast Spectral Transform Units

📅 2024-09-16
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
To balance expressive power with computational efficiency in state space models (SSMs) for long-sequence modeling, this paper proposes a hybrid architecture combining spectral-domain layers with windowed attention. It introduces the Spectral Transform Unit (STU), an SSM variant that operates in the spectral domain, and interleaves it with sliding-window attention and segmented state caching to scale to billions of parameters at near-linear complexity. State updates are computed efficiently via Fourier- and Toeplitz-based spectral transforms, and a hardware-aware Flash STU kernel is designed for modern accelerators. The approach relaxes the efficiency-capability trade-off inherent in both SSMs and Transformers. Experiments show strong performance on linear system identification, robotic control, and language modeling, outperforming S4, Mamba-2, and Transformer baselines at a fixed parameter budget.
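The Fourier-based spectral transform described above can be sketched as a causal convolution of the input with a bank of fixed filters, computed via FFT in O(L log L) time. This is a minimal sketch, not the paper's implementation: the function name is ours, and the filter bank (which the paper derives spectrally) is left abstract.

```python
import numpy as np

def fft_causal_conv(u, filters):
    """Causally convolve an input sequence with K fixed filters via FFT.

    u:       (L, d_in) input sequence
    filters: (K, L) fixed filter bank (hypothetical stand-in for the
             paper's spectrally derived filters)
    returns: (L, K, d_in) filtered features, where output t depends
             only on inputs 0..t
    """
    L = u.shape[0]
    n = 2 * L  # zero-pad so circular FFT convolution equals linear convolution
    U = np.fft.rfft(u, n=n, axis=0)        # (n//2+1, d_in)
    F = np.fft.rfft(filters, n=n, axis=1)  # (K, n//2+1)
    # Multiply in the frequency domain, invert, keep the first L (causal) samples.
    Y = np.fft.irfft(F[:, :, None] * U[None], n=n, axis=1)[:, :L]  # (K, L, d_in)
    return Y.transpose(1, 0, 2)
```

Padding to length 2L before the FFT avoids wrap-around, so the first L output samples match a direct causal convolution exactly.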

📝 Abstract
Recent advances in state-space model architectures have shown great promise for efficient sequence modeling, but challenges remain in balancing computational efficiency with model expressiveness. We propose the Flash STU architecture, a hybrid model that interleaves spectral state space model layers with sliding window attention, enabling scalability to billions of parameters for language modeling while maintaining a near-linear time complexity. We evaluate the Flash STU and its variants on diverse sequence prediction tasks, including linear dynamical systems, robotics control, and language modeling. We find that, given a fixed parameter budget, the Flash STU architecture consistently outperforms the Transformer and other leading state-space models such as S4 and Mamba-2.
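The sliding-window attention component of the hybrid can be illustrated with a boolean mask that restricts each position to its w most recent predecessors, cutting attention cost from O(L^2) to O(L*w). A hypothetical sketch; the function name and exact masking convention are ours, not the paper's.

```python
import numpy as np

def sliding_window_mask(L, w):
    """Boolean causal attention mask of shape (L, L): position t may
    attend to positions max(0, t - w + 1)..t, i.e. a causal window
    of width w ending at t."""
    idx = np.arange(L)
    causal = idx[None, :] <= idx[:, None]          # no attending to the future
    windowed = idx[:, None] - idx[None, :] < w     # at most w-1 steps back
    return causal & windowed
```

Stacking such windowed layers still propagates information over long ranges, since each layer extends the effective receptive field by roughly w positions.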
Problem

Research questions and friction points this paper is trying to address.

Balancing computational efficiency with model expressiveness in sequence models
Scaling language models to billions of parameters at near-linear cost
Competing with the Transformer and leading state-space models such as S4 and Mamba-2 under a fixed parameter budget
Innovation

Methods, ideas, or system contributions that make the work stand out.

Spectral Transform Unit (STU): state space model layers operating in the spectral domain
Hybrid architecture interleaving STU layers with sliding-window attention
Near-linear time complexity, enabling scaling to billions of parameters
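The interleaving described in the bullets above can be sketched as a stack that alternates a spectral (STU) layer and a sliding-window attention layer, each wrapped in a residual connection. A hypothetical skeleton under our own naming, with the layer internals left abstract:

```python
def flash_stu_stack(x, stu_layers, attn_layers):
    """Alternate spectral (STU) layers with sliding-window attention
    layers, each as a residual update. stu_layers and attn_layers are
    equal-length lists of callables mapping a sequence to a sequence
    (hypothetical interface; the paper's blocks also include norms
    and MLPs, omitted here)."""
    for stu, attn in zip(stu_layers, attn_layers):
        x = x + stu(x)   # global, near-linear-cost spectral mixing
        x = x + attn(x)  # local, windowed attention mixing
    return x
```

The intended division of labor is that the spectral layers capture long-range structure cheaply while the windowed attention sharpens local interactions.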