🤖 AI Summary
To address the challenge of balancing expressive power and computational efficiency in state space models (SSMs) for long-sequence modeling, this paper proposes a hybrid architecture that combines spectral-domain state space layers with windowed attention. It introduces the Spectral Transform Unit (STU), an SSM variant that operates in the spectral domain, and interleaves it with sliding-window attention and segmented state caching to enable billion-parameter scalability at near-linear complexity. Efficient state updates are realized via Fourier- and Toeplitz-based spectral transforms, and the authors design a hardware-aware Flash-STU kernel optimized for modern accelerators. The approach breaks the traditional efficiency–capability trade-off inherent in both SSMs and Transformers. Extensive experiments demonstrate state-of-the-art performance across linear system identification, robotic control, and language modeling, outperforming S4, Mamba-2, and Transformer baselines.
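The spectral-domain state update behind the STU can be sketched as follows: in the spectral filtering literature, the filters are the top-k eigenvectors of a fixed Hankel matrix, and features are obtained by causally convolving the input with each filter via FFT. This is an illustrative numpy sketch under those assumptions, not the paper's implementation; the function names and the zero-padding scheme are my own.

```python
import numpy as np

def spectral_filters(seq_len: int, k: int) -> np.ndarray:
    # Hankel matrix used in spectral filtering: Z[i, j] = 2 / ((i+j)^3 - (i+j)),
    # with 1-based indices. Its top eigenvectors serve as fixed convolution filters.
    idx = np.arange(1, seq_len + 1)
    s = idx[:, None] + idx[None, :]
    Z = 2.0 / (s**3 - s)
    # eigh returns eigenvalues in ascending order; take the top-k eigenvectors.
    _, vecs = np.linalg.eigh(Z)
    return vecs[:, -k:]  # shape (seq_len, k)

def stu_features(x: np.ndarray, filters: np.ndarray) -> np.ndarray:
    # Causal convolution of input x (seq_len, d) with each filter, done via FFT
    # in O(L log L) rather than O(L^2).
    L, d = x.shape
    n = 2 * L  # zero-pad so the circular convolution matches linear convolution
    Xf = np.fft.rfft(x, n=n, axis=0)          # (n//2+1, d)
    Ff = np.fft.rfft(filters, n=n, axis=0)    # (n//2+1, k)
    conv = np.fft.irfft(Xf[:, None, :] * Ff[:, :, None], n=n, axis=0)[:L]
    return conv  # shape (L, k, d): one filtered stream per spectral filter
```

In the real architecture these filtered features would be mixed by learned projections; the sketch only shows the fixed-filter, FFT-based core that gives the near-linear complexity.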
📝 Abstract
Recent advances in state-space model architectures have shown great promise for efficient sequence modeling, but challenges remain in balancing computational efficiency with model expressiveness. We propose the Flash STU architecture, a hybrid model that interleaves spectral state space model layers with sliding window attention, enabling scalability to billions of parameters for language modeling while maintaining near-linear time complexity. We evaluate the Flash STU and its variants on diverse sequence prediction tasks, including linear dynamical systems, robotics control, and language modeling. We find that, given a fixed parameter budget, the Flash STU architecture consistently outperforms the Transformer as well as other leading state-space models such as S4 and Mamba-2.
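The interleaving of spectral layers with sliding window attention described above can be sketched roughly as follows. Everything here is an illustrative assumption: a plain causal FFT convolution stands in for the STU sublayer, the attention is single-head with no learned projections, and `hybrid_forward` with its residual wiring is a hypothetical name, not the paper's API.

```python
import numpy as np

def sliding_window_attention(x: np.ndarray, window: int) -> np.ndarray:
    # Single-head causal attention where each position attends only to the
    # last `window` positions, so cost grows linearly in sequence length.
    L, d = x.shape
    scores = (x @ x.T) / np.sqrt(d)
    i = np.arange(L)
    causal = i[None, :] <= i[:, None]
    in_window = i[:, None] - i[None, :] < window
    scores = np.where(causal & in_window, scores, -np.inf)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ x

def hybrid_forward(x: np.ndarray, conv_filters: list, window: int) -> np.ndarray:
    # Alternate a spectral-convolution sublayer with a sliding-window
    # attention sublayer, each wrapped in a residual connection.
    for filt in conv_filters:
        L = x.shape[0]
        # Causal FFT convolution standing in for the STU sublayer.
        Ff = np.fft.rfft(filt, n=2 * L)
        conv = np.fft.irfft(
            np.fft.rfft(x, n=2 * L, axis=0) * Ff[:, None], n=2 * L, axis=0
        )[:L]
        x = x + conv
        x = x + sliding_window_attention(x, window)
    return x
```

The point of the pattern is division of labor: the convolutional sublayer carries long-range dependencies cheaply, while the windowed attention handles precise local token interactions within a fixed-size window.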