🤖 AI Summary
Long convolutional sequence models (e.g., Hyena) suffer from O(L²) time complexity during inference, severely limiting scalability for long sequences.
Method: The paper proposes the first exact-inference framework for these models with quasi-linear acceleration, reducing overall inference complexity to O(L log²L). The core idea is to expose the parallelism and computational reuse inherent in the positional-mixing module, combining block-wise (tiled) computation with relaxed polynomial interpolation, further aided by memory-locality optimization and cross-layer parallelization.
Contribution/Results: The framework is architecture-agnostic, requires no approximation or retraining, and delivers end-to-end speedups of up to 7.8× on Hyena, with the positional-mixing module alone accelerated by up to 110×. This substantially eases the inference-efficiency bottleneck in long-sequence modeling while preserving exact outputs.
📝 Abstract
While transformers have been at the core of most recent advancements in sequence generative models, their computational cost remains quadratic in sequence length. Several subquadratic architectures have been proposed to address this computational issue. Some of them, including long convolution sequence models (LCSMs), such as Hyena, address this issue at training time but remain quadratic during inference. We propose a method for speeding up LCSMs' exact inference to quasilinear $O(L \log^2 L)$ time, identify the key properties that make this possible, and propose a general framework that exploits these. Our approach, inspired by previous work on relaxed polynomial interpolation, is based on a tiling that helps decrease memory movement and share computation. It has the added benefit of allowing for almost complete parallelization across layers of the position-mixing part of the architecture. Empirically, we provide a proof-of-concept implementation for Hyena, which gets up to $7.8\times$ end-to-end improvement over standard inference by improving $110\times$ within the position-mixing part.