Flash Inference: Near Linear Time Inference for Long Convolution Sequence Models and Beyond

📅 2024-10-16
🏛️ International Conference on Learning Representations
📈 Citations: 2
Influential: 0
🤖 AI Summary
Long convolutional sequence models (e.g., Hyena) suffer from O(L²) time complexity during inference, severely limiting scalability for long sequences. Method: This paper proposes the first exact-inference framework achieving quasi-linear acceleration—reducing overall complexity to O(L log²L). The core innovation lies in uncovering inherent parallelism and computational reuse in the positional mixing module, enabling a synergistic design of block-wise computation and relaxed polynomial interpolation, further enhanced by memory locality optimization and inter-layer parallelization. Contribution/Results: The framework is architecture-agnostic, requires no approximation or retraining, and delivers end-to-end speedups of up to 7.8× on Hyena, with the positional mixing module alone accelerated by up to 110×. This breakthrough significantly alleviates the inference efficiency bottleneck in long-sequence modeling while preserving numerical exactness.

📝 Abstract
While transformers have been at the core of most recent advancements in sequence generative models, their computational cost remains quadratic in sequence length. Several subquadratic architectures have been proposed to address this computational issue. Some of them, including long convolution sequence models (LCSMs), such as Hyena, address this issue at training time but remain quadratic during inference. We propose a method for speeding up LCSMs' exact inference to quasilinear $O(L \log^2 L)$ time, identify the key properties that make this possible, and propose a general framework that exploits them. Our approach, inspired by previous work on relaxed polynomial interpolation, is based on a tiling which helps decrease memory movement and share computation. It has the added benefit of allowing for almost complete parallelization across layers of the position-mixing part of the architecture. Empirically, we provide a proof of concept implementation for Hyena, which gets up to $7.8\times$ end-to-end improvement over standard inference by improving $110\times$ within the position-mixing part.
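The quadratic inference cost the abstract refers to is easy to see in a direct implementation. Below is a minimal NumPy sketch (illustrative only, not the paper's code; the function name is hypothetical) of per-token causal long convolution, the operation at the heart of LCSM inference:

```python
import numpy as np

def naive_lcsm_inference(k, u):
    """Per-token causal long convolution: y[t] = sum_{s<=t} k[t-s] * u[s].

    Step t touches t+1 filter taps, so exact autoregressive inference
    over L tokens costs O(L^2) in total -- the bottleneck this paper
    targets. (Hypothetical sketch, not the authors' implementation.)
    """
    L = len(u)
    y = np.empty(L)
    for t in range(L):                             # one step per token
        y[t] = np.dot(k[:t + 1][::-1], u[:t + 1])  # O(t) work at step t
    return y
```

Against a full offline convolution (`np.convolve(u, k)[:L]`) the outputs match exactly, which is why the paper can speak of *exact* inference: the goal is the same numbers, computed with less work.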
Problem

Research questions and friction points this paper is trying to address.

Reduces quadratic inference cost in long convolution sequence models
Enables quasilinear time complexity for exact inference
Improves memory movement and computation sharing through tiling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Speeds up LCSM inference to quasilinear time
Uses tiling to reduce memory and computation
Enables parallelization across position-mixing layers
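The tiling idea above can be sketched as a streaming ("relaxed") convolution: inputs inside the current, incomplete block contribute to the output naively, and each completed block's contribution to all later positions is added in one batched convolution (an FFT product in a real implementation). This is an illustrative sketch of the general block scheme under assumed names and a fixed block size, not the paper's actual recursive algorithm:

```python
import numpy as np

def online_causal_conv(k, u, block=4):
    """Streaming causal convolution with block-wise computation sharing.

    Sketch of the tiling idea (hypothetical, not the paper's code):
    inputs in the current block update the output one token at a time;
    once a block of inputs completes, its contribution to every later
    position is added in a single batched convolution. With FFT products
    and recursively chosen block sizes (as in relaxed polynomial
    interpolation), this style of scheme reaches O(L log^2 L).
    """
    L = len(u)
    y = np.zeros(L)
    for t in range(L):
        b0 = (t // block) * block          # start of the current block
        n = t - b0 + 1                     # tokens seen in this block
        # naive contribution of the current (incomplete) block
        y[t] += np.dot(k[:n][::-1], u[b0:t + 1])
        if n == block:                     # block just completed:
            # add its contribution to all future positions at once
            tail = np.convolve(u[b0:t + 1], k)  # stand-in for FFT conv
            for j in range(block, len(tail)):   # positions after t only
                pos = b0 + j
                if pos < L:
                    y[pos] += tail[j]
    return y
```

Per token, only O(block) naive work is done; the heavy lifting is amortized into one shared convolution per block, which is also what exposes the cross-layer parallelism mentioned above, since a block's batched contribution can be computed as soon as its inputs are fixed.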
👥 Authors
Costin-Andrei Oncescu, Harvard University
S. Purandare, Harvard University
Stratos Idreos, Harvard University
S. Kakade, Harvard University