🤖 AI Summary
This work addresses the challenge, in autoregressive generation, of simultaneously achieving fast decoding, low memory overhead, and effective long-range dependency modeling. The authors propose a unified framework of structured, generalized linear recurrences that separates the direct influence of inputs on outputs within a single generation step from the recurrent propagation of information across steps. By allowing each state to depend on multiple past states rather than only the immediate predecessor, the framework trades runtime for expressivity in a principled way while keeping computational complexity controlled. The formulation encompasses both state-space models and attention mechanisms, giving a unified view of token mixing. Experiments on synthetic tasks and language modeling provide empirical validation, and the framework offers a toolkit for understanding and designing efficient and expressive token mixers.
📝 Abstract
Token mixing layers play a key role in how language models learn and generate long-range dependencies. Their efficiency is governed by a trade-off between decoding speed and memory requirements, including cache size. Focusing on causal generation, this paper explores new trade-offs through a unified framework that separates two crucial features: (i) the direct influence of inputs on outputs within one generation step; (ii) the recurrent propagation of information through past outputs. This framework encompasses major architectures such as attention and state-space models, and also generalizes the recurrence equations by allowing each state to depend on multiple past states rather than only the immediate predecessor. By introducing structure, we design new recurrence patterns that provably achieve the desired complexity, and we provide theoretical insights into their expressivity, trading runtime for expressivity in a principled way. Empirical validation is performed on synthetic tasks and on language modeling. Together, these results provide a unified toolkit for the understanding and design of efficient and expressive token mixers across model families.
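To make the idea of a state depending on several past states concrete, here is a minimal scalar sketch in plain Python. It is an illustration under stated assumptions, not the paper's formulation: the function name `multi_lag_recurrence`, the coefficients `b`, `c`, `d`, and the choice of fixed scalar lag weights are all hypothetical. The term `d * x` stands in for the direct input-to-output influence within one step, and the sum over lags stands in for the recurrent propagation.

```python
def multi_lag_recurrence(xs, lag_weights, b=1.0, c=1.0, d=0.0):
    """Scalar linear recurrence in which each state mixes several past states.

    h_t = sum_k lag_weights[k] * h_{t-k} + b * x_t   (recurrent propagation)
    y_t = c * h_t + d * x_t                          (direct input-to-output path)

    lag_weights maps a positive lag k to its coefficient (an assumption of this
    sketch). States before the start of the sequence are treated as zero.
    """
    hs, ys = [], []
    for t, x in enumerate(xs):
        h = b * x
        for lag, w in lag_weights.items():
            if t - lag >= 0:
                h += w * hs[t - lag]
        hs.append(h)
        ys.append(c * h + d * x)
    return ys


# Impulse response: a single unit input followed by zeros.
xs = [1.0, 0.0, 0.0, 0.0, 0.0]
one_lag = multi_lag_recurrence(xs, {1: 0.5})            # only the immediate predecessor
two_lag = multi_lag_recurrence(xs, {1: 0.5, 2: 0.25})   # two past states
print(one_lag)  # [1.0, 0.5, 0.25, 0.125, 0.0625]
print(two_lag)  # [1.0, 0.5, 0.5, 0.375, 0.3125]
```

In this sketch, the single-lag response decays geometrically, while the two-lag response takes a different shape. The cost is that decoding must keep as many past states as the largest lag, which is one simple way to see a trade-off between cache size and expressivity.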