🤖 AI Summary
To address the training inefficiency of DeltaNet caused by its sequential, dimension-wise state updates, this work proposes a hardware-friendly parallel training algorithm based on a compact representation of products of Householder matrices, enabling, for the first time, full sequence-length parallelization of delta-rule linear Transformers. Methodologically, the approach combines delta-rule state updates, a memory-efficient Householder parameterization, and hybrid attention (sliding-window or global), balancing modeling capacity with computational efficiency. Experiments on a 1.3B-parameter model trained on 100B tokens show significant improvements in perplexity and zero-shot downstream task performance over linear sequence-modeling baselines such as Mamba and GLA, while the hybrid variants further surpass strong Transformer baselines. These results validate both the effectiveness and scalability of the proposed parallelization strategy and the hybrid architectures.
📝 Abstract
Transformers with linear attention (i.e., linear transformers) and state-space models have recently been suggested as viable linear-time alternatives to transformers with softmax attention. However, these models still underperform transformers, especially on tasks that require in-context retrieval. While more expressive variants of linear transformers that replace the additive update with the delta rule (DeltaNet) have been found to be more effective at associative recall, existing algorithms for training such models do not parallelize over sequence length and are thus inefficient to train on modern hardware. This work describes a hardware-efficient algorithm for training linear transformers with the delta rule, which exploits a memory-efficient representation for computing products of Householder matrices. This algorithm allows us to scale DeltaNet up to standard language-modeling settings. We train a 1.3B-parameter model on 100B tokens and find that it outperforms recent linear-time baselines such as Mamba and GLA in terms of perplexity and zero-shot performance on downstream tasks. We also experiment with two hybrid models that combine DeltaNet layers with (1) sliding-window attention layers in every other layer or (2) two global attention layers, and find that these hybrids outperform strong transformer baselines.
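To make the delta-rule update concrete, below is a minimal NumPy sketch of the *sequential* recurrence that the paper's algorithm parallelizes (the sketch itself is naive and not the hardware-efficient method). It uses the standard delta-rule form, where the state `S` is right-multiplied by a Householder-like matrix `I - beta_t * k_t k_t^T` plus a rank-1 write; the function name, shapes, and the assumption that keys are L2-normalized are illustrative choices, not taken from the paper's code.

```python
import numpy as np

def delta_rule_recurrence(q, k, v, beta):
    """Naive sequential delta-rule linear attention (illustrative sketch).

    q, k: (T, d_k) queries and keys (keys assumed L2-normalized);
    v:    (T, d_v) values;
    beta: (T,) per-step write strengths in [0, 1].
    Returns per-step outputs of shape (T, d_v).
    """
    T, d_k = k.shape
    d_v = v.shape[1]
    S = np.zeros((d_v, d_k))   # associative memory mapping keys -> values
    out = np.zeros((T, d_v))
    for t in range(T):
        # Retrieve the value currently stored under key k_t ...
        v_old = S @ k[t]
        # ... and move it toward v_t (the delta rule). Algebraically this is
        # S <- S (I - beta_t k_t k_t^T) + beta_t v_t k_t^T,
        # i.e. a (generalized) Householder transform plus a rank-1 write.
        S = S + beta[t] * np.outer(v[t] - v_old, k[t])
        out[t] = S @ q[t]
    return out
```

With orthonormal keys and `beta = 1`, querying with a previously written key retrieves exactly the stored value, and rewriting the same key replaces the old value rather than adding to it; this overwrite behavior is what distinguishes the delta rule from the purely additive update of vanilla linear attention.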