🤖 AI Summary
This work addresses a limitation of existing linear attention mechanisms: they use the query only for readout and exclude it from memory state evolution, which constrains model expressivity. The authors propose Q-Delta, a query-aware delta rule that incorporates query information directly into state evolution. Mixed key-query prediction errors drive the memory update, improving expressiveness while preserving the computational efficiency of the delta rule. The paper establishes stability guarantees for the resulting dynamics and derives a hardware-efficient chunkwise-parallel formulation with a custom Triton implementation. Experiments report stable optimization, competitive throughput, and consistent improvements over strong baselines on language modeling and long-context retrieval tasks.
📝 Abstract
Linear attention reformulates sequence modeling as recurrent state evolution, enabling efficient linear-time inference. Under the key-value associative paradigm, existing approaches restrict the role of the query to the readout operation, decoupling it from state evolution. We show that query-conditioned state readout induces a structured value prediction over accumulated memory that complements key-based retrieval. Based on this insight, we propose Q-Delta, a query-aware delta rule that integrates mixed key-query prediction errors into state evolution, enabling jointly corrective dynamics while preserving delta-rule efficiency. We establish stability guarantees for the resulting dynamics and derive a hardware-efficient chunkwise-parallel formulation with a custom Triton implementation. Empirical results demonstrate stable optimization, competitive throughput, and consistent improvements over strong baselines on language modeling and long-context retrieval tasks.
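For orientation, the baseline that the abstract builds on can be sketched in a few lines. The code below is a minimal pure-Python illustration of the standard delta rule, S ← S + β (v − S k) kᵀ, with the query used only for readout. It is not the Q-Delta update, whose exact form the abstract does not state. The function names (`matvec`, `delta_rule_step`, `run_delta_rule`), the scalar step size `beta`, and the list-of-lists state layout are assumptions chosen for readability, not taken from the paper.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(S: Matrix, x: Vector) -> Vector:
    """Return S @ x for a (d_v x d_k) state S and a d_k-dimensional vector x."""
    return [sum(s_ij * x_j for s_ij, x_j in zip(row, x)) for row in S]


def delta_rule_step(S: Matrix, k: Vector, v: Vector, beta: float) -> Matrix:
    """One standard delta-rule update: S <- S + beta * (v - S k) k^T."""
    pred = matvec(S, k)  # value the memory currently predicts for key k
    err = [v_i - p_i for v_i, p_i in zip(v, pred)]  # key-based prediction error
    return [
        [s_ij + beta * e_i * k_j for s_ij, k_j in zip(row, k)]
        for row, e_i in zip(S, err)
    ]


def run_delta_rule(
    keys: List[Vector],
    values: List[Vector],
    queries: List[Vector],
    beta: float = 1.0,
) -> Tuple[Matrix, List[Vector]]:
    """Scan a sequence; the query reads the state but never changes it."""
    d_v, d_k = len(values[0]), len(keys[0])
    S = [[0.0] * d_k for _ in range(d_v)]  # empty memory
    outputs = []
    for k, v, q in zip(keys, values, queries):
        S = delta_rule_step(S, k, v, beta)
        outputs.append(matvec(S, q))  # readout: value prediction induced by q
    return S, outputs


# Two orthonormal keys, each bound to a value; queries equal the keys.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]
S, outputs = run_delta_rule(keys, values, queries=keys, beta=1.0)

# Rewriting the first key with a new value corrects the stored association.
S_overwritten = delta_rule_step(S, [1.0, 0.0], [5.0, 6.0], beta=1.0)
```

In this sketch the readout `matvec(S, q)` is the query-conditioned value prediction the abstract refers to, yet it has no influence on how `S` changes. That decoupling is the gap the abstract says Q-Delta closes by mixing query information into the prediction error.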