ELASTIC: Efficient Linear Attention for Sequential Interest Compression

📅 2024-08-18
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the prohibitively high computational and memory overhead of Transformer self-attention in modeling long user behavior sequences, this paper proposes ELASTIC, a linear-complexity attention mechanism for sequential interest compression. Methodologically, it introduces (1) a learnable linear dispatcher attention module that compresses raw behavioral sequences into fixed-length interest representations, and (2) a sparsely retrieved learnable interest memory bank, effectively decoupling model expressiveness from computational cost. Evaluated on multiple public benchmarks, ELASTIC maintains competitive recommendation accuracy while reducing GPU memory consumption by up to 90% and accelerating inference by 2.7× compared to standard attention baselines. These gains significantly enhance the scalability and practicality of modeling ultra-long user behavior sequences in real-world recommender systems.

📝 Abstract
State-of-the-art sequential recommendation models heavily rely on the Transformer's attention mechanism. However, the quadratic computational and memory complexity of self-attention has limited its scalability for modeling users' long-range behaviour sequences. To address this problem, we propose ELASTIC, an Efficient Linear Attention for SequenTial Interest Compression, requiring only linear time complexity and decoupling model capacity from computational cost. Specifically, ELASTIC introduces fixed-length interest experts with a linear dispatcher attention mechanism, which compresses long-term behaviour sequences into a significantly more compact representation, reducing GPU memory usage by up to 90% with a 2.7× inference speedup. The proposed linear dispatcher attention mechanism significantly reduces the quadratic complexity and makes the model feasible for adequately modeling extremely long sequences. Moreover, in order to retain the capacity for modeling various user interests, ELASTIC initializes a vast learnable interest memory bank and sparsely retrieves compressed user interests from the memory with negligible computational overhead. The proposed interest memory retrieval technique significantly expands the cardinality of the available interest space at the same computational cost, thereby striking a trade-off between recommendation accuracy and efficiency. To validate the effectiveness of our proposed ELASTIC, we conduct extensive experiments on various public datasets and compare it with several strong sequential recommenders. Experimental results demonstrate that ELASTIC consistently outperforms baselines by a significant margin and also highlight the computational efficiency of ELASTIC when modeling long sequences. We will make our implementation code publicly available.
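The abstract's core idea (a fixed number of interest "slots" that attend over the full behaviour sequence, giving a cost linear in sequence length) can be sketched as below. This is a minimal illustrative sketch, not the paper's implementation: the function name, the single-head formulation, and the absence of projections or training logic are all simplifying assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dispatcher_attention(seq, interest_queries):
    """Compress a length-L behaviour sequence into K fixed interest slots.

    seq:              (L, d) item embeddings
    interest_queries: (K, d) learnable slot queries, K fixed and K << L

    Each slot cross-attends over the whole sequence, so the cost is
    O(L * K * d): linear in L, unlike O(L^2) self-attention.
    """
    d = seq.shape[-1]
    scores = interest_queries @ seq.T / np.sqrt(d)  # (K, L) slot-to-item scores
    weights = softmax(scores, axis=-1)              # each slot's attention over items
    return weights @ seq                            # (K, d) compressed interests

rng = np.random.default_rng(0)
L, K, d = 1000, 8, 16
compressed = dispatcher_attention(rng.normal(size=(L, d)),
                                  rng.normal(size=(K, d)))
print(compressed.shape)  # (8, 16): fixed size regardless of L
```

Because the output is always (K, d), any downstream module sees a constant-size input no matter how long the raw behaviour sequence grows.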
Problem

Research questions and friction points this paper is trying to address.

Reduces quadratic complexity of attention mechanisms
Compresses long-term behavior sequences efficiently
Balances recommendation accuracy and computational efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Linear attention reduces complexity
Fixed length interest experts compression
Sparse retrieval from memory bank
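The third innovation, sparse retrieval from a large learnable memory bank, can be illustrated with a small numpy sketch. This is an assumption-laden toy version: the top-k selection, the softmax gating over selected slots, and all names are illustrative, not the paper's exact formulation.

```python
import numpy as np

def sparse_memory_retrieval(interests, memory, top_k=2):
    """Retrieve a sparse mixture of memory slots for each compressed interest.

    interests: (K, d) compressed interest representations
    memory:    (N, d) learnable interest memory bank; N may be very large

    Only top_k of the N slots contribute per interest, so compute stays
    near-constant while the available interest space (N) grows.
    """
    sims = interests @ memory.T                                # (K, N) similarities
    idx = np.argpartition(-sims, top_k - 1, axis=1)[:, :top_k]  # top_k slot ids per row
    picked = np.take_along_axis(sims, idx, axis=1)             # (K, top_k) scores
    gates = np.exp(picked - picked.max(axis=1, keepdims=True))
    gates = gates / gates.sum(axis=1, keepdims=True)           # softmax over chosen slots
    return np.einsum('kt,ktd->kd', gates, memory[idx])         # (K, d) retrieved interests

rng = np.random.default_rng(1)
out = sparse_memory_retrieval(rng.normal(size=(4, 8)),   # 4 compressed interests
                              rng.normal(size=(100, 8)), # bank of 100 slots
                              top_k=3)
print(out.shape)  # (4, 8)
```

The design point is that enlarging N enriches the interest vocabulary without changing the per-query cost of the gated mixture, which is what lets capacity scale independently of compute.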
Jiaxin Deng
KuaiShou Inc., Beijing, China
Shiyao Wang
KuaiShou Inc., Beijing, China
Song Lu
KuaiShou Inc., Beijing, China
Yinfeng Li
KuaiShou Inc., Beijing, China
Xinchen Luo
KuaiShou Inc.
Yuanjun Liu
KuaiShou Inc., Beijing, China
Peixing Xu
KuaiShou Inc., Beijing, China
Guorui Zhou
Unknown affiliation
Recommender System, Advertising, Artificial Intelligence, Machine Learning, NLP