🤖 AI Summary
To address the excessive KV-cache memory overhead of long-context inference, which limits the scalability of large language models, this paper proposes Tensor Product Attention (TPA) and the associated T6 model architecture. The core innovation is a context-aware tensor factorization mechanism that integrates low-rank contextual decomposition with RoPE positional encoding, enabling effective long-sequence modeling while substantially compressing the KV cache. Experiments show that T6 consistently outperforms standard multi-head attention (MHA), multi-query attention (MQA), grouped-query attention (GQA), and multi-head latent attention (MLA) on perplexity and several widely used long-context benchmarks (e.g., LRA, LongBench). Under identical hardware constraints, T6 supports significantly longer sequences and reduces KV-cache memory consumption by 40%–65%, markedly improving inference efficiency and model scalability.
📝 Abstract
Scaling language models to handle longer input sequences typically necessitates large key-value (KV) caches, resulting in substantial memory overhead during inference. In this paper, we propose Tensor Product Attention (TPA), a novel attention mechanism that uses tensor decompositions to represent queries, keys, and values compactly, significantly shrinking the KV cache size at inference time. By factorizing these representations into contextual low-rank components (contextual factorization) and seamlessly integrating with RoPE, TPA achieves improved model quality alongside memory efficiency. Based on TPA, we introduce the Tensor ProducT ATTenTion Transformer (T6), a new model architecture for sequence modeling. Through extensive empirical evaluation on language modeling tasks, we demonstrate that T6 exceeds the performance of standard Transformer baselines including MHA, MQA, GQA, and MLA across various metrics, including perplexity and a range of well-known evaluation benchmarks. Notably, TPA's memory efficiency enables the processing of significantly longer sequences under fixed resource constraints, addressing a critical scalability challenge in modern language models. The code is available at https://github.com/tensorgi/T6.
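To make the factorization concrete, here is a minimal NumPy sketch of the core idea behind TPA's contextual low-rank decomposition: each token's per-head key (and analogously, query and value) is expressed as a sum of rank-1 tensor products of context-dependent factors, and the cache stores only those factors instead of the full key tensor. All names, dimensions, and the chosen rank below are illustrative, and RoPE integration is omitted; see the paper and repository for the actual formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (not the paper's configuration).
d_model, n_heads, d_head, rank = 64, 4, 16, 2

# Linear maps producing the contextual factors (hypothetical names).
W_a = rng.standard_normal((d_model, rank * n_heads)) / np.sqrt(d_model)
W_b = rng.standard_normal((d_model, rank * d_head)) / np.sqrt(d_model)

def tpa_key(x):
    """Factorize a token's key tensor as a sum of rank-1 tensor products:
    K(x) = (1/rank) * sum_r a_r(x) (outer) b_r(x),
    with a_r in R^{n_heads} (head factors) and b_r in R^{d_head}."""
    a = (x @ W_a).reshape(rank, n_heads)  # context-dependent head factors
    b = (x @ W_b).reshape(rank, d_head)   # context-dependent content factors
    # Reconstruct the full (n_heads, d_head) key on the fly at attention time.
    K = np.einsum('rh,rd->hd', a, b) / rank
    return a, b, K

x = rng.standard_normal(d_model)  # one token's hidden state
a, b, K = tpa_key(x)

# The cache stores the factors rather than the full per-token key:
cached = a.size + b.size  # rank * (n_heads + d_head) = 40 floats
full = n_heads * d_head   # 64 floats for an uncompressed key
print(cached, full)       # 40 64
```

The memory saving per token scales as `rank * (n_heads + d_head)` versus `n_heads * d_head`, so small ranks shrink the cache substantially while the full key is cheaply rebuilt with one contraction per attention call.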