Tucker Attention: A generalization of approximate attention mechanisms

📅 2026-03-31
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the high memory overhead of multi-head self-attention by proposing Tucker Attention, a low-rank approximation method based on Tucker tensor decomposition. The approach unifies Multi-Head Attention (MHA), Grouped-Query Attention (GQA), and Multi-Head Latent Attention (MLA) as special cases, revealing their underlying rank structure. Tucker Attention is compatible with FlashAttention and rotary positional embeddings (RoPE), offering both architectural generality and theoretical interpretability. Experiments on large language models (LLMs) and Vision Transformers (ViTs) demonstrate that Tucker Attention achieves performance comparable to GQA and MLA using roughly one-tenth of the parameters, while significantly simplifying the intricate design of MLA.
📝 Abstract
The pursuit of reducing the memory footprint of the self-attention mechanism in multi-head self-attention (MHA) has spawned a rich portfolio of methods, e.g., grouped-query attention (GQA) and multi-head latent attention (MLA). These methods leverage specialized low-rank factorizations across embedding dimensions or attention heads. From the point of view of classical low-rank approximation, they are unconventional and raise the questions of which objects they really approximate and how to interpret the low-rank behavior of the resulting representations. To answer these questions, this work proposes a generalized view on the weight objects in the self-attention layer and a factorization strategy, which allows us to construct a parameter-efficient scheme, called Tucker Attention. Tucker Attention requires an order of magnitude fewer parameters for comparable validation metrics, compared to GQA and MLA, as evaluated in LLM and ViT test cases. Additionally, Tucker Attention encompasses GQA, MLA, and MHA as special cases and is fully compatible with FlashAttention and rotary position embeddings (RoPE). This generalization strategy yields insights into the actual ranks achieved by MHA, GQA, and MLA, and further enables simplifications for MLA.
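The abstract's core idea — approximating per-head attention projection weights with a Tucker decomposition — can be illustrated with a minimal numerical sketch. This is not the paper's exact construction; it only shows, under assumed shapes and hypothetical ranks, how a third-order projection tensor `W` of shape `(heads, d_model, d_head)` (as in standard MHA) factors into a small core tensor plus three factor matrices, and how that cuts the parameter count.

```python
import numpy as np

# Illustrative sketch, not the paper's method: standard MHA stores a
# per-head projection tensor W of shape (heads, d_model, d_head).
# A Tucker decomposition replaces it with a small core G and one factor
# matrix per mode:  W ≈ G ×1 A ×2 B ×3 C.

heads, d_model, d_head = 8, 512, 64   # assumed MHA dimensions
r1, r2, r3 = 4, 64, 32                # hypothetical Tucker ranks

rng = np.random.default_rng(0)
G = rng.standard_normal((r1, r2, r3))   # core tensor
A = rng.standard_normal((heads, r1))    # head-mode factor
B = rng.standard_normal((d_model, r2))  # embedding-mode factor
C = rng.standard_normal((d_head, r3))   # head-dimension-mode factor

# Reconstruct the full projection tensor via mode-n products
# (einsum contracts each core index against its factor matrix).
W = np.einsum('abc,ha,db,kc->hdk', G, A, B, C)
assert W.shape == (heads, d_model, d_head)

full_params = heads * d_model * d_head
tucker_params = G.size + A.size + B.size + C.size
print(f"dense MHA weights: {full_params}, Tucker factors: {tucker_params}")
```

Setting the ranks to their maximal values recovers the dense tensor exactly, which is consistent with the abstract's claim that MHA, GQA, and MLA arise as special (rank-constrained) cases of the same factorization.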
Problem

Research questions and friction points this paper is trying to address.

self-attention
low-rank approximation
multi-headed attention
Tucker decomposition
parameter efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tucker Attention
low-rank approximation
multi-head attention
parameter efficiency
tensor factorization