Compressible Softmax-Attended Language under Incompressible Attention

📅 2026-04-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study reveals a fundamental discrepancy between the theoretical dimensionality of language representations in transformer attention mechanisms and their practical compressibility. Through singular value decomposition and spectral analysis, the work demonstrates for the first time that linguistic compressibility stems from the data itself rather than from the analytical framework, and quantifies a substantial gap between effective rank and model dimensionality. Across five models ranging from 124M to 7B parameters, 90% of the logit energy variance is captured by only 2–11 singular components, whereas the learned interaction matrices require 38–75 components, an effective-rank disparity of 5× to 25×. This finding suggests that attention mechanisms do not efficiently use the representational space in which they operate.
📝 Abstract
Across every attention head in five transformer language models (124M--7B parameters, four architecture families), the logit energy field $\tilde{E}$ reaches 90\% of its variance in 2--11 singular components. The \emph{learned} interaction matrix $W_Q^\mathrm{T} W_K$ needs 38--75 components for the same threshold out of $d_h \in \{64, 128\}$. The spectral gap is $5$--$25\times$ in effective rank. The attention mechanism allocates capacity uniformly across all $d_h$ dimensions, but language concentrates the actual interaction into a few. The compressibility of softmax-attended language is a property of the data, not the frame that analyzes it.
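The abstract's central measurement, the number of singular components needed to reach 90% of a matrix's variance, can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the paper's code: the random matrices below are hypothetical stand-ins for the learned interaction matrix $W_Q^\mathrm{T} W_K$ (near full rank) and the logit energy field $\tilde{E}$ (approximately low rank), and the 90% threshold follows the abstract.

```python
import numpy as np

def components_for_energy(M, threshold=0.90):
    """Smallest number of singular components whose squared singular
    values capture at least `threshold` of M's total energy."""
    s = np.linalg.svd(M, compute_uv=False)
    energy = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(energy, threshold) + 1)

rng = np.random.default_rng(0)
d_h = 64  # head dimension, as in the paper's d_h ∈ {64, 128}

# Hypothetical stand-in for W_Q^T W_K: a dense Gaussian matrix,
# whose energy is spread across many singular directions.
W = rng.standard_normal((d_h, d_h))

# Hypothetical stand-in for the logit energy field: a rank-4 matrix,
# whose energy concentrates in a handful of components.
E = rng.standard_normal((d_h, 4)) @ rng.standard_normal((4, d_h))

print(components_for_energy(W))  # large fraction of d_h
print(components_for_energy(E))  # at most 4
```

The ratio of the two counts is the "effective rank" gap the abstract reports as 5–25× across heads.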
Problem

Research questions and friction points this paper is trying to address.

compressibility
attention mechanism
language models
spectral gap
effective rank
Innovation

Methods, ideas, or system contributions that make the work stand out.

compressible attention
logit energy field
spectral gap
effective rank
softmax-attended language