🤖 AI Summary
This study reveals a fundamental discrepancy between the theoretical dimensionality of language representations in Transformer attention mechanisms and their practical compressibility. Through singular value decomposition and spectral analysis, the work demonstrates for the first time that linguistic compressibility stems from the data itself rather than the analytical framework, and quantifies a substantial gap between the effective rank of the attention interaction and the head dimensionality. Across five models ranging from 124M to 7B parameters, 90% of the logit energy variance is captured by only 2–11 singular components, whereas the learned interaction matrices require 38–75 components, an effective rank disparity of 5× to 25×. This finding suggests that attention mechanisms do not efficiently use the representational space they operate in.
📝 Abstract
Across every attention head in five transformer language models (124M–7B parameters, four architecture families), the logit energy field $\tilde{E}$ reaches 90% of its variance in 2–11 singular components. The *learned* interaction matrix $W_Q^\mathrm{T} W_K$ needs 38–75 of its $d_h \in \{64, 128\}$ components to reach the same threshold. The spectral gap is 5–25× in effective rank. The attention mechanism allocates capacity uniformly across all $d_h$ dimensions, but language concentrates the actual interaction into a few. The compressibility of softmax-attended language is a property of the data, not the frame that analyzes it.
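The core measurement behind these numbers is simple: take the singular values of a matrix and count how many components are needed before their cumulative squared magnitude passes 90% of the total. A minimal NumPy sketch is below; the matrices are synthetic stand-ins (a dense Gaussian matrix for a learned $W_Q^\mathrm{T} W_K$-style interaction, and a noisy low-rank matrix for a concentrated energy field), not the paper's actual model weights or data.

```python
import numpy as np

def components_for_variance(M, threshold=0.90):
    """Count the singular components needed to capture `threshold`
    of the total variance (sum of squared singular values) of M."""
    s = np.linalg.svd(M, compute_uv=False)
    energy = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(energy, threshold) + 1)

rng = np.random.default_rng(0)
d_h = 64  # illustrative head dimension

# Dense Gaussian matrix: a stand-in for a learned interaction matrix.
# Its spectrum is flat, so many components are needed.
W = rng.normal(size=(d_h, d_h))

# Noisy low-rank matrix: a stand-in for a concentrated energy field.
# A few dominant directions carry almost all of the variance.
U = rng.normal(size=(d_h, 3))
E = U @ U.T + 0.01 * rng.normal(size=(d_h, d_h))

print(components_for_variance(W))  # a large fraction of d_h
print(components_for_variance(E))  # a handful
```

Applied to real attention heads, the same counting procedure on $\tilde{E}$ versus $W_Q^\mathrm{T} W_K$ yields the 2–11 versus 38–75 figures quoted above.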