🤖 AI Summary
This study reveals a fundamental discrepancy between the theoretical dimensionality of language representations in Transformer attention mechanisms and their practical compressibility. Through singular value decomposition and spectral analysis, the work demonstrates for the first time that linguistic compressibility stems from the data itself rather than the analytical framework, and quantifies a substantial gap between the effective rank of the attention interaction and the head dimensionality. Across five models ranging from 124M to 7B parameters, 90% of the logit energy variance is captured by only 2–11 singular components, whereas the learned interaction matrices require 38–75 components, an effective rank disparity of 5× to 25×. This finding suggests that attention mechanisms do not efficiently use the representational space they operate in.
📝 Abstract
Across every attention head in five transformer language models (124M–7B parameters, four architecture families), the logit energy field $\tilde{E}$ reaches 90% of its variance in 2–11 singular components. The *learned* interaction matrix $W_Q^\mathrm{T} W_K$ needs 38–75 of its $d_h \in \{64, 128\}$ components to reach the same threshold. The spectral gap is 5–25× in effective rank. The attention mechanism allocates capacity uniformly across all $d_h$ dimensions, but language concentrates the actual interaction into a few. The compressibility of softmax-attended language is a property of the data, not the frame that analyzes it.
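The core measurement behind these numbers is simple: take the singular values of a matrix and count how many components are needed before their cumulative squared magnitude passes 90% of the total. A minimal NumPy sketch is below; the matrices are synthetic stand-ins (a dense Gaussian matrix for a learned $W_Q^\mathrm{T} W_K$-style interaction, and a noisy low-rank matrix for a concentrated energy field), not the paper's actual model weights or data.

```python
import numpy as np

def components_for_variance(M, threshold=0.90):
    """Count the singular components needed to capture `threshold`
    of the total variance (sum of squared singular values) of M."""
    s = np.linalg.svd(M, compute_uv=False)
    energy = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(energy, threshold) + 1)

rng = np.random.default_rng(0)
d_h = 64  # illustrative head dimension

# Dense Gaussian matrix: a stand-in for a learned interaction matrix.
# Its spectrum is flat, so many components are needed.
W = rng.normal(size=(d_h, d_h))

# Noisy low-rank matrix: a stand-in for a concentrated energy field.
# A few dominant directions carry almost all of the variance.
U = rng.normal(size=(d_h, 3))
E = U @ U.T + 0.01 * rng.normal(size=(d_h, d_h))

print(components_for_variance(W))  # a large fraction of d_h
print(components_for_variance(E))  # a handful
```

Applied to real attention heads, the same counting procedure on $\tilde{E}$ versus $W_Q^\mathrm{T} W_K$ yields the 2–11 versus 38–75 figures quoted above.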