Singular Vectors of Attention Heads Align with Features

📅 2026-02-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study asks whether singular vectors of attention matrices reflect the internal feature representations of language models, addressing a gap in the theoretical grounding of interpretability research. Combining singular value decomposition, theoretical modeling, and sparse attention decomposition, the work provides the first theoretical justification for the alignment between singular vectors and model features, and derives operational criteria that are testable in real models. Empirical analysis in models with directly observable features confirms this alignment, while theoretical analysis characterizes the general conditions under which it holds. The study further identifies sparse attention structures in real language models that conform to these theoretical predictions, offering a novel perspective and a falsifiable explanatory framework for understanding attention mechanisms.

📝 Abstract
Identifying feature representations in language models is a central task in mechanistic interpretability. Several recent studies have made an implicit assumption that feature representations can be inferred in some cases from singular vectors of attention matrices. However, sound justification for this assumption is lacking. In this paper we address that question, asking: why and when do singular vectors align with features? First, we demonstrate that singular vectors robustly align with features in a model where features can be directly observed. We then show theoretically that such alignment is expected under a range of conditions. We close by asking how, operationally, alignment may be recognized in real models where feature representations are not directly observable. We identify sparse attention decomposition as a testable prediction of alignment, and show evidence that it emerges in a manner consistent with predictions in real models. Together these results suggest that alignment of singular vectors with features can be a sound and theoretically justified basis for feature identification in language models.
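The core claim can be illustrated with a toy construction. The sketch below is not the paper's actual model: it simply builds a low-rank attention interaction matrix (in the spirit of a QK circuit) from a few hypothetical feature directions, and checks that the singular vectors recovered by SVD align with those directions. All names (`W`, `features_in`, `features_out`, `strengths`) and the dimension `d` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # illustrative residual-stream dimension

# Hypothetical "features": orthonormal directions in input and output space.
features_in = np.linalg.qr(rng.normal(size=(d, 3)))[0].T   # 3 input feature directions
features_out = np.linalg.qr(rng.normal(size=(d, 3)))[0].T  # 3 output feature directions

# Toy low-rank interaction matrix: each output feature reads one input
# feature, with well-separated strengths.
strengths = np.array([3.0, 2.0, 1.0])
W = sum(s * np.outer(fo, fi)
        for s, fo, fi in zip(strengths, features_out, features_in))

# SVD of the interaction matrix.
U, S, Vt = np.linalg.svd(W)

# With distinct singular values, the top singular vectors recover the
# feature directions up to sign: alignment should be near 1.0.
for i in range(3):
    cos_left = abs(U[:, i] @ features_out[i])
    cos_right = abs(Vt[i] @ features_in[i])
    print(f"sigma={S[i]:.2f}  left-align={cos_left:.3f}  right-align={cos_right:.3f}")
```

In this idealized setting alignment is exact by construction; the paper's contribution is characterizing when such alignment should be expected in real, noisy models.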
Problem

Research questions and friction points this paper is trying to address.

singular vectors
attention heads
feature representations
mechanistic interpretability
language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

singular vectors
attention heads
feature alignment
sparse attention decomposition
mechanistic interpretability
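One innovation listed above, sparse attention decomposition, can be sketched with a simple identity: a pre-softmax attention score q·Wk decomposes over singular components as Σᵢ σᵢ (q·uᵢ)(vᵢ·k), and when the query and key each align with few singular directions, few terms dominate. The setup below (random `W`, the noise scale, variable names) is an illustrative assumption, not the paper's experimental procedure.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64
W = rng.normal(size=(d, d)) / np.sqrt(d)  # stand-in interaction matrix
U, S, Vt = np.linalg.svd(W)

# Query and key that each align with a single singular direction, plus noise.
q = U[:, 0] + 0.05 * rng.normal(size=d)
k = Vt[0] + 0.05 * rng.normal(size=d)

# Per-component contributions: sigma_i * (q . u_i) * (v_i . k).
contribs = S * (q @ U) * (Vt @ k)

# The decomposition is exact...
assert np.isclose(contribs.sum(), q @ W @ k)
# ...and one component carries most of the score.
print("top component share:", abs(contribs[0]) / np.abs(contribs).sum())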
Gabriel Franco
Department of Computer Science, Boston University, Boston, USA
Carson Loughridge
Department of Computer Science, Boston University, Boston, USA
Mark Crovella
Professor of Computer Science and Computing & Data Sciences, Boston University
Networking, Systems, Data Mining, Statistics, Performance Evaluation