🤖 AI Summary
This work investigates how information is embedded in self-attention matrices and how training objectives shape their structural evolution. We propose the first mathematical framework for self-attention weight updates, integrating matrix calculus and dynamical systems theory. Our analysis reveals that bidirectional pretraining induces weight symmetry, whereas autoregressive training yields column-wise directional dominance, motivating a novel symmetric initialization strategy. We systematically evaluate this approach across diverse architectures (ModernBERT, GPT, LLaMA3, Mistral) and input modalities (text, vision, audio). Symmetric initialization consistently improves language understanding in encoder-only models (average +1.8% GLUE score) and enhances attention pattern interpretability. Our core contribution is a unified characterization of Transformer dynamics across modalities, establishing an interpretable mapping from training objective → weight symmetry → model performance.
📝 Abstract
Self-attention is essential to Transformer architectures, yet how information is embedded in the self-attention matrices and how different objective functions impact this process remain unclear. We present a mathematical framework for analyzing self-attention matrices by deriving the structures governing their weight updates. Using this framework, we demonstrate that bidirectional training induces symmetry in the weight matrices, while autoregressive training results in directionality and column dominance. Our theoretical findings are validated across multiple Transformer models (including ModernBERT, GPT, LLaMA3, and Mistral) and input modalities such as text, vision, and audio. Finally, we apply these insights by showing that symmetric initialization improves the performance of encoder-only models on language tasks. This mathematical analysis offers a novel theoretical perspective on how information is embedded through self-attention, thereby improving the interpretability of Transformer models.
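To make the symmetric-initialization idea concrete, here is a minimal NumPy sketch. It assumes "symmetry" refers to the combined query-key matrix W_Q W_K^T being symmetric at initialization; the simple tying scheme below (`w_k = w_q`) is one illustrative way to achieve that, not necessarily the paper's exact recipe.

```python
import numpy as np

def symmetric_qk_init(d_model, d_head, seed=None):
    """Hypothetical sketch: initialize query/key projection matrices so
    that the combined attention matrix W_Q @ W_K.T is symmetric at the
    start of training. Tying W_K to W_Q is one simple construction."""
    rng = np.random.default_rng(seed)
    # Standard scaled Gaussian initialization for the query projection.
    w_q = rng.normal(0.0, d_model ** -0.5, size=(d_model, d_head))
    # Tying keys to queries makes W_Q @ W_K.T = W_Q @ W_Q.T, which is
    # symmetric by construction.
    w_k = w_q.copy()
    return w_q, w_k

w_q, w_k = symmetric_qk_init(d_model=64, d_head=16, seed=0)
w_qk = w_q @ w_k.T
assert np.allclose(w_qk, w_qk.T)  # the combined matrix is symmetric
```

After initialization, training proceeds as usual; the claim in the abstract is that starting from this symmetric structure, which bidirectional training tends to induce anyway, improves encoder-only models on language tasks.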