🤖 AI Summary
This work investigates how information is embedded in self-attention matrices and how training objectives shape their structural evolution. We propose the first mathematical framework for self-attention weight updates, integrating matrix calculus and dynamical systems theory. Our analysis reveals that bidirectional pretraining induces weight symmetry, whereas autoregressive training yields column-wise directional dominance, motivating a novel symmetric initialization strategy. We systematically evaluate this approach across diverse architectures (ModernBERT, GPT, LLaMA3, Mistral) and input modalities (text, vision, audio). Symmetric initialization consistently improves language understanding in encoder-only models (average +1.8% GLUE score) and enhances attention pattern interpretability. Our core contribution is a unified characterization of Transformer dynamics across modalities, establishing an interpretable mapping from training objective → weight symmetry → model performance.
📝 Abstract
Self-attention is essential to Transformer architectures, yet how information is embedded in the self-attention matrices and how different objective functions impact this process remain unclear. We present a mathematical framework for analyzing self-attention matrices by deriving the structures governing their weight updates. Using this framework, we demonstrate that bidirectional training induces symmetry in the weight matrices, while autoregressive training results in directionality and column dominance. Our theoretical findings are validated across multiple Transformer models (including ModernBERT, GPT, LLaMA3, and Mistral) and input modalities such as text, vision, and audio. Finally, we apply these insights by showing that symmetric initialization improves the performance of encoder-only models on language tasks. This mathematical analysis offers a novel theoretical perspective on how information is embedded through self-attention, thereby improving the interpretability of Transformer models.
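To make the symmetric-initialization idea concrete, here is a minimal NumPy sketch. It assumes "symmetry" refers to the combined query-key matrix W_Q W_K^T being symmetric at initialization; the simple tying scheme below (`w_k = w_q`) is one illustrative way to achieve that, not necessarily the paper's exact recipe.

```python
import numpy as np

def symmetric_qk_init(d_model, d_head, seed=None):
    """Hypothetical sketch: initialize query/key projection matrices so
    that the combined attention matrix W_Q @ W_K.T is symmetric at the
    start of training. Tying W_K to W_Q is one simple construction."""
    rng = np.random.default_rng(seed)
    # Standard scaled Gaussian initialization for the query projection.
    w_q = rng.normal(0.0, d_model ** -0.5, size=(d_model, d_head))
    # Tying keys to queries makes W_Q @ W_K.T = W_Q @ W_Q.T, which is
    # symmetric by construction.
    w_k = w_q.copy()
    return w_q, w_k

w_q, w_k = symmetric_qk_init(d_model=64, d_head=16, seed=0)
w_qk = w_q @ w_k.T
assert np.allclose(w_qk, w_qk.T)  # the combined matrix is symmetric
```

After initialization, training proceeds as usual; the claim in the abstract is that starting from this symmetric structure, which bidirectional training tends to induce anyway, improves encoder-only models on language tasks.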