Impact of Layer Norm on Memorization and Generalization in Transformers

📅 2025-11-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work systematically investigates how LayerNorm differentially affects training stability, memorization behavior, and generalization in Pre- versus Post-LayerNorm Transformer architectures. Through gradient flow analysis, ablation of LayerNorm parameters, and dynamic modeling of label-fitting trajectories, the authors conduct cross-architectural empirical studies across 13 Transformer variants and 6 vision and language datasets. The findings are: (1) in Pre-LayerNorm Transformers, LayerNorm primarily ensures training stability, especially in early layers; (2) in Post-LayerNorm Transformers, it functions predominantly as a regularizer that suppresses memorization-driven overfitting and improves generalization; (3) LayerNorm's effect is highly sensitive to its placement and the overall architectural design, with no universally optimal configuration. The study is the first to uncover LayerNorm's structural role in mediating the memorization–generalization trade-off, offering new insight into Transformer architecture design and the mechanistic understanding of implicit regularization.

📝 Abstract
Layer Normalization (LayerNorm) is one of the fundamental components in transformers that stabilizes training and improves optimization. Recently, Pre-LayerNorm transformers have become the preferred choice over Post-LayerNorm transformers due to their stable gradient flow. However, the impact of LayerNorm on learning and memorization across these architectures remains unclear. In this work, we investigate how LayerNorm influences memorization and learning in Pre- and Post-LayerNorm transformers. We identify that LayerNorm serves as a key factor for stable learning in Pre-LayerNorm transformers, while in Post-LayerNorm transformers it primarily affects memorization. Our analysis reveals that eliminating LayerNorm parameters in Pre-LayerNorm models exacerbates memorization and destabilizes learning, while in Post-LayerNorm models it effectively mitigates memorization by restoring genuine labels. We further show that LayerNorm in early layers is more critical than in middle or later layers, and that its influence varies between Pre- and Post-LayerNorm models. We validate these findings on 13 models across 6 vision and language datasets. These insights shed new light on the role of LayerNorm in shaping memorization and learning in transformers.
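The placement difference the abstract hinges on can be made concrete. The following minimal NumPy sketch (an illustration, not the paper's code; `sublayer` stands in for attention or an MLP) shows the two block orderings: Pre-LN normalizes the sublayer input and leaves the residual path untouched, while Post-LN normalizes the residual sum itself.

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize over the feature dimension, then scale and shift."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def pre_ln_block(x, sublayer):
    # Pre-LN: normalize before the sublayer; the identity (residual)
    # path stays unnormalized, which is what keeps gradients stable.
    return x + sublayer(layer_norm(x))

def post_ln_block(x, sublayer):
    # Post-LN: run the sublayer first, then normalize the residual sum,
    # so LayerNorm directly reshapes the block's output statistics.
    return layer_norm(x + sublayer(x))
```

With an identity sublayer one can check the key difference: the Post-LN output is always zero-mean per token, while the Pre-LN output retains the raw residual stream.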
Problem

Research questions and friction points this paper is trying to address.

Investigates LayerNorm's impact on memorization and generalization in transformers
Compares Pre-LayerNorm versus Post-LayerNorm transformer architectures' learning behaviors
Identifies LayerNorm's role in stable learning versus memorization across different models
Innovation

Methods, ideas, or system contributions that make the work stand out.

LayerNorm stabilizes learning in Pre-LayerNorm transformers
Removing LayerNorm parameters mitigates memorization in Post-LayerNorm
LayerNorm in early layers is the most critical across architectures