🤖 AI Summary
This work investigates how normalization schemes govern the evolution of token representations in deep Transformers, focusing on clustering dynamics and representation collapse. The authors propose a differential-geometric framework grounded in spherical particle dynamics, formalizing inter-layer representation propagation as an interacting particle system on the unit sphere; this perspective reveals normalization's role as a "velocity regulator" in attention dynamics. A unified theoretical analysis of six normalization variants (Post-LN, Pre-LN, Mix-LN, Peri-LN, nGPT, and LN-Scaling) characterizes their distinct impacts on representation structure. The analysis identifies Peri-LN as striking the best balance between convergence speed and representation diversity, effectively mitigating deep-layer collapse, and empirical evaluation confirms Peri-LN's superior generalization across language modeling and diverse downstream tasks. The study provides principled, geometry-informed guidance for normalization design in Transformer architectures.
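As a sketch of the kind of dynamics this framework studies (the paper's exact equations are not reproduced here; what follows is one common interacting-particle formalization of self-attention on the sphere, with the inverse temperature $\beta$ as an illustrative parameter), each token $x_i(t) \in \mathbb{S}^{d-1}$ evolves as

$$
\dot{x}_i(t) = P^{\perp}_{x_i(t)}\!\left(\frac{1}{Z_i(t)} \sum_{j=1}^{n} e^{\beta \langle x_i(t),\, x_j(t) \rangle}\, x_j(t)\right),
\qquad
Z_i(t) = \sum_{k=1}^{n} e^{\beta \langle x_i(t),\, x_k(t) \rangle},
$$

where $P^{\perp}_{x} = I - x x^{\top}$ projects onto the tangent space at $x$, keeping particles on the sphere. In this picture, a normalization scheme effectively rescales the right-hand side, i.e., regulates how fast the particles move, which shapes how quickly tokens cluster and, in the deep-layer limit, collapse.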
📝 Abstract
We study the effect of normalization schemes on token representations in deep Transformers. Modeling their evolution as interacting particles on the sphere, we show that normalization acts as a form of speed regulation. This perspective enables a unified analysis of six schemes (Post-LN, Pre-LN, Mix-LN, Peri-LN, nGPT, and LN-Scaling), revealing how they influence clustering dynamics and representation collapse. Our framework clarifies how different schemes shape token representations across layers and provides a principled basis for comparing them, identifying Peri-LN as a particularly effective choice.
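To make the speed-regulation intuition concrete, here is a minimal numerical sketch (not the authors' code) of tokens as particles on the unit sphere evolving under discretized softmax-attention dynamics; the names and all constants (`n`, `d`, `beta`, `steps`, `speed`) are illustrative assumptions, with `speed` standing in for the per-layer scale that normalization controls:

```python
# Minimal sketch (not the authors' implementation): token representations as
# particles on the unit sphere S^{d-1}, evolving under softmax self-attention.
# A per-step scale factor plays normalization's "speed regulation" role;
# beta, n, d, steps, and speed are illustrative choices, not paper values.
import numpy as np

rng = np.random.default_rng(0)
n, d, beta, steps, speed = 32, 16, 4.0, 200, 0.1

# Random tokens, projected onto the unit sphere.
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)

for _ in range(steps):
    # Softmax attention weights from pairwise inner products.
    logits = beta * (X @ X.T)
    A = np.exp(logits - logits.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)
    V = A @ X  # attention-weighted average of the tokens

    # Project the drift onto the tangent space at each particle,
    # so the update moves along the sphere rather than off it.
    drift = V - np.sum(V * X, axis=1, keepdims=True) * X

    # "Speed regulation": the step scale caps how fast tokens move.
    X = X + speed * drift
    X /= np.linalg.norm(X, axis=1, keepdims=True)  # stay on the sphere

# Mean pairwise cosine similarity -> 1 signals representation collapse.
cos = X @ X.T
print("mean pairwise cosine:", (cos.sum() - n) / (n * (n - 1)))
```

In this toy run the tokens drift toward a single cluster (mean pairwise cosine approaching 1); lowering `speed` delays that collapse across more steps, i.e., more layers, which is the lever the different normalization placements effectively control in this framework.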