🤖 AI Summary
This work investigates how normalization schemes govern the evolution of token representations in deep Transformers, focusing on clustering dynamics and representation collapse. The authors propose a differential-geometric framework grounded in spherical particle dynamics, formalizing inter-layer representation propagation as an interacting particle system on the unit sphere; this perspective reveals normalization's role as a "velocity regulator" in attention dynamics. A unified theoretical analysis of six normalization variants (Post-LN, Pre-LN, Mix-LN, Peri-LN, nGPT, and LN-Scaling) characterizes their distinct impacts on representation structure. The analysis identifies Peri-LN as striking the best balance between convergence speed and representation diversity, effectively mitigating deep-layer collapse, and empirical evaluation confirms Peri-LN's superior generalization across language modeling and diverse downstream tasks. The study provides principled, geometry-informed guidance for normalization design in Transformer architectures.
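As a sketch of the kind of dynamics this framework studies (the paper's exact equations are not reproduced here; what follows is one common interacting-particle formalization of self-attention on the sphere, with the inverse temperature $\beta$ as an illustrative parameter), each token $x_i(t) \in \mathbb{S}^{d-1}$ evolves as

$$
\dot{x}_i(t) = P^{\perp}_{x_i(t)}\!\left(\frac{1}{Z_i(t)} \sum_{j=1}^{n} e^{\beta \langle x_i(t),\, x_j(t) \rangle}\, x_j(t)\right),
\qquad
Z_i(t) = \sum_{k=1}^{n} e^{\beta \langle x_i(t),\, x_k(t) \rangle},
$$

where $P^{\perp}_{x} = I - x x^{\top}$ projects onto the tangent space at $x$, keeping particles on the sphere. In this picture, a normalization scheme effectively rescales the right-hand side, i.e., regulates how fast the particles move, which shapes how quickly tokens cluster and, in the deep-layer limit, collapse.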
📝 Abstract
We study the effect of normalization schemes on token representations in deep Transformers. Modeling their evolution as interacting particles on the sphere, we show that normalization acts as a form of speed regulation. This perspective enables a unified analysis of six schemes (Post-LN, Pre-LN, Mix-LN, Peri-LN, nGPT, and LN-Scaling), revealing how they influence clustering dynamics and representation collapse. Our framework clarifies how different schemes shape token representations across layers and provides a principled basis for comparing them, identifying Peri-LN as a particularly effective choice.
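To make the speed-regulation intuition concrete, here is a minimal numerical sketch (not the authors' code) of tokens as particles on the unit sphere evolving under discretized softmax-attention dynamics; the names and all constants (`n`, `d`, `beta`, `steps`, `speed`) are illustrative assumptions, with `speed` standing in for the per-layer scale that normalization controls:

```python
# Minimal sketch (not the authors' implementation): token representations as
# particles on the unit sphere S^{d-1}, evolving under softmax self-attention.
# A per-step scale factor plays normalization's "speed regulation" role;
# beta, n, d, steps, and speed are illustrative choices, not paper values.
import numpy as np

rng = np.random.default_rng(0)
n, d, beta, steps, speed = 32, 16, 4.0, 200, 0.1

# Random tokens, projected onto the unit sphere.
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)

for _ in range(steps):
    # Softmax attention weights from pairwise inner products.
    logits = beta * (X @ X.T)
    A = np.exp(logits - logits.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)
    V = A @ X  # attention-weighted average of the tokens

    # Project the drift onto the tangent space at each particle,
    # so the update moves along the sphere rather than off it.
    drift = V - np.sum(V * X, axis=1, keepdims=True) * X

    # "Speed regulation": the step scale caps how fast tokens move.
    X = X + speed * drift
    X /= np.linalg.norm(X, axis=1, keepdims=True)  # stay on the sphere

# Mean pairwise cosine similarity -> 1 signals representation collapse.
cos = X @ X.T
print("mean pairwise cosine:", (cos.sum() - n) / (n * (n - 1)))
```

In this toy run the tokens drift toward a single cluster (mean pairwise cosine approaching 1); lowering `speed` delays that collapse across more steps, i.e., more layers, which is the lever the different normalization placements effectively control in this framework.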