Peri-LN: Revisiting Layer Normalization in the Transformer Architecture

📅 2025-02-04
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the open question of layer normalization (LN) placement in large-scale Transformer training. It studies peripheral LN (Peri-LN), a design that applies LN both before and after each sublayer's computation, which several recent open-source models have quietly adopted without much explanation. Unlike the dominant Pre-LN and Post-LN configurations, Peri-LN is shown, through theoretical analysis and large-scale experiments on Transformers up to 3.2B parameters, to yield milder variance growth across layers, more balanced gradient propagation, and more stable activation distributions. The analysis identifies the mechanisms by which Peri-LN mitigates activation explosion and gradient vanishing, thereby improving training stability and convergence, and positions Peri-LN as a principled third alternative in the LN placement taxonomy.

📝 Abstract
Designing Transformer architectures with the optimal layer normalization (LN) strategy that ensures large-scale training stability and expedites convergence has remained elusive, even in this era of large language models (LLMs). To this end, we present a comprehensive analytical foundation for understanding how different LN strategies influence training dynamics in large-scale Transformer training. Until recently, Pre-LN and Post-LN have long dominated standard practices despite their limitations in large-scale training. However, several open-source large-scale models have recently begun silently adopting a third strategy without much explanation. This strategy places layer normalization (LN) peripherally around sublayers, a design we term Peri-LN. While Peri-LN has demonstrated promising empirical performance, its precise mechanisms and benefits remain almost unexplored. Our in-depth analysis shows that Peri-LN strikes an ideal balance in variance growth, unlike Pre-LN and Post-LN, which are prone to vanishing gradients and "massive activations." To validate our theoretical insight, we conduct large-scale experiments on Transformers up to 3.2B parameters, showing that Peri-LN consistently achieves more balanced variance growth, steadier gradient flow, and convergence stability. Our results suggest that Peri-LN warrants broader consideration for large-scale Transformer architectures, providing renewed insights into the optimal placement and application of LN.
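As a rough illustration of the three placements the abstract contrasts, the following minimal NumPy sketch shows Post-LN, Pre-LN, and Peri-LN residual blocks. This is a simplification, not the paper's implementation: it omits the learnable gain and bias of real LN, and `f` stands in for an arbitrary sublayer (attention or MLP).

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the last (feature) dimension; no learnable gain/bias here.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_ln_block(x, f):
    # Post-LN: normalize after the residual addition.
    return layer_norm(x + f(x))

def pre_ln_block(x, f):
    # Pre-LN: normalize only the sublayer input; the residual
    # stream itself is never normalized, so its variance can grow.
    return x + f(layer_norm(x))

def peri_ln_block(x, f):
    # Peri-LN: normalize both the sublayer input and its output,
    # so each residual increment has bounded scale.
    return x + layer_norm(f(layer_norm(x)))
```

Stacking many such blocks with an amplifying sublayer (e.g. `f = lambda v: 3.0 * v`) shows the effect the abstract describes: under Pre-LN the hidden-state variance grows much faster with depth than under Peri-LN, where the output normalization keeps every increment at unit scale.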
Problem

Research questions and friction points this paper is trying to address.

Optimizing layer normalization in Transformers
Ensuring training stability and convergence
Exploring Peri-LN's mechanisms and benefits
Innovation

Methods, ideas, or system contributions that make the work stand out.

Peri-LN optimizes layer normalization placement
Balances variance growth for stable training
Enhances gradient flow and convergence stability
Jeonghoon Kim
NAVER Cloud, Korea Advanced Institute of Science and Technology
Byeongchan Lee
Korea Advanced Institute of Science and Technology
Cheonbok Park
Foundation research, NAVER Cloud
Machine learning · Machine translation · Large language models · Natural language · Time series
Yeontaek Oh
NAVER Cloud
Beomjun Kim
Korea Advanced Institute of Science and Technology
Taehwan Yoo
NAVER Cloud
Seongjin Shin
NAVER Cloud
Dongyoon Han
NAVER AI Lab
Machine learning · Computer vision · Natural language processing
Jinwoo Shin
ICT Endowed Chair Professor
Machine learning · Deep learning
Kang Min Yoo
NAVER Cloud Hyperscale AI
Language models · Natural language processing · Language generation · Data augmentation