Just One Layer Norm Guarantees Stable Extrapolation

📅 2025-05-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Neural networks often extrapolate unstably outside the training distribution because their outputs are unbounded, and existing theories offer no general explanation. Grounded in Neural Tangent Kernel (NTK) theory, this paper establishes for the first time that introducing Layer Norm (LN) in *any single layer* of an infinitely wide network suffices to bound the variance of the induced NTK, thereby guaranteeing globally bounded outputs during extrapolation. In contrast, a broad class of networks without LN faces systematic divergence risks. This reveals LN's fundamental role in suppressing the pathological amplification of distant inputs. Empirical validation on finite-width networks demonstrates that a single LN layer markedly improves extrapolation robustness in predicting residue sizes for proteins larger than those seen during training and in estimating age from facial images of ethnicities absent from the training set, outperforming baselines in generalization. The work provides both a theoretical foundation and a practical mechanism for enhancing out-of-distribution stability in deep networks.
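The boundedness claim above can be illustrated with a minimal NumPy sketch. This is not the paper's infinite-width setting or its exact architecture: the 3-layer ReLU MLP, its random (untrained) weights, and the placement of the single LN layer are all illustrative assumptions. The point it shows is the scaling behaviour: a ReLU network's output grows linearly with the input scale, while inserting one LayerNorm makes the output nearly invariant to that scale.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-layer ReLU MLP with random untrained weights
# (illustrative only; not the paper's trained infinite-width network).
d, h = 16, 256
W1 = rng.normal(0, 1 / np.sqrt(d), (h, d))
W2 = rng.normal(0, 1 / np.sqrt(h), (h, h))
W3 = rng.normal(0, 1 / np.sqrt(h), (1, h))

def layer_norm(z, eps=1e-5):
    # Standard LN: center and rescale activations to unit variance.
    return (z - z.mean()) / np.sqrt(z.var() + eps)

def forward(x, use_ln):
    z = np.maximum(W1 @ x, 0.0)
    if use_ln:
        # The single LN layer: after this, the activation scale no
        # longer depends on the magnitude of the input.
        z = layer_norm(z)
    z = np.maximum(W2 @ z, 0.0)
    return float(W3 @ z)

x = rng.normal(size=d)
for scale in (1.0, 10.0, 1000.0):
    # Without LN the output scales linearly with the input (ReLU is
    # positively homogeneous); with LN it stays essentially constant.
    print(scale, abs(forward(scale * x, False)), abs(forward(scale * x, True)))
```

Moving the input far from the "training" region (here, simply scaling it up) blows up the plain network's output but leaves the LN network's output bounded, which is the finite-width intuition behind the paper's bounded-variance kernel result.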

📝 Abstract
In spite of their prevalence, the behaviour of Neural Networks when extrapolating far from the training distribution remains poorly understood, with existing results limited to specific cases. In this work, we prove general results -- the first of their kind -- by applying Neural Tangent Kernel (NTK) theory to analyse infinitely-wide neural networks trained until convergence and prove that the inclusion of just one Layer Norm (LN) fundamentally alters the induced NTK, transforming it into a bounded-variance kernel. As a result, the output of an infinitely wide network with at least one LN remains bounded, even on inputs far from the training data. In contrast, we show that a broad class of networks without LN can produce pathologically large outputs for certain inputs. We support these theoretical findings with empirical experiments on finite-width networks, demonstrating that while standard NNs often exhibit uncontrolled growth outside the training domain, a single LN layer effectively mitigates this instability. Finally, we explore real-world implications of this extrapolatory stability, including applications to predicting residue sizes in proteins larger than those seen during training and estimating age from facial images of underrepresented ethnicities absent from the training set.
Problem

Research questions and friction points this paper is trying to address.

Understanding how neural networks extrapolate far from the training data
Analyzing the impact of Layer Norm on the stability of the Neural Tangent Kernel
Mitigating uncontrolled output growth in networks without Layer Norm
Innovation

Methods, ideas, or system contributions that make the work stand out.

A single Layer Norm layer guarantees stable extrapolation
Neural Tangent Kernel theory applied to infinitely wide networks trained to convergence
Bounded-variance kernel prevents uncontrolled output growth