🤖 AI Summary
This study examines why loss spikes can appear late in the training of batch-normalized models, a delay that existing theory does not explain because it mainly covers earlier non-monotone behavior caused by overly large fixed learning rates. It tests one stylized hypothesis: normalization gradually increases the effective learning rate during otherwise stable descent, which postpones the onset of instability. For whitened square-loss linear regression, the authors derive explicit no-rising-edge and delayed-onset conditions, bound the waiting time to directional onset, and show that the rising edge self-stabilizes within finitely many iterations. Combined with a square-loss decomposition, these results give a concrete delayed-spike mechanism in the whitened regime. For logistic regression, under highly restrictive active-margin assumptions, the paper proves only a supporting finite-horizon directional precursor in a knife-edge regime. The results are therefore a stylized mechanism study of batch-normalized linear models, not a general explanation of neural-network loss spikes.
📝 Abstract
Delayed loss spikes have been reported in neural-network training, but existing theory mainly explains earlier non-monotone behavior caused by overly large fixed learning rates. We study one stylized hypothesis: normalization can postpone instability by gradually increasing the effective learning rate during otherwise stable descent. To test this hypothesis at theorem level, we analyze batch-normalized linear models. Our flagship result concerns whitened square-loss linear regression, where we derive explicit no-rising-edge and delayed-onset conditions, bound the waiting time to directional onset, and show that the rising edge self-stabilizes within finitely many iterations. Combined with a square-loss decomposition, this yields a concrete delayed-spike mechanism in the whitened regime. For logistic regression, under highly restrictive active-margin assumptions, we prove only a supporting finite-horizon directional precursor in a knife-edge regime, with an optional appendix-only loss lower bound under an extra non-degeneracy condition. The paper should therefore be read as a stylized mechanism study rather than a general explanation of neural-network loss spikes. Within that scope, the results isolate one concrete delayed-instability pathway induced by batch normalization.