🤖 AI Summary
This work investigates how SGD hyperparameters (learning rate, batch size, and initial weight variance) affect the training dynamics of multilayer neural networks. We introduce a phase-diagram framework based on the evolution of the singular values of the weight matrices, in which the initial weight variance plays the role of an effective disorder strength and the ratio of the learning rate to the batch size acts as an effective temperature. Leveraging a Langevin equation derived from Dyson Brownian motion, together with mean-field theory and random matrix theory, we characterize the stochastic dynamics of soft spin degrees of freedom in feature space. The analysis reveals three distinct dynamical phases (convergent, oscillatory, and divergent), each corresponding to qualitatively different training behavior. Based on this classification, we establish theoretical criteria for hyperparameter selection, providing a unified statistical-physics perspective and a quantitative foundation for understanding SGD optimization and guiding practical hyperparameter tuning.
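As a rough illustration of the quantities involved, the sketch below trains a toy two-layer network with minibatch SGD and tracks the singular values of a weight matrix; the effective temperature is the learning-rate-to-batch-size ratio and the disorder strength is the initialization variance. The architecture, data, and hyperparameter values here are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative hyperparameters (not the paper's settings).
lr, batch_size, init_var = 0.05, 32, 1.0
temperature = lr / batch_size          # effective temperature T = lr / batch size
print(f"effective temperature T = {temperature:.4g}, disorder strength = {init_var}")

# Toy regression data and a two-layer network with MSE loss.
d_in, d_hid, d_out, n = 20, 64, 1, 2048
X = rng.standard_normal((n, d_in))
y = np.tanh(X @ rng.standard_normal((d_in, d_out)))

# Fan-in-scaled Gaussian initialization; init_var sets the disorder strength.
W1 = rng.normal(0.0, np.sqrt(init_var / d_in), (d_in, d_hid))
W2 = rng.normal(0.0, np.sqrt(init_var / d_hid), (d_hid, d_out))

for step in range(2001):
    idx = rng.choice(n, batch_size, replace=False)
    xb, yb = X[idx], y[idx]
    h = np.tanh(xb @ W1)                      # forward pass
    err = h @ W2 - yb                         # residual for the MSE loss
    gW2 = h.T @ err / batch_size              # backprop through the output layer
    gW1 = xb.T @ ((err @ W2.T) * (1 - h**2)) / batch_size
    W1 -= lr * gW1
    W2 -= lr * gW2
    if step % 500 == 0:
        # Monitor the singular-value spectrum of the first weight matrix.
        s = np.linalg.svd(W1, compute_uv=False)
        print(f"step {step:5d}  loss {np.mean(err**2):.4f}  "
              f"top singular values of W1: {np.round(s[:3], 3)}")
```

Sweeping lr, batch_size, and init_var in a script like this and recording whether the spectrum settles, oscillates, or blows up is one simple way to probe the phase boundaries described above.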
📝 Abstract
Hyperparameter tuning is one of the essential steps to guarantee the convergence of machine learning models. We argue that intuition about the optimal choice of hyperparameters for stochastic gradient descent can be obtained by studying a neural network's phase diagram, in which each phase is characterised by distinctive dynamics of the singular values of weight matrices. Taking inspiration from disordered systems, we start from the observation that the loss landscape of a multilayer neural network with a mean squared error loss can be interpreted as a disordered system in feature space, where the learnt features are mapped to soft spin degrees of freedom, the initial variance of the weight matrices is interpreted as the strength of the disorder, and temperature is given by the ratio of the learning rate to the batch size. As the model is trained, three phases can be identified, in which the dynamics of the weight matrices is qualitatively different. Employing a Langevin equation for stochastic gradient descent, previously derived using Dyson Brownian motion, we demonstrate that the three dynamical regimes can be classified effectively, providing practical guidance for the choice of hyperparameters of the optimiser.
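To make the Dyson Brownian motion picture concrete, the following sketch integrates a generic Dyson Brownian motion for a handful of eigenvalues with an Euler–Maruyama scheme: a quadratic confining potential (an assumed stand-in for the loss-gradient drift), pairwise eigenvalue repulsion, and noise whose amplitude is set by the effective temperature, i.e. the learning-rate-to-batch-size ratio. The normalisation of the terms follows a common textbook convention and may differ from the equation derived in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Effective temperature from the SGD hyperparameters (illustrative values).
lr, batch_size = 0.05, 32
T = lr / batch_size                        # T = learning rate / batch size

n_eig, dt, n_steps = 8, 1e-4, 20000
lam = np.sort(rng.standard_normal(n_eig))  # initial eigenvalues

for step in range(n_steps + 1):
    # Pairwise repulsion sum_{j != i} 1 / (lam_i - lam_j),
    # the hallmark of Dyson Brownian motion (drives level repulsion).
    diff = lam[:, None] - lam[None, :]
    np.fill_diagonal(diff, np.inf)         # drop the i == j term
    repulsion = (1.0 / diff).sum(axis=1)

    # Euler-Maruyama step: quadratic confining potential (an assumed
    # stand-in for the loss-gradient drift) plus repulsion, with
    # thermal noise of amplitude sqrt(2 T dt).
    drift = -lam + repulsion
    lam = lam + drift * dt + np.sqrt(2.0 * T * dt) * rng.standard_normal(n_eig)

    if step % 5000 == 0:
        print(f"step {step:6d}  eigenvalues: {np.round(np.sort(lam), 3)}")
```

In this toy setting, the balance between confinement, repulsion, and the temperature-scaled noise determines whether the spectrum relaxes to a stationary configuration or spreads, mirroring the qualitative distinction between the dynamical regimes discussed above.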