🤖 AI Summary
This work investigates how stochastic gradient descent (SGD) noise affects the training dynamics of deep linear networks in the saddle-to-saddle regime. By modeling SGD as stochastic Langevin dynamics with state-dependent, anisotropic noise and assuming aligned and balanced weights, the high-dimensional dynamics are decomposed exactly into a system of one-dimensional, per-mode stochastic differential equations. This decomposition shows that diffusion along a mode peaks before the corresponding feature is completely learned, so the noise encodes progress in feature learning without fundamentally altering the saddle-to-saddle dynamics. The analysis also gives the stationary distribution of SGD for each mode: in the absence of label noise, its marginal along specific features coincides with the stationary distribution of gradient flow, whereas with label noise it approximates a Boltzmann distribution. Experiments confirm that these findings hold qualitatively even when the weights are neither aligned nor balanced.
📝 Abstract
Deep linear networks (DLNs) are used as an analytically tractable model of the training dynamics of deep neural networks. While gradient descent in DLNs is known to exhibit saddle-to-saddle dynamics, the impact of stochastic gradient descent (SGD) noise on this regime remains poorly understood. We investigate the dynamics of SGD during training of DLNs in the saddle-to-saddle regime. We model the training dynamics as stochastic Langevin dynamics with anisotropic, state-dependent noise. Under the assumption of aligned and balanced weights, we derive an exact decomposition of the dynamics into a system of one-dimensional per-mode stochastic differential equations. This decomposition establishes that the maximal diffusion along a mode precedes the corresponding feature being completely learned. We also derive the stationary distribution of SGD for each mode: in the absence of label noise, the marginal of this distribution along specific features coincides with the stationary distribution of gradient flow, while in the presence of label noise it approximates a Boltzmann distribution. Finally, we confirm experimentally that the theoretical results hold qualitatively even without aligned or balanced weights. These results establish that SGD noise encodes information about the progression of feature learning but does not fundamentally alter the saddle-to-saddle dynamics.
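As a rough illustration of the single-mode picture described above, the sketch below runs minibatch SGD on a scalar two-layer linear network `y_hat = w2 * w1 * x` with small, balanced initial weights, and tracks the mode strength `a = w1 * w2`. It is a toy stand-in, not the paper's model or derivation: the function names (`run_sgd`, `spread_across_runs`), the standard Gaussian inputs, and every hyperparameter value are assumptions chosen for illustration.

```python
import random
import statistics


def run_sgd(target=2.0, label_noise=0.0, lr=0.01, batch=8,
            steps=2000, init=0.05, seed=0):
    """Minibatch SGD on a scalar two-layer linear network y_hat = w2 * w1 * x.

    Returns the trajectory of the mode strength a = w1 * w2.
    All defaults are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    w1 = w2 = init  # small, balanced start: the iterate begins near the saddle at 0
    traj = []
    for _ in range(steps):
        g1 = g2 = 0.0
        for _ in range(batch):
            x = rng.gauss(0.0, 1.0)
            y = target * x + label_noise * rng.gauss(0.0, 1.0)
            err = w2 * w1 * x - y  # residual of the per-sample loss 0.5 * err**2
            g1 += err * w2 * x     # d(loss)/d(w1)
            g2 += err * w1 * x     # d(loss)/d(w2)
        # Simultaneous update of both layers with the minibatch-averaged gradient.
        w1, w2 = w1 - lr * g1 / batch, w2 - lr * g2 / batch
        traj.append(w1 * w2)
    return traj


def spread_across_runs(n_runs=20, **kwargs):
    """Standard deviation of a = w1 * w2 across independent seeds, per step."""
    trajs = [run_sgd(seed=s, **kwargs) for s in range(n_runs)]
    return [statistics.pstdev(values) for values in zip(*trajs)]


if __name__ == "__main__":
    clean = run_sgd(label_noise=0.0)
    noisy = run_sgd(label_noise=0.5)
    print("final mode strength without label noise:", clean[-1])
    print("final mode strength with label noise:   ", noisy[-1])

    spread = spread_across_runs(label_noise=0.0)
    peak_step = max(range(len(spread)), key=spread.__getitem__)
    print("step with the largest spread across runs:", peak_step)
```

In this toy setting the residual is zero at `a = target` when there is no label noise, so the minibatch gradient noise vanishes there and the iterate settles at the gradient-flow fixed point; with label noise the iterate keeps fluctuating around the target. The spread across seeds is only a crude proxy for the per-mode diffusion analyzed in the paper.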