🤖 AI Summary
Neural network loss landscapes exhibit spurious flat minima that mislead optimizers, yet their geometric origins remain poorly understood. Method: We identify a novel class of “infinite-flat channels” wherein the output weights of two neurons diverge to ±∞ while their input weight vectors become asymptotically parallel; along such channels, the loss decreases slowly, creating a false impression of flatness. Using gradient flow analysis, nonlinear dynamical systems modeling, and differential-geometric characterization, we establish—rigorously for the first time—the geometric correspondence between these channels and symmetry-induced critical lines. Contribution/Results: We prove that optimization trajectories within these channels converge to functionally equivalent solutions and implicitly implement gated linear unit (GLU)-like computations. Empirically, this phenomenon occurs frequently in multivariate regression tasks, offering a unified explanation for optimizer entrapment in pseudo-flat regions and revealing a new principle: fully connected layers implicitly learn structured activation mechanisms through geometric symmetries in parameter space.
📝 Abstract
The loss landscapes of neural networks contain minima and saddle points that may be connected in flat regions or appear in isolation. We identify and characterize a special structure in the loss landscape: channels along which the loss decreases extremely slowly, while the output weights of at least two neurons, $a_i$ and $a_j$, diverge to $\pm\infty$, and their input weight vectors, $\mathbf{w}_i$ and $\mathbf{w}_j$, become equal to each other. At convergence, the two neurons implement a gated linear unit: $a_i\,\sigma(\mathbf{w}_i \cdot \mathbf{x}) + a_j\,\sigma(\mathbf{w}_j \cdot \mathbf{x}) \rightarrow \sigma(\mathbf{w} \cdot \mathbf{x}) + (\mathbf{v} \cdot \mathbf{x})\,\sigma'(\mathbf{w} \cdot \mathbf{x})$. Geometrically, these channels to infinity are asymptotically parallel to symmetry-induced lines of critical points. Gradient-flow solvers and related optimization methods such as SGD or Adam reach the channels with high probability in diverse regression settings, but without careful inspection they look like flat local minima with finite parameter values. Our characterization provides a comprehensive picture of these quasi-flat regions in terms of gradient dynamics, geometry, and functional interpretation. The emergence of gated linear units at the end of the channels highlights a surprising aspect of the computational capabilities of fully connected layers.
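The limit in the abstract can be checked numerically: set $a_i = 1/\epsilon$, $a_j = 1 - 1/\epsilon$, $\mathbf{w}_i = \mathbf{w} + \epsilon\mathbf{v}$, $\mathbf{w}_j = \mathbf{w}$, so the neuron pair becomes a finite-difference approximation of $\sigma(\mathbf{w}\cdot\mathbf{x}) + (\mathbf{v}\cdot\mathbf{x})\,\sigma'(\mathbf{w}\cdot\mathbf{x})$. The sketch below is illustrative, not the paper's code; the choice of `tanh` as the activation and this particular parameterization of the diverging weights are assumptions for the example.

```python
import numpy as np

def sigma(z):
    # Smooth activation; tanh is an assumption for this illustration.
    return np.tanh(z)

def sigma_prime(z):
    return 1.0 - np.tanh(z) ** 2

rng = np.random.default_rng(0)
w = rng.normal(size=3)  # shared input weight vector
v = rng.normal(size=3)  # direction separating the two input weights
x = rng.normal(size=3)  # an arbitrary input point

eps = 1e-5
a_i, a_j = 1.0 / eps, 1.0 - 1.0 / eps  # output weights diverging to +/- infinity
w_i, w_j = w + eps * v, w              # input weights becoming equal

# Output of the two-neuron pair deep inside the channel.
pair = a_i * sigma(w_i @ x) + a_j * sigma(w_j @ x)

# Limiting gated-linear-unit computation from the abstract.
glu = sigma(w @ x) + (v @ x) * sigma_prime(w @ x)

print(abs(pair - glu))  # close to zero as eps -> 0
```

Shrinking `eps` (while `a_i`, `a_j` diverge accordingly) drives the gap to zero, mirroring how the two neurons become functionally equivalent to a GLU-like unit at the end of the channel.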