🤖 AI Summary
Existing gradient flow theory predicts that deep linear networks with multiple pathways converge to solutions dominated by a single pathway because of symmetry breaking, a prediction that conflicts with what is observed empirically under discrete optimization. This work resolves the discrepancy by analyzing discrete gradient descent dynamics, quantifying loss landscape sharpness, and leveraging the stability margin framework. We show that under large learning rates, whose effect depends on both network depth and pathway count, early symmetry breaking is followed by a rebalancing of signal across pathways. Crucially, we establish for the first time that single-pathway solutions correspond to high-sharpness minima, whereas multi-pathway shared representations yield significantly flatter minima. Large-step gradient descent inherently favors such flat minima, thereby restoring pathway symmetry and challenging the classical conclusions derived from continuous gradient flow models.
📝 Abstract
Recent analyses of multi-pathway Deep Linear Networks use Gradient Flow (GF) to predict a "winner-takes-all" specialization in which path symmetry breaks and each feature concentrates in a single pathway. In this work, we show that discrete Gradient Descent (GD) with a large step size tells a different story. We prove that single-path solutions are sharp minima, whereas distributing signals across pathways reduces sharpness by a factor that decreases with both the number of pathways and depth. Consequently, while early training reproduces the depth-driven symmetry breaking predicted by GF, oscillations at the Edge of Stability subsequently override this tendency and drive the network into a re-balancing phase, where signals redistribute across pathways. Together, these results clarify how depth shapes pathway competition and explain why large-step GD favors shared representations rather than persistent single-pathway dominance.
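The sharpness claim can be illustrated with a minimal sketch on a toy model. Everything in it is an assumption made for illustration and not the paper's setting: scalar weights, a squared loss 0.5 * (f - s)^2 with f the sum over pathways of each pathway's weight product, nonnegative pathway products, and balanced pathways (all weights within a pathway equal). The function names are hypothetical. At a zero-loss point the Hessian of this loss is the outer product of the gradient of f with itself, so the sharpness (top Hessian eigenvalue) is the squared gradient norm.

```python
def sharpness_at_minimum(path_products, depth):
    """Top Hessian eigenvalue of 0.5 * (f - s)**2 at a zero-loss point.

    Toy assumption: f = sum_k prod_l w[k][l] with scalar weights, and every
    weight in pathway k equals path_products[k] ** (1 / depth), with
    path_products[k] >= 0. At zero loss the Hessian is g g^T with g = grad f,
    so its only nonzero eigenvalue is ||g||^2.
    """
    total = 0.0
    for c in path_products:
        w = c ** (1.0 / depth)        # common weight value in this pathway
        partial = w ** (depth - 1)    # d(prod_l w_l) / d(w_j) when all w_l = w
        total += depth * partial ** 2 # one such partial per layer
    return total


def sharpness_ratio(num_paths, depth, signal=1.0):
    """Sharpness of the evenly shared solution divided by the single-path one."""
    single = sharpness_at_minimum([signal] + [0.0] * (num_paths - 1), depth)
    shared = sharpness_at_minimum([signal / num_paths] * num_paths, depth)
    return shared / single


def max_stable_step(sharpness):
    """Largest step size for which GD is linearly stable on a quadratic
    with this curvature (the classical 2 / sharpness threshold)."""
    return 2.0 / sharpness


if __name__ == "__main__":
    for depth in (2, 3, 6):
        for num_paths in (2, 4, 8):
            print(depth, num_paths, sharpness_ratio(num_paths, depth))
```

In this toy model the ratio works out to K^((2 - L) / L) for K pathways of depth L: there is no reduction at depth 2, and the ratio shrinks as either K or L grows, which matches the qualitative statement in the abstract. Because the shared solution is flatter, `max_stable_step` is larger for it, so a step size that destabilizes the single-path minimum can still be stable at the shared one.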