🤖 AI Summary
Existing theoretical analyses lack a rigorous characterization of the properties of global optima of modern architectures, such as Transformers and ResNets, in data-aware settings.
Method: We establish a rigorous end-to-end equivalence between standard training dynamics and an unconstrained features model, integrating LayerNorm, residual connections, high-dimensional geometric modeling, and regularization. This enables theoretical analysis of global optima under cross-entropy or MSE loss.
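For context, the unconstrained features model mentioned above is commonly written in the neural collapse literature in the following form (a standard formulation shown for reference; the paper's exact regularization terms may differ):

$$
\min_{W,\,H}\;\; \frac{1}{N}\sum_{i=1}^{N} \mathcal{L}\!\left(W h_i,\; y_i\right) \;+\; \frac{\lambda_W}{2}\,\|W\|_F^2 \;+\; \frac{\lambda_H}{2}\,\|H\|_F^2,
$$

where the penultimate-layer features $h_i$ (the columns of $H$) are treated as free optimization variables rather than as outputs of a fixed backbone, and $W$ is the last-layer classifier. The paper's reduction justifies this relaxation by showing that large-depth ResNet and transformer training is equivalent to such a model.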
Contribution/Results: We prove, for the first time, that deep regularized Transformers and ResNets exhibit approximate neural collapse at their global optima in the data-aware regime, with the approximation becoming tighter as depth grows and the deviation from exact collapse vanishing asymptotically. Empirical validation on computer vision and NLP benchmarks confirms that deeper networks exhibit more pronounced collapse. This work establishes the first formal link between global optimality and neural collapse in modern deep architectures, providing a theoretical foundation for understanding the implicit bias of deep learning.
📝 Abstract
The empirical emergence of neural collapse -- a surprising symmetry in the feature representations of the training data in the penultimate layer of deep neural networks -- has spurred a line of theoretical research aimed at its understanding. However, existing work either focuses on data-agnostic models or, when data structure is taken into account, remains limited to multi-layer perceptrons. Our paper fills both gaps by analyzing modern architectures in a data-aware regime: we prove that global optima of deep regularized transformers and residual networks (ResNets) with LayerNorm trained with cross-entropy or mean squared error loss are approximately collapsed, and the approximation gets tighter as the depth grows. More generally, we formally reduce any end-to-end large-depth ResNet or transformer training into an equivalent unconstrained features model, thus justifying its wide use in the literature even beyond data-agnostic settings. Our theoretical results are supported by experiments on computer vision and language datasets showing that, as the depth grows, neural collapse indeed becomes more prominent.
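To make the notion of "approximately collapsed" concrete, below is a minimal sketch of a standard within-class variability measure from the neural collapse literature, $\mathrm{tr}(\Sigma_W \Sigma_B^{\dagger})/K$, which approaches zero as each class's features concentrate on their class mean. This is an illustrative metric assuming NumPy; the function name `nc1_metric` is hypothetical and the paper's exact collapse measure may differ.

```python
import numpy as np

def nc1_metric(features, labels):
    """Within-class variability relative to between-class variability:
    trace(Sigma_W @ pinv(Sigma_B)) / K.  Values near zero indicate that
    the features of each class have collapsed onto their class mean."""
    classes = np.unique(labels)
    K = len(classes)
    n, d = features.shape
    mu_g = features.mean(axis=0)          # global feature mean
    Sw = np.zeros((d, d))                 # within-class covariance
    Sb = np.zeros((d, d))                 # between-class covariance
    for c in classes:
        fc = features[labels == c]
        mu_c = fc.mean(axis=0)
        Sw += (fc - mu_c).T @ (fc - mu_c) / n
        diff = (mu_c - mu_g)[:, None]
        Sb += diff @ diff.T * len(fc) / n
    return np.trace(Sw @ np.linalg.pinv(Sb)) / K
```

Tracking such a quantity on penultimate-layer features at increasing depths is one way to observe the trend the abstract describes: as depth grows, the metric should shrink toward zero.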