🤖 AI Summary
This work investigates the generalization behavior of deep neural networks (DNNs) in the strongly overparameterized regime, in which the number of training samples, the input dimension, and the hidden-layer widths all diverge proportionally, and where DNNs risk "trivializing" into linear models.
Method: To address this, the authors develop a high-dimensional asymptotic analysis framework in the Bayes-optimal setting, combining information theory, random matrix theory, and the Gaussian equivalence principle.
Contribution/Results: They establish, for the first time, an information-theoretic equivalence between DNNs of arbitrary depth and generalized linear models (GLMs) in terms of optimal generalization performance in this limit. This yields an exact closed-form expression for the optimal generalization error, rigorously confirming the conjecture of Cui et al. (2023). Crucially, the analysis shows that escaping this linear degradation requires substantially increasing the effective signal-to-noise ratio, specifically by scaling up the volume of training data. The work uncovers the fundamental mechanism behind the failure of depth in overparameterized regimes and provides a theoretical foundation for understanding the expressive boundaries of overparameterized DNNs.
📝 Abstract
We rigorously analyse fully-trained neural networks of arbitrary depth in the Bayesian optimal setting in the so-called proportional scaling regime where the number of training samples and width of the input and all inner layers diverge proportionally. We prove an information-theoretic equivalence between the Bayesian deep neural network model trained from data generated by a teacher with matching architecture, and a simpler model of optimal inference in a generalized linear model. This equivalence enables us to compute the optimal generalization error for deep neural networks in this regime. We thus prove the "deep Gaussian equivalence principle" conjectured in Cui et al. (2023) (arXiv:2302.00375). Our result highlights that in order to escape this "trivialisation" of deep neural networks (in the sense of reduction to a linear model) happening in the strongly overparametrized proportional regime, models trained from much more data have to be considered.
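To make the scaling regime concrete, the limit described in the abstract can be written schematically as follows. The notation here ($n$ for samples, $d$ for input dimension, $d_\ell$ for hidden widths, $\alpha$ and $\gamma_\ell$ for their fixed ratios, $\varepsilon^{\mathrm{opt}}$ for the Bayes-optimal generalization error) is our own shorthand, not taken from the paper, so this is only a hedged sketch of the statement:

```latex
% Proportional scaling regime: all dimensions diverge with fixed ratios
n,\, d,\, d_1, \dots, d_L \;\to\; \infty, \qquad
\frac{n}{d} \to \alpha, \qquad \frac{d_\ell}{d} \to \gamma_\ell \in (0, \infty),
\quad \ell = 1, \dots, L.

% Claimed equivalence (schematic): in this limit, the Bayes-optimal
% generalization error of the depth-L network matches that of an
% appropriately calibrated generalized linear model (GLM).
\lim_{d \to \infty} \varepsilon^{\mathrm{opt}}_{\mathrm{DNN}}
\;=\;
\varepsilon^{\mathrm{opt}}_{\mathrm{GLM}}\!\left(\alpha,\, \{\gamma_\ell\}\right).
```

The second display is the "deep Gaussian equivalence principle" in symbols: depth enters only through the effective parameters of the equivalent GLM, which is why the network reduces to a linear model unless the sample ratio $\alpha$ is scaled up substantially.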