🤖 AI Summary
This work addresses the quantification of prediction uncertainty arising from weight initialization in deep neural networks, with particular relevance to physical science applications. Using the neural tangent kernel (NTK) framework, the authors compute the mean and variance of the test loss for an ensemble of infinite-width multi-layer perceptrons and compare them empirically, across training set sizes, with finite-width networks on MNIST classification, CIFAR classification, and calorimeter energy regression. Their key findings are: (1) both the mean and the standard deviation of the test loss follow scaling laws in the training set size, while their ratio, the coefficient of variation, becomes independent of training set size at both infinite and finite width once the training set is sufficiently large; and (2) the coefficient of variation of a finite-width network may therefore be approximated by its infinite-width value, and may in principle be calculable using finite-width perturbation theory.
📝 Abstract
Quantifying the uncertainty from machine learning analyses is critical to their use in the physical sciences. In this work we focus on uncertainty inherited from the initialization distribution of neural networks. We compute the mean $\mu_{\mathcal{L}}$ and variance $\sigma_{\mathcal{L}}^2$ of the test loss $\mathcal{L}$ for an ensemble of multi-layer perceptrons (MLPs) with neural tangent kernel (NTK) initialization in the infinite-width limit, and compare empirically to the results from finite-width networks for three example tasks: MNIST classification, CIFAR classification and calorimeter energy regression. We observe scaling laws as a function of training set size $N_{\mathcal{D}}$ for both $\mu_{\mathcal{L}}$ and $\sigma_{\mathcal{L}}$, but find that the coefficient of variation $\epsilon_{\mathcal{L}} \equiv \sigma_{\mathcal{L}}/\mu_{\mathcal{L}}$ becomes independent of $N_{\mathcal{D}}$ at both infinite and finite width for sufficiently large $N_{\mathcal{D}}$. This implies that the coefficient of variation of a finite-width network may be approximated by its infinite-width value, and may in principle be calculable using finite-width perturbation theory.
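The ensemble statistics defined in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: a toy random-feature regression (random ReLU features with a least-squares readout) stands in for training an NTK-initialized MLP. The names `loss_statistics` and `toy_test_loss`, the sine target, and all sizes are assumptions made for illustration. Only the random initialization changes between ensemble members; the data are held fixed.

```python
import numpy as np


def loss_statistics(losses):
    """Return (mu, sigma, epsilon) for test losses over an ensemble of inits."""
    losses = np.asarray(losses, dtype=float)
    mu = losses.mean()
    sigma = losses.std(ddof=1)  # sample standard deviation over the ensemble
    return mu, sigma, sigma / mu  # epsilon = sigma / mu


def toy_test_loss(seed, n_train, width=256):
    """Test MSE of a toy random-feature model (a stand-in, not an NTK MLP)."""
    # The data generator uses a fixed seed, so the data are identical
    # for every ensemble member.
    data_rng = np.random.default_rng(0)
    x_test = data_rng.uniform(-1.0, 1.0, size=(200, 1))
    x_train = data_rng.uniform(-1.0, 1.0, size=(n_train, 1))

    def target(x):
        return np.sin(3.0 * x[:, 0])  # hypothetical regression target

    # Only the random first layer depends on the ensemble seed.
    init_rng = np.random.default_rng(1000 + seed)
    w = init_rng.normal(size=(1, width))
    b = init_rng.normal(size=width)

    def features(x):
        return np.maximum(x @ w + b, 0.0) / np.sqrt(width)

    # Least-squares readout on the frozen random features.
    coef, *_ = np.linalg.lstsq(features(x_train), target(x_train), rcond=None)
    residual = features(x_test) @ coef - target(x_test)
    return float(np.mean(residual ** 2))


if __name__ == "__main__":
    for n_train in (16, 64, 256):
        losses = [toy_test_loss(seed, n_train) for seed in range(20)]
        mu, sigma, eps = loss_statistics(losses)
        print(f"N_D={n_train:4d}  mu={mu:.3e}  sigma={sigma:.3e}  eps={eps:.3f}")
```

Running the script prints $\mu_{\mathcal{L}}$, $\sigma_{\mathcal{L}}$ and $\epsilon_{\mathcal{L}}$ for a few training set sizes. Whether this toy model reproduces the plateau in $\epsilon_{\mathcal{L}}$ reported in the abstract is not guaranteed; the sketch only shows how the three quantities are computed.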