🤖 AI Summary
We investigate the probability mass of parameter regions in quantized neural networks that satisfy specific behavioral constraints, such as achieving test loss below a threshold, to uncover intrinsic links between this mass, model information content, and generalization. Existing prior-volume estimation methods under Gaussian or uniform priors underestimate the true mass by millions of orders of magnitude; we propose a gradient-based importance sampling approach that dramatically reduces, though does not fully eliminate, this error. Theoretically, we connect the negative logarithm of this probability mass to Minimum Description Length (MDL) principles and rate-distortion theory, interpreting it as a measure of a network's information content. Empirically, we show that model information monotonically increases during training, and that poorly generalizing behaviors occupy significantly smaller parameter subspaces, providing evidence that neural networks possess an inherent inductive bias toward well-generalizing functions.
📝 Abstract
We present an algorithm for estimating the probability mass, under a Gaussian or uniform prior, of a region in neural network parameter space corresponding to a particular behavior, such as achieving test loss below some threshold. When the prior is uniform, this problem is equivalent to measuring the volume of a region. We show empirically and theoretically that existing algorithms for estimating volumes in parameter space underestimate the true volume by millions of orders of magnitude. We find that this error can be dramatically reduced, but not entirely eliminated, with an importance sampling method using gradient information that is already provided by popular optimizers. The negative logarithm of this probability can be interpreted as a measure of a network's information content, in accordance with minimum description length (MDL) principles and rate-distortion theory. As expected, this quantity increases during language model training. We also find that badly-generalizing behavioral regions are smaller, and therefore less likely to be sampled at random, demonstrating an inductive bias towards well-generalizing functions.
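To make the gap described above concrete, here is a minimal toy sketch (not the paper's algorithm): we estimate the prior mass of a tiny low-loss region under a uniform prior, comparing naive Monte Carlo against an importance-sampling proposal centered at a minimum found by gradient descent. The quadratic loss, the box prior on `[-1, 1]^2`, the minimizer `w_star`, and the Gaussian proposal width `sigma` are all illustrative assumptions, chosen so the region's true mass is known in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
w_star = np.array([0.3, -0.2])  # hypothetical loss minimizer (illustrative)
t = 1e-6                        # loss threshold defining the behavioral region

def loss(w):
    # Toy quadratic loss: the "good" region {loss < t} is a disk of radius sqrt(t).
    return np.sum((w - w_star) ** 2, axis=-1)

def grad(w):
    return 2.0 * (w - w_star)

# Uniform prior on the box [-1, 1]^2, so the prior density is 1/4 and the
# region's probability mass is (disk area) * density = pi * t / 4.
prior_density = 0.25
true_mass = np.pi * t * prior_density

n = 100_000

# Naive Monte Carlo: sample the prior and count hits. The region is so small
# that, with high probability, no sample lands in it at all.
u = rng.uniform(-1.0, 1.0, size=(n, 2))
naive_est = np.mean(loss(u) < t)

# Gradient-informed proposal: descend to a minimum, then sample a narrow
# Gaussian around it and reweight each hit by prior density / proposal density.
w = rng.uniform(-1.0, 1.0, size=2)
for _ in range(200):
    w -= 0.1 * grad(w)
sigma = 0.01
x = w + sigma * rng.standard_normal((n, 2))
q = np.exp(-np.sum((x - w) ** 2, axis=1) / (2 * sigma**2)) / (2 * np.pi * sigma**2)
is_est = np.mean((loss(x) < t) * prior_density / q)

print(f"true mass        {true_mass:.3e}")
print(f"naive estimate   {naive_est:.3e}")
print(f"importance est.  {is_est:.3e}")
```

The importance-sampling estimate recovers the true mass to within a few percent, while the naive estimator typically returns zero; in high-dimensional networks, the paper reports, this failure mode compounds to errors of millions of orders of magnitude.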