🤖 AI Summary
This paper addresses generalization failure in high-dimensional regression caused by spurious correlations: non-predictive features that exhibit accidental statistical coupling with the target. Methodologically, it establishes, for the first time, an explicit analytical relationship among the spurious correlation strength $C$, the ridge regularization parameter $\lambda$, and the data's covariance structure. It further reveals a fundamental trade-off between $C$ and the in-distribution test loss $L$, and proves that overparameterized random feature models are statistically equivalent to regularized linear regression. The analysis leverages tools from high-dimensional statistics, matrix perturbation theory, and Schur complement techniques. Experiments on Gaussian synthetic data, Color-MNIST, and CIFAR-10 validate the theory. Results show that the value of $\lambda$ minimizing $L$ lies in an interval where $C$ increases with $\lambda$, uncovering a quantifiable, unified principle linking regularization strength, feature simplicity, and generalization performance.
📝 Abstract
Learning models have been shown to rely on spurious correlations between non-predictive features and the associated labels in the training data, with negative implications for robustness, bias, and fairness. In this work, we provide a statistical characterization of this phenomenon for high-dimensional regression, when the data contain a predictive core feature $x$ and a spurious feature $y$. Specifically, we quantify the amount of spurious correlation $C$ learned via linear regression, in terms of the data covariance and the strength $\lambda$ of the ridge regularization. As a consequence, we first capture the simplicity of $y$ through the spectrum of its covariance, and its correlation with $x$ through the Schur complement of the full data covariance. Next, we prove a trade-off between $C$ and the in-distribution test loss $L$, by showing that the value of $\lambda$ that minimizes $L$ lies in an interval where $C$ is increasing. Finally, we investigate the effects of over-parameterization via the random features model, by showing its equivalence to regularized linear regression. Our theoretical results are supported by numerical experiments on Gaussian, Color-MNIST, and CIFAR-10 datasets.