Preventing Model Collapse Under Overparametrization: Optimal Mixing Ratios for Interpolation Learning and Ridge Regression

📅 2025-09-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates model collapse in overparameterized linear regression arising from iteratively mixing ground-truth labels with model-generated synthetic labels. We propose an iterative learning framework incorporating a tunable label-mixing ratio and conduct a rigorous asymptotic analysis of the generalization error for both minimum-ℓ₂-norm interpolation and ridge regression. Our key theoretical finding is that, under minimum-ℓ₂-norm interpolation, the optimal proportion of ground-truth labels converges to the reciprocal of the golden ratio (≈0.618), a phenomenon governed by the geometric structure of the data covariance spectrum; this result formally establishes that model collapse is avoidable. We further derive exact asymptotic generalization error expressions, proving that the optimal mixing ratio is always at least 1/2; that is, ground-truth data must dominate. Extensive simulations corroborate the theory, providing an interpretable, principled foundation for robust self-training.
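
As a concrete illustration of the iterative scheme, here is a minimal Python sketch of label mixing with minimum-ℓ₂-norm interpolation. It assumes isotropic Gaussian covariates with fresh draws each round; the dimensions, noise level, round count, and mixing weight are illustrative choices rather than values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 400                    # overparameterized regime: p > n (illustrative)
alpha = (np.sqrt(5) - 1) / 2       # reciprocal golden ratio, ~0.618

beta_star = rng.standard_normal(p) / np.sqrt(p)  # ground-truth coefficients
beta_hat = np.zeros(p)                           # model from the previous round

for t in range(10):
    X = rng.standard_normal((n, p))                        # fresh isotropic covariates
    y_real = X @ beta_star + 0.1 * rng.standard_normal(n)  # fresh real labels
    y_synth = X @ beta_hat                                 # labels from last round's model
    y_mix = alpha * y_real + (1 - alpha) * y_synth
    # lstsq returns the minimum-l2-norm interpolator when the system is
    # underdetermined (p > n)
    beta_hat, *_ = np.linalg.lstsq(X, y_mix, rcond=None)
    print(f"round {t}: ||beta_hat - beta_star||^2 = "
          f"{np.sum((beta_hat - beta_star) ** 2):.4f}")
```

Setting the mixing weight well below one-half in this sketch lets the synthetic component compound across rounds, which is the degradation mechanism the paper analyzes.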

📝 Abstract
Model collapse occurs when generative models degrade after repeatedly training on their own synthetic outputs. We study this effect in overparameterized linear regression in a setting where each iteration mixes fresh real labels with synthetic labels drawn from the model fitted in the previous iteration. We derive precise generalization error formulae for minimum-$\ell_2$-norm interpolation and ridge regression under this iterative scheme. Our analysis reveals intriguing properties of the optimal mixing weight that minimizes long-term prediction error and provably prevents model collapse. For instance, in the case of min-$\ell_2$-norm interpolation, we establish that the optimal real-data proportion converges to the reciprocal of the golden ratio for fairly general classes of covariate distributions. Previously, this property was known only for ordinary least squares, and only in low dimensions. For ridge regression, we further analyze two popular model classes, the random-effects model and the spiked covariance model, demonstrating how spectral geometry governs optimal weighting. In both cases, as well as for isotropic features, we uncover that the optimal mixing ratio should be at least one-half, reflecting the necessity of favoring real data over synthetic labels. We validate our theoretical results with extensive simulations.
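
The ridge result can be probed numerically as well. Below is a hedged sketch under an assumed diagonal spiked covariance (three illustrative spikes atop an identity bulk); the ridge penalty, spike sizes, round count, and sweep grid are my own choices, not values from the paper.

```python
import numpy as np

n, p, lam, T = 100, 400, 0.1, 5     # sizes, ridge penalty, rounds (all illustrative)

# diagonal spiked covariance: a few large eigenvalues atop an identity bulk
cov_sqrt = np.ones(p)
cov_sqrt[:3] = np.sqrt([25.0, 10.0, 5.0])

def run(alpha, seed=1):
    """Run T rounds of label mixing with ridge; return final estimation error."""
    rng = np.random.default_rng(seed)             # fixed seed so the sweep is paired
    beta_star = rng.standard_normal(p) / np.sqrt(p)
    beta_hat = np.zeros(p)
    for _ in range(T):
        X = rng.standard_normal((n, p)) * cov_sqrt            # rows ~ N(0, spiked diag)
        y_real = X @ beta_star + 0.1 * rng.standard_normal(n)
        y_mix = alpha * y_real + (1 - alpha) * (X @ beta_hat)
        # ridge estimator: (X^T X + n * lam * I)^{-1} X^T y_mix
        beta_hat = np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y_mix)
    return np.sum((beta_hat - beta_star) ** 2)

for alpha in np.linspace(0.3, 1.0, 8):
    print(f"alpha = {alpha:.2f}: error = {run(alpha):.4f}")
```

Sweeping the mixing weight on a paired seed keeps the grid points comparable; the theory predicts the minimizer of the long-run error sits at or above one-half.
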
Problem

Research questions and friction points this paper is trying to address.

Studying model collapse in overparameterized linear regression
Deriving optimal mixing ratios for interpolation and ridge regression
Analyzing how the spectral geometry of the data covariance governs model degradation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Optimal mixing ratios provably prevent model collapse
Under interpolation learning, the optimal real-data proportion converges to the reciprocal golden ratio (see the identity sketched after this list)
Under ridge regression, real data must dominate synthetic labels (optimal mixing ratio at least one-half)
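
For context on the golden-ratio bullet above: the reciprocal golden ratio is the positive root of a simple quadratic. The paper's exact optimality condition involves the covariance spectrum, but the limiting value itself is the standard identity

```latex
\[
  \alpha^2 + \alpha - 1 = 0
  \;\Longrightarrow\;
  \alpha^{*} = \frac{\sqrt{5} - 1}{2} = \frac{1}{\varphi} \approx 0.618,
  \qquad \varphi = \frac{1 + \sqrt{5}}{2}.
\]
```

Note that α* exceeds one-half, consistent with the requirement that real data dominate the mix.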
Authors
Anvit Garg
Department of Statistics, Harvard University
Sohom Bhattacharya
Assistant Professor, Department of Statistics, University of Florida
Pragya Sur
Department of Statistics, Harvard University