🤖 AI Summary
Variable selection in high-dimensional settings with strongly correlated predictors often yields unstable models and unreliable statistical inference. To address this, we propose a diffusion-model-based resample-aggregate framework: high-fidelity synthetic data sets are drawn from a diffusion model fitted to the original data, an off-the-shelf selector such as the lasso or SCAD is applied to each replica, and the resulting inclusion indicators and coefficient estimates are aggregated into stability scores. We establish selection consistency for diffusion-driven variable selection under mild assumptions. Furthermore, we incorporate generative transfer learning to boost power in low-sample regimes and extend the framework to graphical model selection and statistical inference. Empirical evaluations demonstrate that our method significantly outperforms the lasso, stability selection, and knockoffs in strongly correlated settings, achieving a higher true-positive rate, a lower false-discovery rate, and improved confidence-interval coverage.
📝 Abstract
Variable selection for high-dimensional, highly correlated data has long been a challenging problem, often yielding unstable and unreliable models. We propose a resample-aggregate framework that exploits diffusion models' ability to generate high-fidelity synthetic data. Specifically, we draw multiple pseudo-data sets from a diffusion model fitted to the original data, apply any off-the-shelf selector (e.g., lasso or SCAD), and store the resulting inclusion indicators and coefficients. Aggregating across replicas produces a stable subset of predictors with calibrated stability scores for variable selection. Theoretically, we show that the proposed method is selection consistent under mild assumptions. Because the generative model imports knowledge from large pre-trained weights, the procedure naturally benefits from transfer learning, boosting power when the observed sample is small or noisy. We also extend the synthetic-data aggregation framework to other model selection problems, including graphical model selection, and to statistical inference, yielding valid confidence intervals and hypothesis tests. Extensive simulations show consistent gains over the lasso, stability selection, and knockoff baselines, especially when predictors are strongly correlated, with higher true-positive rates and lower false-discovery proportions. By coupling diffusion-based data augmentation with principled aggregation, our method advances variable selection methodology and broadens the toolkit for interpretable, statistically rigorous analysis in complex scientific applications.
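The resample-aggregate loop described above can be sketched in a few lines. This is a minimal, numpy-only illustration under stated assumptions: `lasso_ista`, `resample_aggregate`, and `gaussian_sampler` are hypothetical names, the lasso is solved with a simple ISTA routine rather than the selectors used in the paper, and a joint Gaussian fitted to the data stands in for the fitted diffusion model. The aggregation step itself (inclusion frequencies across replicas, thresholded into a stable support) follows the procedure described in the abstract.

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=500):
    """Minimal lasso solver via ISTA (proximal gradient descent)."""
    n, p = X.shape
    beta = np.zeros(p)
    step = n / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n
        z = beta - step * grad
        # soft-thresholding: the proximal operator of the l1 penalty
        beta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return beta

def resample_aggregate(sampler, p, n_reps=20, lam=0.1, tau=0.8, seed=0):
    """Draw synthetic replicas, run the selector on each, aggregate inclusions."""
    rng = np.random.default_rng(seed)
    inclusion = np.zeros(p)
    for _ in range(n_reps):
        Xs, ys = sampler(rng)                 # pseudo-data from the generative model
        inclusion += np.abs(lasso_ista(Xs, ys, lam)) > 1e-8
    scores = inclusion / n_reps               # per-variable stability scores
    return np.flatnonzero(scores >= tau), scores

# Toy demo: correlated design (AR(1) covariance), true support {0, 1}.
rng = np.random.default_rng(42)
n, p = 200, 10
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
y = 2.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5 * rng.standard_normal(n)

# Stand-in generative model: a joint Gaussian fitted to (X, y).
# The paper's method would use a fitted diffusion model here instead.
Z = np.column_stack([X, y])
mu, S = Z.mean(axis=0), np.cov(Z.T)
def gaussian_sampler(r):
    Zs = r.multivariate_normal(mu, S, size=n)
    return Zs[:, :-1], Zs[:, -1]

selected, scores = resample_aggregate(gaussian_sampler, p)
```

Only variables whose inclusion frequency across replicas clears the threshold `tau` enter the final model, which is what stabilizes selection relative to a single lasso fit on the original, correlated data.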