🤖 AI Summary
In high-dimensional clustered data, when covariates are heterogeneously distributed across clusters, marginal-model LASSO may use them as sparse proxies for latent cluster effects. This shifts the estimation target away from the structural fixed effects and produces false selections. This work proposes Synthetic Heterogeneous-Effects LASSO (SHEL), a fixed-effects penalized regression framework that adds cluster-level synthetic approximations of the latent heterogeneity in order to model it explicitly and correct the resulting bias. The paper establishes theoretical properties of SHEL in high-dimensional settings, develops procedures for valid post-selection inference, and examines finite-sample performance in simulation studies. The method is applied to longitudinal bulk RNA-seq data from enriched blood neutrophils of hospitalized COVID-19 patients.
📝 Abstract
This paper studies variable selection and post-selection inference for high-dimensional clustered data using marginal-model-based procedures. We show that, when covariates are heterogeneously distributed across clusters, marginal-model LASSO may use them as sparse proxies for latent cluster effects, shifting the estimation target away from the structural fixed effects and inducing false selections. To address this problem, we propose Synthetic Heterogeneous-Effects LASSO (SHEL), a fixed-effects penalized framework that incorporates cluster-level synthetic approximations to the latent heterogeneity. We establish theoretical properties of SHEL in high-dimensional settings and develop procedures for valid post-selection inference. The finite-sample performance of the proposed method is investigated through extensive simulation studies. A longitudinal bulk RNA-seq dataset of enriched blood neutrophils from hospitalized COVID-19 patients is analyzed to demonstrate the method in a real application.
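The proxy mechanism described above can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the authors' SHEL procedure: it assumes a linear model with a latent cluster intercept, ties the cluster mean of one irrelevant covariate to that intercept, and compares a pooled (marginal) LASSO with a LASSO fitted after the classical within-cluster demeaning. The data-generating design, the penalty level `lam = 0.1`, and the helper names `lasso_cd` and `demean_within` are all hypothetical choices made for illustration.

```python
import numpy as np


def soft_threshold(z, t):
    """Soft-thresholding operator used in the LASSO coordinate update."""
    return np.sign(z) * max(abs(z) - t, 0.0)


def lasso_cd(X, y, lam, n_iter=100):
    """Minimize (1/2n)||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n   # (1/n) * ||x_j||^2 for each column
    resid = y.copy()                    # residual y - X beta (beta starts at 0)
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of column j with the partial residual (column j added back)
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            resid += X[:, j] * (beta[j] - new)
            beta[j] = new
    return beta


def demean_within(A, G, m):
    """Subtract cluster means; assumes G balanced clusters of size m stored contiguously."""
    means = A.reshape(G, m, -1).mean(axis=1)
    return (A.reshape(G * m, -1) - np.repeat(means, m, axis=0)).reshape(A.shape)


# Assumed design: 50 clusters of size 20, 30 covariates, one true signal.
rng = np.random.default_rng(0)
G, m, p = 50, 20, 30
cluster = np.repeat(np.arange(G), m)
b = rng.normal(size=G)                  # latent cluster effects

X = rng.normal(size=(G * m, p))
X[:, 0] += b[cluster]                   # covariate 0: cluster mean tracks the latent effect

beta_true = np.zeros(p)
beta_true[1] = 1.0                      # only covariate 1 has a structural effect
y = X @ beta_true + b[cluster] + rng.normal(size=G * m)

lam = 0.1

# Pooled (marginal) LASSO: ignores clustering, so covariate 0 can stand in for b.
Xc = X - X.mean(axis=0)
yc = y - y.mean()
beta_marginal = lasso_cd(Xc, yc, lam)

# LASSO after within-cluster demeaning: the latent cluster effect is removed first.
Xw = demean_within(X, G, m)
yw = demean_within(y, G, m)
beta_within = lasso_cd(Xw, yw, lam)

print("covariate 0 (no structural effect): marginal =", beta_marginal[0],
      " within =", beta_within[0])
print("covariate 1 (true signal):          marginal =", beta_marginal[1],
      " within =", beta_within[1])
```

Under this assumed design, the pooled fit tends to assign a clearly nonzero coefficient to covariate 0 even though its structural effect is zero, while the within-cluster fit removes it and keeps covariate 1. The within transformation is used here only as a familiar stand-in. The abstract describes SHEL as instead incorporating cluster-level synthetic approximations of the latent heterogeneity into a penalized fixed-effects fit.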