🤖 AI Summary
This study addresses the lack of a statistical framework for ultra-high-dimensional DNA methylation data that simultaneously provides supervised structure discovery, interpretability, and adaptive selection of the latent dimension. The authors propose SOLAR (Supervised Orthogonal Low-rank Adaptive Regression), a supervised low-rank latent-factor model designed to identify CpG-level methylation structure associated with residualized DNAm (epigenetic) age. SOLAR combines orthogonal low-rank regression with a penalized maximum a posteriori formulation, and uses a dimension-adaptive BIC-type penalty coupled with trans-dimensional simulated annealing to select the latent rank automatically. The method comes with theoretical guarantees of identifiability, fixed-rank recovery, and rank-selection consistency under suitable regularity conditions. In simulations, SOLAR recovers the latent rank stably and scales to $p=10^7$ features. Applied to EPIC-array data from the GUSTO birth cohort ($n=1051$ methylation profiles, approximately 860,000 CpGs per sample), it uncovers heterogeneous methylation patterns associated with residualized DNAm age in infancy and early childhood, along with biologically coherent CpG signatures.
📝 Abstract
Ultra-high-dimensional array-based CpG methylation studies require statistical frameworks that simultaneously provide supervised structure discovery, interpretability, scalable latent-dimension identification, and computational feasibility. We propose SOLAR (Supervised Orthogonal Low-rank Adaptive Regression), a supervised low-rank latent-factor framework for identifying CpG-level methylation structure associated with residualized DNAm age. SOLAR combines orthogonal low-rank regression with a penalized maximum a posteriori formulation, dimension-adaptive BIC-type penalization, and a trans-dimensional simulated-annealing strategy for automatic latent-rank selection. We establish theoretical guarantees, including identifiability, fixed-rank recovery, and rank-selection consistency, under suitable regularity conditions. The framework additionally incorporates compute- and memory-efficient optimization strategies that scale to $p=10^7$, while analyses at $p=10^6$ remain feasible on standard desktop computing environments. Simulation studies demonstrate stable rank recovery, competitive supervised signal recovery, and strong scalability across moderate-, high-, and ultra-high-dimensional regimes. Using longitudinal EPIC-array CpG methylation data from the GUSTO birth cohort, comprising $n=1051$ methylation profiles collected across infancy and early childhood with approximately 860,000 assayed CpGs per sample, SOLAR identifies heterogeneous supervised methylation structure associated with residualized DNAm age beyond chronological age alone, together with biologically coherent CpG signatures and enrichment patterns.
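To make the idea of BIC-type latent-rank selection with orthogonal loadings concrete, here is a minimal Python sketch on simulated data. It is not the SOLAR estimator: it swaps in plain principal-component regression (orthonormal loadings from an SVD, then least squares on the scores) and the textbook criterion $n\log(\mathrm{RSS}/n) + k\log n$, with an exhaustive scan over ranks instead of trans-dimensional simulated annealing. The function name `select_rank_bic`, the simulated dimensions, and all parameter values are assumptions made purely for illustration.

```python
import numpy as np


def select_rank_bic(X, y, max_rank):
    """Scan ranks 1..max_rank and return the BIC-minimizing rank.

    Illustrative stand-in only: principal-component regression with a
    textbook BIC, not the penalized MAP objective described in the abstract.
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)          # center features
    yc = y - y.mean()                # center response
    # Thin SVD: the rows of Vt are orthonormal loading directions.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    bic = {}
    for k in range(1, max_rank + 1):
        W = Vt[:k].T                 # p x k loadings with W.T @ W = I_k
        Z = Xc @ W                   # n x k latent scores
        coef, *_ = np.linalg.lstsq(Z, yc, rcond=None)
        rss = float(np.sum((yc - Z @ coef) ** 2))
        bic[k] = n * np.log(rss / n) + k * np.log(n)
    best = min(bic, key=bic.get)
    return best, bic, Vt[:best].T


# Hypothetical toy data: r = 3 latent factors drive both X and y.
rng = np.random.default_rng(0)
n, p, r = 200, 500, 3
W_true, _ = np.linalg.qr(rng.standard_normal((p, r)))      # orthonormal p x r
Z_true = rng.standard_normal((n, r)) * np.array([12.0, 9.0, 6.0])
X = Z_true @ W_true.T + rng.standard_normal((n, p))        # features plus noise
y = Z_true @ np.array([0.3, -0.2, 0.2]) + rng.standard_normal(n)

best_k, bic, W_hat = select_rank_bic(X, y, max_rank=8)
print("selected rank:", best_k)
print("loadings orthonormal:", np.allclose(W_hat.T @ W_hat, np.eye(best_k)))
```

An exhaustive scan is affordable here because the candidate ranks are few and a single SVD is reused for every rank. At the dimensions quoted in the abstract a full SVD per dataset may not be practical, which is presumably where the memory-efficient optimization and the annealing-based rank search come in.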