Supervised Low-Rank Structure Discovery for Developmental Epigenetic Aging in Ultra-High-Dimensional DNA Methylation Data

📅 2026-06-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the lack of a statistical framework for ultra-high-dimensional DNA methylation data that simultaneously provides supervised structure discovery, interpretability, and adaptive selection of the latent dimension. The authors propose SOLAR (Supervised Orthogonal Low-rank Adaptive Regression), a supervised low-rank latent-factor model designed to identify CpG-level methylation structure associated with residualized DNAm age. SOLAR combines orthogonal low-rank regression with a penalized maximum a posteriori formulation and uses a dimension-adaptive BIC-type penalty together with trans-dimensional simulated annealing to select the latent rank automatically. Under suitable regularity conditions, the method comes with theoretical guarantees of identifiability, fixed-rank recovery, and rank-selection consistency. In simulations, SOLAR recovers the latent rank stably and scales up to $p=10^7$; applied to longitudinal EPIC-array data from the GUSTO birth cohort ($n=1051$ profiles, approximately 860,000 CpGs per sample), it uncovers heterogeneous methylation structure associated with residualized DNAm age, along with biologically coherent CpG signatures and enrichment patterns.
📝 Abstract
Ultra-high-dimensional array-based CpG methylation studies require statistical frameworks that simultaneously provide supervised structure discovery, interpretability, scalable latent-dimension identification, and computational feasibility. We propose SOLAR (Supervised Orthogonal Low-rank Adaptive Regression), a supervised low-rank latent-factor framework for identifying CpG-level methylation structure associated with residualized DNAm age. SOLAR combines orthogonal low-rank regression with a penalized maximum a posteriori formulation, dimension-adaptive BIC-type penalization, and a trans-dimensional simulated-annealing strategy for automatic latent-rank selection, together with theoretical guarantees including identifiability, fixed-rank recovery, and rank-selection consistency under suitable regularity conditions. The framework additionally incorporates computationally and memory-efficient optimization strategies demonstrating scalability up to $p=10^7$, while analyses at $p=10^6$ remain feasible on standard desktop computing environments. Simulation studies demonstrate stable rank recovery, competitive supervised signal recovery, and strong scalability across moderate-, high-, and ultra-high-dimensional regimes. Using longitudinal EPIC-array CpG methylation data from the GUSTO birth cohort, comprising $n=1051$ methylation profiles collected across infancy and early childhood with approximately 860,000 assayed CpGs per sample, SOLAR identifies heterogeneous supervised methylation structure associated with residualized DNAm age beyond chronological age alone, together with biologically coherent CpG signatures and enrichment patterns.
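As a rough illustration of how a BIC-type penalty and an annealing search can work together to pick a latent rank, the sketch below runs a toy simulated-annealing walk over candidate ranks. It is not SOLAR's algorithm. The residual sums of squares in `rss_by_rank`, the sample size `n`, the per-factor parameter count `params_per_factor`, the generic score `n * log(RSS / n) + k * log(n)`, and every function name are assumptions made for this example. The search also moves only over an integer rank with precomputed fits, whereas a trans-dimensional method would update model parameters as the dimension changes.

```python
import math
import random


def bic_type_score(rss, n, rank, params_per_factor):
    """Generic BIC-type score: goodness of fit plus a penalty growing with rank."""
    return n * math.log(rss / n) + rank * params_per_factor * math.log(n)


def anneal_rank(scores, start, n_iter=2000, t0=50.0, cooling=0.995, seed=0):
    """Toy simulated annealing over integer ranks; returns the best rank visited."""
    rng = random.Random(seed)
    current = best = start
    temp = t0
    for _ in range(n_iter):
        # Propose a neighbouring rank (one factor fewer or one more).
        proposal = current + rng.choice((-1, 1))
        if proposal in scores:
            delta = scores[proposal] - scores[current]
            # Always accept improvements; accept worse moves with
            # Metropolis probability exp(-delta / temp).
            if delta <= 0 or rng.random() < math.exp(-delta / temp):
                current = proposal
                if scores[current] < scores[best]:
                    best = current
        temp *= cooling  # geometric cooling schedule
    return best


# Hypothetical inputs: residual sum of squares of a fitted model at each rank.
n = 200
params_per_factor = 10
rss_by_rank = {1: 900.0, 2: 520.0, 3: 300.0, 4: 295.0, 5: 292.0, 6: 290.0}

scores = {
    r: bic_type_score(rss, n, r, params_per_factor)
    for r, rss in rss_by_rank.items()
}
best_rank = anneal_rank(scores, start=1)
print(best_rank)
```

With these made-up numbers the fit improves sharply up to rank 3 and only marginally afterwards, so the penalized score is lowest at rank 3. In a realistic setting each score would come from refitting the model at the proposed rank, which is where an annealing search can save work compared with fitting every candidate rank.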
Problem

Research questions and friction points this paper is trying to address.

supervised low-rank structure
epigenetic aging
ultra-high-dimensional data
DNA methylation
latent-factor discovery
Innovation

Methods, ideas, or system contributions that make the work stand out.

supervised low-rank regression
epigenetic aging
ultra-high-dimensional data
automatic rank selection
DNA methylation
Authors

Priyam Das
Department of Biostatistics, Virginia Commonwealth University
Jiyeon Song
Department of Biostatistics, University of Michigan
Lathika Mohanraj
Department of Adult Health and Nursing Systems, Virginia Commonwealth University
Karolina A. Aberg
Center for Biomarker Research and Precision Medicine, Virginia Commonwealth University
Yi Li
Professor of Biostatistics, University of Michigan, Ann Arbor
Survival Analysis, Statistics, Biostatistics
Subharup Guha
Professor, Department of Biomedical Data Science, Dartmouth College
Causal inference, Data integration, High-dimensional inference, Bayesian methods, Precision oncology