Semi-Parametric Bayesian Additive Regression Trees for Risk Prediction with High-Dimensional Epigenetic Signatures and Low-Dimensional Covariates

📅 2026-05-19
📈 Citations: 0
Influential: 0

🤖 AI Summary
This study addresses the challenge of jointly modeling high-dimensional epigenetic data and low-dimensional covariates while balancing predictive accuracy against the interpretability of covariate effects. Standard Bayesian additive regression trees (BART) treat all predictors uniformly within the tree ensemble, which obscures covariate contributions and complicates variable selection in high dimensions. The authors propose a semi-parametric BART (spBART) framework that models low-dimensional covariates through a parametric component with interpretable coefficients and models high-dimensional epigenetic features flexibly through a tree ensemble, thereby unifying prediction and inference. For variable selection, a cross-validation-based procedure aggregates posterior inclusion probabilities across folds and applies Bayesian false discovery rate control to identify relevant high-dimensional features stably. Applied to genome-wide 5-hydroxymethylcytosine (5hmC) profiles from circulating cell-free DNA in a pooled case-control analysis of two multiple myeloma studies ($N = 869$), spBART yields a parsimonious set of candidate loci and achieves an AUC of 0.96 on a held-out validation set.
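One way to write down the semi-parametric structure described here is sketched below. The probit link, the number of trees $m$, and the symbols $x_i$ (low-dimensional covariates), $z_i$ (high-dimensional epigenetic features), $\beta$, $T_j$, and $M_j$ are assumptions borrowed from common BART notation for binary outcomes; the summary does not state the exact link function or notation used by the authors.

$$
\Pr(Y_i = 1 \mid x_i, z_i) = \Phi\left( x_i^\top \beta + \sum_{j=1}^{m} g(z_i; T_j, M_j) \right)
$$

Under this reading, $\beta$ carries the interpretable covariate effects, while each $g(\cdot\,; T_j, M_j)$ is a regression tree with structure $T_j$ and leaf values $M_j$ that acts only on the high-dimensional features.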
📝 Abstract
In the era of precision medicine, genome-wide epigenetic modifications offer rich data that could inform risk prediction. However, these data are high-dimensional and exhibit complex dependence structures, which makes it difficult to jointly model them with low-dimensional covariates when the goal is to obtain interpretable effect estimates for covariate adjustment. Standard Bayesian additive regression trees (BART) provide strong predictive performance but treat all predictors uniformly within the tree ensemble, obscuring the contributions of significant covariates and complicating variable selection in high-dimensional settings. We propose a semi-parametric BART model (spBART) that addresses this limitation by modeling low-dimensional covariates through a parametric component with interpretable coefficients, while capturing complex nonlinear associations among high-dimensional predictors through the tree ensemble. To perform stable variable selection, we develop a cross-validation-based procedure that aggregates posterior inclusion probabilities across folds and applies Bayesian false discovery rate control. We apply the proposed method to a pooled case-control analysis of high-dimensional genome-wide 5-hydroxymethylcytosine profiles derived from circulating cell-free DNA in two multiple myeloma studies ($N = 869$). The approach identifies a parsimonious set of candidate loci and achieves strong out-of-sample discrimination (AUC $= 0.96$) in a held-out validation set. Overall, spBART provides a unified framework for combining interpretable covariate inference with flexible modeling and variable selection in high-dimensional biomedical studies.
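The variable selection step described in the abstract can be illustrated with a minimal sketch. It assumes that per-fold posterior inclusion probabilities (PIPs) are aggregated by a simple mean and that Bayesian FDR is estimated as the average of $1 - \text{PIP}$ over the selected set, which is a common formulation; the abstract does not specify either detail. The function names `aggregate_pips` and `bayesian_fdr_select`, the threshold `alpha`, and the toy PIP values are hypothetical and not taken from the paper.

```python
import numpy as np


def aggregate_pips(fold_pips):
    """Average per-fold PIPs (rows = folds, columns = features).

    Assumption: a plain mean across folds; the paper may aggregate differently.
    """
    return np.asarray(fold_pips, dtype=float).mean(axis=0)


def bayesian_fdr_select(pips, alpha=0.05):
    """Return sorted indices of the largest feature set whose estimated
    Bayesian FDR (mean of 1 - PIP over the set) does not exceed alpha."""
    pips = np.asarray(pips, dtype=float)
    # Rank features from most to least likely to be included.
    order = np.argsort(-pips, kind="stable")
    local_fdr = 1.0 - pips[order]
    # Estimated FDR if the top-k features are selected, for k = 1..p.
    running_fdr = np.cumsum(local_fdr) / np.arange(1, pips.size + 1)
    passing = np.nonzero(running_fdr <= alpha)[0]
    if passing.size == 0:
        return np.array([], dtype=int)
    return np.sort(order[: passing[-1] + 1])


# Toy PIPs for 4 features across 3 folds (made-up numbers for illustration).
fold_pips = [
    [0.99, 0.10, 0.95, 0.40],
    [0.97, 0.20, 0.99, 0.50],
    [0.98, 0.00, 0.97, 0.60],
]
mean_pips = aggregate_pips(fold_pips)
selected = bayesian_fdr_select(mean_pips, alpha=0.05)
print(selected.tolist())
```

Because the features are ranked by decreasing PIP, the running FDR estimate is non-decreasing in the set size, so the last set size that passes the threshold is also the largest admissible one.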
Problem

Research questions and friction points this paper is trying to address.

risk prediction
high-dimensional data
epigenetic signatures
covariate adjustment
variable selection
Innovation

Methods, ideas, or system contributions that make the work stand out.

semi-parametric BART
high-dimensional epigenetics
interpretable covariate modeling
Bayesian variable selection
cross-validation aggregation