🤖 AI Summary
Functional annotation and clinical interpretation of noncoding short variants remain key bottlenecks in genomic medicine, because existing predictors are limited in both accuracy and interpretability. To address this, the authors present Deep VRegulome, a multi-model DNABERT framework: 700 DNABERT models are fine-tuned on large-scale ENCODE regulatory data and combined with variant effect scoring, motif analysis, attention-based visualization, and survival analysis, so that splicing-regulatory and transcription factor binding site (TFBS) mutations can be prioritized in an interpretable way. Applied to TCGA glioblastoma whole-genome sequencing data, the method identified 572 splice-disrupting and 9,837 TFBS-altering mutations occurring in more than 10% of samples. Survival analysis linked 1,352 mutations and 563 disrupted regulatory regions to patient outcomes, which enabled patient stratification based on noncoding mutation signatures. The approach aims to make functional interpretation of noncoding variants more accurate and more clinically translatable.
📝 Abstract
Whole-genome sequencing (WGS) has revealed numerous non-coding short variants whose functional impacts remain poorly understood. Despite recent advances in deep-learning genomic approaches, accurately predicting and prioritizing clinically relevant mutations in gene regulatory regions remains a major challenge. Here we introduce Deep VRegulome, a deep-learning method for the prediction and interpretation of functionally disruptive variants in the human regulome. It combines 700 fine-tuned DNABERT models, trained on vast amounts of ENCODE gene regulatory regions, with variant scoring, motif analysis, attention-based visualization, and survival analysis. We showcase its application to the TCGA glioblastoma WGS dataset for prioritizing survival-associated mutations and regulatory regions. The analysis identified 572 splice-disrupting and 9,837 transcription factor binding site-altering mutations occurring in more than 10% of glioblastoma samples. Survival analysis linked 1,352 mutations and 563 disrupted regulatory regions to patient outcomes, enabling stratification via non-coding mutation signatures. All code, fine-tuned models, and an interactive data portal are publicly available.
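
The two filtering ideas mentioned in the abstract, scoring a variant by how it changes a model's prediction and keeping only mutations seen in more than 10% of samples, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the abstract does not give the scoring formula, so the score is taken to be a plain difference in predicted probability between the alternate and reference sequence, and `toy_predict`, `variant_effect`, and `recurrent_variants` are hypothetical names, not part of the Deep VRegulome code. A real fine-tuned DNABERT classifier would take the place of `toy_predict`.

```python
from typing import Callable, Dict, List, Set


def toy_predict(seq: str) -> float:
    """Hypothetical stand-in for a fine-tuned classifier.

    Returns 1.0 if the sequence contains a 'GT' dinucleotide, else 0.0.
    This is NOT the published model; it only makes the sketch runnable.
    """
    return 1.0 if "GT" in seq else 0.0


def variant_effect(predict: Callable[[str], float],
                   ref_seq: str, alt_seq: str) -> float:
    """Assumed score: predicted probability for alt minus that for ref.

    Negative values mean the variant lowers the predicted regulatory signal.
    """
    return predict(alt_seq) - predict(ref_seq)


def recurrent_variants(calls: Dict[str, Set[str]], n_samples: int,
                       min_fraction: float = 0.10) -> List[str]:
    """Keep variants present in strictly more than min_fraction of samples.

    calls maps a variant identifier to the set of sample IDs carrying it.
    """
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    return sorted(v for v, samples in calls.items()
                  if len(samples) / n_samples > min_fraction)


if __name__ == "__main__":
    # Made-up sequences and sample IDs, for illustration only.
    print(variant_effect(toy_predict, "AAGGTAAG", "AAGATAAG"))
    calls = {"chr1:100:A>G": {"s1", "s2", "s3"}, "chr2:200:C>T": {"s1"}}
    print(recurrent_variants(calls, n_samples=20))
```

The recurrence threshold is exposed as a parameter so the 10% cut-off from the abstract can be varied; the strict inequality mirrors the "more than 10%" wording.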