HEMERA: A Human-Explainable Transformer Model for Estimating Lung Cancer Risk using GWAS Data

📅 2025-10-08
🤖 AI Summary
This study addresses genetic risk prediction for lung cancer, motivated in part by the disease's occurrence in never-smokers. The authors propose an end-to-end interpretable Transformer model that learns risk signals directly from raw GWAS genotype data, without requiring clinical covariates. Methodologically, the model pairs a neural genotype embedding with additive positional encoding to accommodate high-dimensional, sparse SNP data; uses Layer-wise Integrated Gradients for SNP-level attribution to improve biological interpretability; and applies a functionally informed SNP filtering strategy to refine the input representation. Trained on 27,254 Million Veteran Program participants, the model achieves an AUC of 99.1%. The top attributed SNPs colocalize with established lung cancer susceptibility loci (e.g., *CHRNA3*, *TERT*), which supports biological plausibility. The approach supports personalized lung cancer risk assessment and hypothesis generation about disease mechanisms.

📝 Abstract
Lung cancer (LC) is the third most common cancer and the leading cause of cancer deaths in the US. Although smoking is the primary risk factor, the occurrence of LC in never-smokers and familial aggregation studies highlight a genetic component. Genetic biomarkers identified through genome-wide association studies (GWAS) are promising tools for assessing LC risk. We introduce HEMERA (Human-Explainable Transformer Model for Estimating Lung Cancer Risk using GWAS Data), a new framework that applies explainable transformer-based deep learning to GWAS data of single nucleotide polymorphisms (SNPs) for predicting LC risk. Unlike prior approaches, HEMERA directly processes raw genotype data without clinical covariates, introducing additive positional encodings, neural genotype embeddings, and refined variant filtering. A post hoc explainability module based on Layer-wise Integrated Gradients enables attribution of model predictions to specific SNPs, aligning strongly with known LC risk loci. Trained on data from 27,254 Million Veteran Program participants, HEMERA achieved an AUC (area under the receiver operating characteristic curve) above 99%. These findings support transparent, hypothesis-generating models for personalized LC risk assessment and early intervention.
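
For readers unfamiliar with the metric reported above, the AUC can be read as the probability that a randomly chosen case receives a higher risk score than a randomly chosen control. The following is a minimal sketch of that definition in plain Python. The function name `roc_auc` and the toy labels and scores are illustrative assumptions and are not taken from the paper.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (case, control) pairs ranked correctly.

    Ties between a case and a control count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data for illustration only (not from the paper).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

The pairwise loop is quadratic in the number of participants, so it is suited only to small examples; rank-based implementations compute the same quantity more efficiently.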
Problem

Research questions and friction points this paper is trying to address.

Predicting lung cancer risk using genetic data from GWAS
Applying explainable transformer models to raw genotype data
Identifying specific genetic variants contributing to cancer risk
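
The third item above, attributing risk to specific variants, is handled in the paper with Layer-wise Integrated Gradients. The sketch below shows plain integrated gradients (not the layer-wise variant) on a toy logistic model, only to illustrate the idea: gradients are averaged along a straight path from a baseline input to the actual input, then scaled by the input difference. The model, weights, baseline, and genotype values are all hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def model(x, w, bias):
    """Toy logistic risk model over genotype dosages (hypothetical)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + bias)


def grad(x, w, bias):
    """Analytic gradient of the toy model output with respect to x."""
    p = model(x, w, bias)
    return [p * (1.0 - p) * wi for wi in w]


def integrated_gradients(x, baseline, w, bias, steps=200):
    """Midpoint-rule approximation of the integrated-gradients path integral."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b0 + alpha * (xi - b0) for xi, b0 in zip(x, baseline)]
        g = grad(point, w, bias)
        total = [t + gi for t, gi in zip(total, g)]
    return [(xi - b0) * t / steps for xi, b0, t in zip(x, baseline, total)]


# Hypothetical weights for three SNPs; the third has no effect in this toy model.
w = [1.5, -0.5, 0.0]
bias = -1.0
x = [2.0, 1.0, 2.0]          # alternate-allele counts for one individual
baseline = [0.0, 0.0, 0.0]   # all-reference genotype as the baseline (assumption)

attr = integrated_gradients(x, baseline, w, bias)
delta = model(x, w, bias) - model(baseline, w, bias)
print("attributions:", [round(a, 4) for a in attr])
print("sum of attributions:", round(sum(attr), 4), "output change:", round(delta, 4))
```

The attributions sum, up to numerical error, to the change in model output between baseline and input, which is the completeness property that makes per-SNP scores comparable.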
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer model processes raw genotype data directly
Uses additive positional encodings and neural embeddings
Implements post hoc explainability with integrated gradients
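
As a rough illustration of the first two items above, the sketch below maps each genotype code to a learned-style vector and adds a positional encoding element-wise, producing one token per SNP. Everything here is an assumption made for illustration: the genotype coding (0, 1, 2 alternate alleles, 3 for missing), the embedding size, the randomly initialized table, and the sinusoidal encoding over SNP index. The paper's actual embedding and encoding design may differ.

```python
import math
import random


def sinusoidal_encoding(position, dim):
    """Standard sinusoidal encoding for one position (assumed scheme)."""
    enc = []
    for i in range(dim):
        angle = position / (10000 ** (2 * (i // 2) / dim))
        enc.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return enc


def embed_genotypes(genotypes, table):
    """Look up a vector per genotype code and add the positional encoding."""
    dim = len(table[0])
    tokens = []
    for pos, g in enumerate(genotypes):
        pe = sinusoidal_encoding(pos, dim)
        tokens.append([e + p for e, p in zip(table[g], pe)])
    return tokens


random.seed(0)
DIM = 4
# One row per genotype code; in a real model these values would be trained.
table = [[random.gauss(0.0, 0.1) for _ in range(DIM)] for _ in range(4)]

genotypes = [0, 2, 1, 3, 0]  # toy sequence of five SNPs
tokens = embed_genotypes(genotypes, table)
for pos, tok in enumerate(tokens):
    print(pos, [round(v, 3) for v in tok])
```

Because the encoding is added rather than concatenated, the token width stays equal to the embedding width, and two SNPs with the same genotype still receive different tokens at different positions.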
Maria Mahbub
Oak Ridge National Laboratory, Oak Ridge, TN, USA
Robert J. Klein
Department of Genetics and Genomics, and Department of AI and Human Health, Icahn School of Medicine at Mount Sinai, New York, NY, USA
Myvizhi Esai Selvan
Department of Genetics and Genomics, and Department of AI and Human Health, Icahn School of Medicine at Mount Sinai, New York, NY, USA
Rowena Yip
Department of Radiology, Icahn School of Medicine at Mount Sinai, New York, NY, USA
Claudia Henschke
Department of Radiology, Icahn School of Medicine at Mount Sinai, New York, NY, USA
Providencia Morales
Phoenix Veteran Affairs Health Care System, Phoenix, AZ, USA
Ian Goethert
Oak Ridge National Laboratory, Oak Ridge, TN, USA
Olivera Kotevska
Oak Ridge National Laboratory, Oak Ridge, TN, USA
Mayanka Chandra Shekar
Oak Ridge National Laboratory, Oak Ridge, TN, USA
Sean R. Wilkinson
Research Scientist, Oak Ridge National Laboratory
Bioinformatics, Data Science, High Performance Computing, FAIR, Workflows
Eileen McAllister
Oak Ridge National Laboratory, Oak Ridge, TN, USA
Samuel M. Aguayo
Phoenix Veteran Affairs Health Care System, Phoenix, AZ, USA
Zeynep H. Gümüş
Department of Genetics and Genomics, and Department of AI and Human Health, Icahn School of Medicine at Mount Sinai, New York, NY, USA
Ioana Danciu
Oak Ridge National Laboratory, Oak Ridge, TN, USA
VA Million Veteran Program