🤖 AI Summary
This study addresses the challenge of genetic risk prediction for lung cancer in never-smokers. We propose an end-to-end interpretable Transformer model that learns risk signals directly from raw GWAS genotype data, without requiring clinical covariates. Methodologically, we design a neural genotype embedding coupled with additive positional encoding to accommodate high-dimensional, sparse SNP data; introduce Layer-wise Integrated Gradients for SNP-level attribution to enhance biological interpretability; and apply a functionally informed SNP filtering strategy to refine the input representation. Trained on data from 27,254 Million Veteran Program participants, the model achieves an AUC of 99.1%. The top attributed SNPs colocalize strongly with established lung cancer susceptibility loci (e.g., *CHRNA3*, *TERT*), which supports the biological plausibility of the attributions. The approach offers a basis for precision screening in never-smokers and for generating hypotheses about disease mechanisms.
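To make the "genotype embedding plus additive positional encoding" idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The genotype coding (0, 1, 2 minor-allele copies, with 3 for a missing call), the sinusoidal form of the positional encoding, and the function names `make_embedding_table`, `sinusoidal_position`, and `embed_genotypes` are all assumptions made for illustration. In a trained model the embedding table would be learned, whereas here it is filled with random values.

```python
import math
import random

# Assumed genotype coding: 0, 1, 2 minor-allele copies, 3 = missing call.
NUM_GENOTYPES = 4


def make_embedding_table(dim, seed=0):
    """Random stand-in for a learned genotype embedding table."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)]
            for _ in range(NUM_GENOTYPES)]


def sinusoidal_position(index, dim):
    """Standard sinusoidal encoding for one SNP position (assumed variant)."""
    vec = []
    for k in range(dim):
        angle = index / (10000 ** (2 * (k // 2) / dim))
        vec.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return vec


def embed_genotypes(genotypes, table):
    """Return one vector per SNP: genotype embedding + positional encoding."""
    dim = len(table[0])
    tokens = []
    for i, g in enumerate(genotypes):
        pos = sinusoidal_position(i, dim)
        # "Additive" means the two vectors are summed element-wise.
        tokens.append([e + p for e, p in zip(table[g], pos)])
    return tokens


table = make_embedding_table(dim=8)
tokens = embed_genotypes([0, 2, 1, 3, 0], table)
```

Each SNP becomes one token vector, so a sequence of five genotypes yields five vectors of length 8 that a Transformer encoder could consume.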
📝 Abstract
Lung cancer (LC) is the third most common cancer and the leading cause of cancer deaths in the US. Although smoking is the primary risk factor, the occurrence of LC in never-smokers and evidence from familial aggregation studies point to a genetic component. Genetic biomarkers identified through genome-wide association studies (GWAS) are promising tools for assessing LC risk. We introduce HEMERA (Human-Explainable Transformer Model for Estimating Lung Cancer Risk using GWAS Data), a new framework that applies explainable transformer-based deep learning to GWAS data of single nucleotide polymorphisms (SNPs) to predict LC risk. Unlike prior approaches, HEMERA processes raw genotype data directly, without clinical covariates, and introduces additive positional encodings, neural genotype embeddings, and refined variant filtering. A post hoc explainability module based on Layer-wise Integrated Gradients attributes model predictions to specific SNPs, and these attributions align strongly with known LC risk loci. Trained on data from 27,254 Million Veteran Program participants, HEMERA achieved an AUC (area under the receiver operating characteristic curve) above 99%. These findings support transparent, hypothesis-generating models for personalized LC risk assessment and early intervention.
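For readers unfamiliar with the reported metric, the sketch below shows one common way to compute AUC: the fraction of case/control pairs in which the case receives the higher score, with ties counted as half. The function name `roc_auc` and the toy labels and scores are illustrative assumptions, not data or code from the study.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random case outscores a random control."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy example: two controls (0) and two cases (1).
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 0.75
```

Here three of the four case/control pairs are ranked correctly, giving 0.75. An AUC above 0.99 means that almost every such pair is ranked correctly. The pairwise loop is quadratic in cohort size, so a rank-based formulation would be preferable at the scale of tens of thousands of participants.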