Parsimonious Subset Selection for Generalized Linear Models with Biomedical Applications

📅 2026-03-23
🤖 AI Summary
This study addresses the challenge of optimal subset selection in generalized linear models (GLMs) for high-dimensional biomedical data, where computational intractability often compromises accuracy, sparsity, and interpretability. The authors propose COMBSS-GLM, a method that reformulates discrete subset selection as a continuous optimization problem via a continuous Boolean relaxation. They develop an efficient Frank–Wolfe algorithm based on envelope gradients, which at each iteration fits only a single penalized GLM and traces a solution path across model sizes to yield sparse estimates. Theoretically, under specific curvature conditions, the relaxed objective is concave in the selection weights, guaranteeing that the global optimum lies at a binary vertex. Experiments demonstrate that COMBSS-GLM achieves more accurate variable selection and superior predictive performance than state-of-the-art penalized methods in logistic and multinomial regression tasks, successfully replicates known rice GWAS loci, and attains 100% test accuracy on the Khan SRBCT cancer dataset using only a small number of genes.
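To make the loop described above concrete, here is a minimal sketch of a Frank–Wolfe iteration on a Boolean relaxation for the logistic case. Everything here is an assumption for illustration: the ridge-penalized fit stands in for the paper's penalized GLM solver, the constraint set, step-size rule, and all function names are hypothetical, and this is not the authors' COMBSS-GLM implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_penalized_logistic(Xt, y, lam=1e-2, iters=200, lr=0.1):
    # Stand-in for the "single penalized GLM fit per iteration":
    # plain gradient descent on a ridge-penalized logistic loss.
    n, p = Xt.shape
    beta = np.zeros(p)
    for _ in range(iters):
        g = Xt.T @ (sigmoid(Xt @ beta) - y) / n + lam * beta
        beta -= lr * g
    return beta

def combss_fw_sketch(X, y, k, steps=30, lam=1e-2):
    """Hypothetical Frank-Wolfe loop over selection weights t in [0,1]^p."""
    n, p = X.shape
    t = np.full(p, k / p)  # start in the interior of the relaxed cube
    for s in range(steps):
        Xt = X * t  # reweight each column by its selection weight
        beta = fit_penalized_logistic(Xt, y, lam)
        # Envelope gradient: differentiate the loss in t at the fitted beta,
        # holding beta fixed (the envelope theorem justifies this).
        r = sigmoid(Xt @ beta) - y
        grad = (X.T @ r) * beta / n
        # Linear minimization oracle over {t : 0 <= t <= 1, sum(t) <= k}:
        # put weight 1 on the (at most) k most negative gradient coordinates.
        v = np.zeros(p)
        idx = np.argsort(grad)[:k]
        v[idx] = (grad[idx] < 0).astype(float)
        gamma = 2.0 / (s + 2.0)  # classic Frank-Wolfe step size
        t = (1 - gamma) * t + gamma * v
    return np.argsort(-t)[:k]  # indices of the k largest weights

```

Rounding the final weights to the k largest coordinates mimics how a concave relaxed objective pushes the optimum toward binary corners, as the summary's theory paragraph describes.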

📝 Abstract
High-dimensional biomedical studies require models that are simultaneously accurate, sparse, and interpretable, yet exact best subset selection for generalized linear models is computationally intractable. We develop a scalable method that combines a continuous Boolean relaxation of the subset problem with a Frank–Wolfe algorithm driven by envelope gradients. The resulting method, which we refer to as COMBSS-GLM, is simple to implement, requires one penalized generalized linear model fit per iteration, and produces sparse models along a model-size path. Theoretically, we identify a curvature-based parameter regime in which the relaxed objective is concave in the selection weights, implying that global minimizers occur at binary corners. Empirically, in logistic and multinomial simulations across low- and high-dimensional correlated settings, the proposed method consistently improves variable-selection quality relative to established penalized likelihood competitors while maintaining strong predictive performance. In biomedical applications, it recovers established loci in a binary-outcome rice genome-wide association study and achieves perfect multiclass test accuracy on the Khan SRBCT cancer dataset using a small subset of genes. Open-source implementations are available in R at https://github.com/benoit-liquet/COMBSS-GLM-R and in Python at https://github.com/saratmoka/COMBSS-GLM-Python.
Problem

Research questions and friction points this paper is trying to address.

best subset selection
generalized linear models
high-dimensional data
sparsity
biomedical applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

best subset selection
generalized linear models
Frank-Wolfe algorithm
Boolean relaxation
sparse modeling
Anant Mathur
School of Mathematics and Statistics, University of New South Wales, NSW, Australia
Benoit Liquet
School of Mathematical and Physical Sciences, Macquarie University, NSW, Australia; Laboratoire de Mathématiques et de leurs Applications, Université de Pau et des Pays de l’Adour, Pau, France
Samuel Muller
Executive Dean and Professor, Faculty of Science and Engineering, Macquarie University
Statistics, Model Selection, Variable Selection, Robustness, Resampling
Sarat Moka
School of Mathematics and Statistics, University of New South Wales, NSW, Australia