Stabilizing black-box model selection with the inflated argmax

📅 2024-10-23
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing model selection methods, such as the LASSO and SINDy, are unstable under minor data perturbations: in settings with highly correlated features, dynamical system identification, or graph-structured learning, removing a single training point can change which model is selected. To address this, the authors combine bagging-based resampling of sparse regression fits with an "inflated" argmax operation to construct a theoretically grounded, stable collection of candidate models. The method guarantees that, with high probability, the collection selected after deleting any single training point overlaps with the originally selected collection. Evaluated on highly correlated synthetic data, Lotka–Volterra system identification, and protein-signaling graph inference from proteomics data, it substantially improves selection stability while preserving predictive accuracy and model parsimony, outperforming a range of benchmark methods.

📝 Abstract
Model selection is the process of choosing from a class of candidate models given data. For instance, methods such as the LASSO and sparse identification of nonlinear dynamics (SINDy) formulate model selection as finding a sparse solution to a linear system of equations determined by training data. However, absent strong assumptions, such methods are highly unstable: if a single data point is removed from the training set, a different model may be selected. In this paper, we present a new approach to stabilizing model selection with theoretical stability guarantees that leverages a combination of bagging and an "inflated" argmax operation. Our method selects a small collection of models that all fit the data, and it is stable in that, with high probability, the removal of any training point will result in a collection of selected models that overlaps with the original collection. We illustrate this method in (a) a simulation in which strongly correlated covariates make standard LASSO model selection highly unstable, (b) a Lotka-Volterra model selection problem focused on identifying how competition in an ecosystem influences species' abundances, and (c) a graph subset selection problem using cell-signaling data from proteomics. In these settings, the proposed method yields stable, compact, and accurate collections of selected models, outperforming a variety of benchmarks.
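The abstract's recipe (bag a sparse selector over resamples, then keep every model whose selection frequency is within a tolerance of the best) can be sketched in a few lines. This is an illustrative reading of the idea, not the authors' code or their exact inflated-argmax construction: the sparse selector here is a SINDy-style sequentially thresholded least squares, and names such as `n_bags` and `epsilon` are assumed parameters.

```python
# Sketch of stabilized model selection via bagging + an "inflated" argmax.
# NOT the paper's implementation; a minimal illustration under assumptions
# noted in the comments (STLSQ as the base selector, epsilon as the slack).
from collections import Counter

import numpy as np


def stlsq_support(X, y, threshold=0.2, n_iter=10):
    """SINDy-style sequentially thresholded least squares; returns the
    selected support (indices of nonzero coefficients) as a tuple."""
    coef = np.linalg.lstsq(X, y, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(coef) < threshold
        coef[small] = 0.0
        if (~small).any():
            coef[~small] = np.linalg.lstsq(X[:, ~small], y, rcond=None)[0]
    return tuple(np.flatnonzero(np.abs(coef) >= threshold))


def bagged_supports(X, y, n_bags=50, seed=0):
    """Refit the base selector on bootstrap resamples and count how often
    each support (candidate model) is selected."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    counts = Counter()
    for _ in range(n_bags):
        idx = rng.integers(0, n, size=n)  # bootstrap resample
        counts[stlsq_support(X[idx], y[idx])] += 1
    return counts


def inflated_argmax(counts, epsilon=0.1):
    """Instead of returning only the single most frequent model (a plain
    argmax), return every model within epsilon of the top frequency."""
    total = sum(counts.values())
    best = max(counts.values()) / total
    return {s for s, c in counts.items() if c / total >= best - epsilon}


# Demo: two strongly correlated covariates, so different resamples may
# select either one; the inflated argmax keeps all near-tied models.
rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
X = np.column_stack([x1, x1 + 0.01 * rng.normal(size=200),
                     rng.normal(size=200)])
y = x1 + 0.1 * rng.normal(size=200)
selected = inflated_argmax(bagged_supports(X, y))
print(selected)
```

On a plain argmax, the strongly correlated pair of covariates makes the output flip between resamples; returning the near-tied set instead is what buys the overlap-after-deletion stability the abstract describes.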
Problem

Research questions and friction points this paper is trying to address.

Sparse model selection methods such as the LASSO and SINDy are highly unstable: removing a single training point can change which model is selected.
Stability guarantees for black-box model selection are lacking absent strong assumptions.
Selected model collections should remain consistent under the removal of any single training point.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines bagging with an "inflated" argmax operation.
Selects a small collection of models that all fit the data, rather than a single model.
Provides theoretical, high-probability guarantees of selection stability under single-point deletion.