🤖 AI Summary
This work addresses the problem of selecting an informative subset of $n$ samples from a dataset of size $N$ for parameter estimation when the full data exceed available computing resources or labeling is expensive. Building on optimal approximate design theory, the authors propose an information-based subdata selection methodology that applies to a general parametric model and supports multiple optimality criteria. Its algorithm accommodates arbitrary choices of $N$ and $n$, is proven to converge, and yields subdata that approaches the optimal solution. The methodology also provides tight lower and upper bounds on the efficiency of subdata selected by any method. The authors report that the resulting subdata is highly efficient and outperforms existing methods.
📝 Abstract
When the number of data points in a dataset exceeds available computing resources, or when labeling is expensive, an attractive solution is to select only some of the data points (subdata) for further consideration. A central question when selecting subdata of size $n$ from $N$ available data points is which $n$ points to select. While the answer depends on the objective, one approach for a parametric model with a focus on parameter estimation is to select subdata that retains maximal information. Identifying such subdata is a classical NP-hard problem due to its inherent discreteness. Based on optimal approximate design theory, we develop a new methodology for information-based subdata selection, resulting in subdata that approaches the optimal solution. To achieve this, we develop a novel algorithm that applies to a general model, accommodates arbitrary choices of $N$ and $n$, and supports multiple optimality criteria, and we prove its convergence. Moreover, the new methodology makes it possible to assess the efficiency of subdata selected by any method by providing tight lower and upper bounds for that efficiency. We show that the subdata obtained through the new methodology is highly efficient and outperforms all existing methods.
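To make "subdata that retains maximal information" concrete, the sketch below may help. It is not the algorithm proposed in the paper: it assumes a linear regression model, uses the D-criterion (log-determinant of the information matrix $X^\top X$) as the one optimality criterion, and swaps in a simple greedy heuristic as the selector. All function names (`greedy_d_selection`, `relative_d_efficiency`, and so on), the simulated data, and the small `ridge` regularizer are illustrative assumptions.

```python
import numpy as np


def information_matrix(X):
    """Information matrix X^T X of a linear model (up to the error variance)."""
    return X.T @ X


def log_det_information(X):
    """D-criterion: log-determinant of the information matrix."""
    sign, logdet = np.linalg.slogdet(information_matrix(X))
    return logdet if sign > 0 else -np.inf


def greedy_d_selection(X, n, ridge=1e-6):
    """Greedy heuristic (an assumption, not the paper's method): repeatedly add
    the unused point that most increases log det of the information matrix."""
    N, p = X.shape
    M_inv = np.eye(p) / ridge            # inverse of the starting matrix ridge * I
    remaining = np.ones(N, dtype=bool)
    chosen = []
    for _ in range(n):
        # Adding x raises log det by log(1 + x^T M^{-1} x), so rank by x^T M^{-1} x.
        scores = np.einsum("ij,jk,ik->i", X, M_inv, X)
        scores[~remaining] = -np.inf
        i = int(np.argmax(scores))
        chosen.append(i)
        remaining[i] = False
        # Sherman-Morrison update of the inverse after adding x x^T.
        x = X[i]
        Mx = M_inv @ x
        M_inv -= np.outer(Mx, Mx) / (1.0 + x @ Mx)
    return np.array(chosen)


def relative_d_efficiency(X_a, X_b):
    """D-efficiency of subdata a relative to subdata b: (det M_a / det M_b)^(1/p)."""
    p = X_a.shape[1]
    return float(np.exp((log_det_information(X_a) - log_det_information(X_b)) / p))


# Simulated full data: N points, intercept plus three covariates (illustrative only).
rng = np.random.default_rng(0)
N, n = 5000, 100
X = np.column_stack([np.ones(N), rng.normal(size=(N, 3))])

greedy_idx = greedy_d_selection(X, n)
random_idx = rng.choice(N, size=n, replace=False)

eff = relative_d_efficiency(X[random_idx], X[greedy_idx])
print(f"D-efficiency of random subdata relative to greedy subdata: {eff:.3f}")
```

The comparison here is only between two subsets. The abstract's lower and upper efficiency bounds instead compare a subset against the optimal solution, which this sketch does not compute.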