🤖 AI Summary
To address computational redundancy and high annotation costs in large-scale training, this paper proposes an influence-function-based data subset selection method—a systematic application of influence function theory to training set pruning. The method estimates each training sample's impact on the model parameters via influence functions, evaluated with logistic regression models, enabling principled ranking and selection of the most informative subset. On binary classification tasks, the selected subset achieves accuracy comparable to full-dataset training using only 10% of the original data; remarkably, with 60% of the data, it surpasses the full-training baseline. This approach substantially reduces computational overhead while preserving model performance, and it offers an interpretable, scalable paradigm for data-efficient training—grounded in theoretically justified influence estimation rather than heuristic sampling.
📝 Abstract
In the era of large-scale model training, the extensive use of available datasets has resulted in significant computational inefficiency. To tackle this issue, we explore methods for identifying informative subsets of training data that can achieve comparable or even superior model performance. We propose a technique based on influence functions to determine which samples should be included in the training set. We empirically evaluate our method on binary classification tasks using logistic regression models. Our approach achieves performance comparable to training on the entire dataset while using only 10% of the data; moreover, it reaches even higher accuracy when trained with just 60% of the data.
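The core idea can be sketched concretely. For a regularized logistic regression model, the influence of up-weighting a training sample involves the per-sample loss gradient and the inverse Hessian of the training objective. The sketch below ranks samples by self-influence (each sample's gradient scored against the inverse Hessian) and keeps the top 10%. This is a minimal illustration under assumed choices—the paper's exact influence formulation and selection rule may differ, and names such as `influence_scores` and `fit_logreg` are illustrative, not from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg(X, y, lam=1e-2, iters=300, lr=0.5):
    # Plain gradient descent on the L2-regularized mean log-loss.
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        p = sigmoid(X @ w)
        grad = X.T @ (p - y) / n + lam * w
        w -= lr * grad
    return w

def influence_scores(X, y, w, lam=1e-2):
    # Hessian of the regularized mean log-loss at the fitted parameters.
    n, d = X.shape
    p = sigmoid(X @ w)
    s = p * (1.0 - p)
    H = (X * s[:, None]).T @ X / n + lam * np.eye(d)
    H_inv = np.linalg.inv(H)
    # Per-sample gradient of the (unregularized) log-loss: (p_i - y_i) * x_i.
    grads = X * (p - y)[:, None]              # shape (n, d)
    # Self-influence g_i^T H^{-1} g_i; nonnegative since H is positive definite.
    return np.einsum('id,de,ie->i', grads, H_inv, grads)

# Synthetic binary classification data for demonstration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_w = rng.normal(size=5)
y = (X @ true_w > 0).astype(float)

w = fit_logreg(X, y)
scores = influence_scores(X, y, w)
keep = np.argsort(scores)[-int(0.1 * len(y)):]  # indices of the top-10% subset
```

A model retrained on `X[keep], y[keep]` would then be compared against the full-data baseline; with only self-influence as the criterion this is a heuristic proxy for the paper's selection procedure.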