🤖 AI Summary
To address computational redundancy and high annotation costs in large-scale training, this paper proposes an influence-function-based data subset selection method—a systematic application of influence function theory to training set pruning. The method estimates each training sample's impact on the model parameters via influence functions, evaluated with logistic regression models, enabling principled ranking and selection of the most informative subset. On binary classification tasks, the selected subset achieves accuracy comparable to full-dataset training using only 10% of the original data; remarkably, with 60% of the data, it surpasses the full-training baseline. This approach substantially reduces computational overhead while preserving model performance, and it offers an interpretable, scalable paradigm for data-efficient training—grounded in theoretically justified influence estimation rather than heuristic sampling.
📝 Abstract
In the era of large-scale model training, the extensive use of available datasets has resulted in significant computational inefficiency. To tackle this issue, we explore methods for identifying informative subsets of training data that can achieve comparable or even superior model performance. We propose a technique based on influence functions to determine which samples should be included in the training set. We empirically evaluate our method on binary classification tasks using logistic regression models. Our approach achieves performance comparable to training on the entire dataset while using only 10% of the data; moreover, it reaches even higher accuracy when trained with just 60% of the data.
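The core idea can be sketched concretely. For a regularized logistic regression model, the influence of up-weighting a training sample involves the per-sample loss gradient and the inverse Hessian of the training objective. The sketch below ranks samples by self-influence (each sample's gradient scored against the inverse Hessian) and keeps the top 10%. This is a minimal illustration under assumed choices—the paper's exact influence formulation and selection rule may differ, and names such as `influence_scores` and `fit_logreg` are illustrative, not from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg(X, y, lam=1e-2, iters=300, lr=0.5):
    # Plain gradient descent on the L2-regularized mean log-loss.
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        p = sigmoid(X @ w)
        grad = X.T @ (p - y) / n + lam * w
        w -= lr * grad
    return w

def influence_scores(X, y, w, lam=1e-2):
    # Hessian of the regularized mean log-loss at the fitted parameters.
    n, d = X.shape
    p = sigmoid(X @ w)
    s = p * (1.0 - p)
    H = (X * s[:, None]).T @ X / n + lam * np.eye(d)
    H_inv = np.linalg.inv(H)
    # Per-sample gradient of the (unregularized) log-loss: (p_i - y_i) * x_i.
    grads = X * (p - y)[:, None]              # shape (n, d)
    # Self-influence g_i^T H^{-1} g_i; nonnegative since H is positive definite.
    return np.einsum('id,de,ie->i', grads, H_inv, grads)

# Synthetic binary classification data for demonstration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_w = rng.normal(size=5)
y = (X @ true_w > 0).astype(float)

w = fit_logreg(X, y)
scores = influence_scores(X, y, w)
keep = np.argsort(scores)[-int(0.1 * len(y)):]  # indices of the top-10% subset
```

A model retrained on `X[keep], y[keep]` would then be compared against the full-data baseline; with only self-influence as the criterion this is a heuristic proxy for the paper's selection procedure.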