How to Achieve Higher Accuracy with Less Training Points?

📅 2025-04-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address computational redundancy and high annotation costs in large-scale data training, this paper proposes an influence-function-based data subset selection method—the first systematic application of influence function theory to efficient training set pruning. The method models each training sample’s impact on model parameters via logistic regression, enabling principled ranking and selection of the most representative subset. On binary classification tasks, the selected subset achieves full-training accuracy using only 10% of the original data; remarkably, with 60% of the data, it surpasses the full-training baseline in accuracy. This approach substantially reduces computational overhead while preserving model performance. Crucially, it offers a novel, interpretable, and scalable paradigm for small-sample efficient training—grounded in theoretically justified influence estimation rather than heuristic sampling.
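The summary above can be made concrete with a small sketch. The influence-function ranking below follows the standard first-order estimate (score each training point by how removing it would change validation loss, via the inverse Hessian), applied to L2-regularized logistic regression. The Newton solver, the validation split, and the "keep the most helpful 10%" heuristic are illustrative assumptions on our part, not the paper's exact procedure:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg(X, y, lam=1e-2, iters=50):
    """Fit L2-regularized logistic regression with Newton's method."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(iters):
        p = sigmoid(X @ theta)
        grad = X.T @ (p - y) / n + lam * theta
        S = p * (1 - p)                      # per-sample Hessian weights
        H = (X.T * S) @ X / n + lam * np.eye(d)
        theta -= np.linalg.solve(H, grad)
    return theta

def influence_scores(X, y, X_val, y_val, theta, lam=1e-2):
    """First-order influence of up-weighting each training point on
    validation loss: I_i = -g_val^T H^{-1} g_i. Negative scores mark
    points whose removal would *increase* validation loss (helpful)."""
    n, d = X.shape
    p = sigmoid(X @ theta)
    S = p * (1 - p)
    H = (X.T * S) @ X / n + lam * np.eye(d)
    G = (p - y)[:, None] * X                 # per-sample gradients g_i
    p_val = sigmoid(X_val @ theta)
    g_val = X_val.T @ (p_val - y_val) / len(y_val)
    v = np.linalg.solve(H, g_val)            # H^{-1} g_val
    return -G @ v                            # one score per training point

# Synthetic binary classification task (stand-in for the paper's data).
n, d = 500, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = (X @ w_true + 0.5 * rng.normal(size=n) > 0).astype(float)
X_tr, y_tr, X_val, y_val = X[:400], y[:400], X[400:], y[400:]

theta = fit_logreg(X_tr, y_tr)
scores = influence_scores(X_tr, y_tr, X_val, y_val, theta)

# Keep the 10% of points ranked most helpful (most negative score),
# then retrain on the selected subset only.
k = int(0.10 * len(y_tr))
keep = np.argsort(scores)[:k]
theta_sub = fit_logreg(X_tr[keep], y_tr[keep])
acc = ((sigmoid(X_val @ theta_sub) > 0.5) == y_val).mean()
```

A full Hessian solve is feasible here because logistic regression keeps `d` small; at scale, implicit Hessian-vector products or low-rank approximations would replace `np.linalg.solve`.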

📝 Abstract
In the era of large-scale model training, the extensive use of available datasets has resulted in significant computational inefficiencies. To tackle this issue, we explore methods for identifying informative subsets of training data that can achieve comparable or even superior model performance. We propose a technique based on influence functions to determine which training samples should be included in the training set. We conducted empirical evaluations of our method on binary classification tasks utilizing logistic regression models. Our approach demonstrates performance comparable to that of training on the entire dataset while using only 10% of the data. Furthermore, we found that our method achieved even higher accuracy when trained with just 60% of the data.
Problem

Research questions and friction points this paper is trying to address.

- Reduce computational inefficiency in large-scale model training
- Identify informative subsets for comparable model performance
- Use influence functions to select optimal training samples
Innovation

Methods, ideas, or system contributions that make the work stand out.

- Uses influence functions to rank and select training data
- Matches full-data accuracy using only 10% of the training set
- Surpasses the full-data baseline using 60% of the training set
👥 Authors

Jinghan Yang — HKU Musketeers Foundation Institute of Data Science, The University of Hong Kong
Anupam Pani — The University of Hong Kong
Yunchao Zhang — HKU Musketeers Foundation Institute of Data Science, The University of Hong Kong