Improving Model Classification by Optimizing the Training Dataset

📅 2025-07-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitation of conventional dataset optimization methods, which overly rely on loss approximation and neglect downstream classification metrics. We propose a systematic coreset construction framework explicitly designed to enhance classification performance. Methodologically, it integrates deterministic sampling, class-level importance weighting, and active learning–based refinement, optimizing directly for task-oriented metrics such as the F1 score—thereby overcoming the narrow focus of sensitivity-based sampling on gradient or loss approximation alone. Our key innovation lies in the joint design of importance sampling and tunable parameters to enable end-to-end optimization of training data quality. Extensive experiments across multiple benchmark datasets and classifiers demonstrate that our approach significantly outperforms standard coreset baselines and even full-data training, achieving superior classification accuracy while maintaining high training efficiency.
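The summary contrasts the proposed framework with the standard sensitivity-based baseline, in which each point is sampled with probability proportional to its sensitivity (often proxied by per-sample loss or gradient norm) and reweighted by the inverse probability so the weighted coreset loss is an unbiased estimate of the full-data loss. The paper's exact construction is not reproduced here; the sketch below only illustrates that conventional baseline, with the loss-as-sensitivity proxy being an assumption:

```python
import numpy as np

def sensitivity_coreset(losses, m, rng=None):
    """Baseline sensitivity-based importance sampling (not the paper's
    tuned method). `losses` is a per-example sensitivity proxy, shape (n,).
    Returns (indices, weights); weights unbias the weighted coreset loss."""
    rng = np.random.default_rng(rng)
    p = losses / losses.sum()            # sampling probability per point
    idx = rng.choice(len(losses), size=m, replace=True, p=p)
    w = 1.0 / (m * p[idx])               # inverse-probability weights
    return idx, w
```

Because the probabilities are proportional to the losses, each weighted term `w[i] * losses[idx[i]]` equals `losses.sum() / m`, so the weighted coreset loss matches the full-data loss exactly for this proxy; the paper's point is that such loss-approximation guarantees do not directly optimize classification metrics like F1.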

📝 Abstract
In the era of data-centric AI, the ability to curate high-quality training data is as crucial as model design. Coresets offer a principled approach to data reduction, enabling efficient learning on large datasets through importance sampling. However, conventional sensitivity-based coreset construction often falls short in optimizing for classification performance metrics, e.g., the $F_1$ score, focusing instead on loss approximation. In this work, we present a systematic framework for tuning the coreset generation process to enhance downstream classification quality. Our method introduces new tunable parameters (including deterministic sampling, class-wise allocation, and refinement via active sampling) beyond traditional sensitivity scores. Through extensive experiments on diverse datasets and classifiers, we demonstrate that tuned coresets can significantly outperform both vanilla coresets and full-dataset training on key classification metrics, offering an effective path towards better and more efficient model training.
Problem

Research questions and friction points this paper is trying to address.

Optimizing coresets to improve classification performance metrics
Enhancing model training efficiency via tuned data reduction
Addressing limitations of sensitivity-based coreset construction methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tunable coreset generation for classification enhancement
Deterministic and class-wise sampling strategies
Active sampling refinement beyond sensitivity scores
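The first two innovations can be illustrated with a small sketch: allocate the coreset budget across classes in proportion to class frequency, then select points deterministically (top scores rather than random draws) within each class. This is an illustrative reading of "deterministic and class-wise sampling", not the paper's actual algorithm, and the proportional allocation rule is an assumption:

```python
import numpy as np

def classwise_deterministic_coreset(scores, labels, m):
    """Illustrative sketch: split a budget of m points across classes
    proportionally to class size, then pick the top-scoring points in
    each class deterministically (no random sampling)."""
    classes, counts = np.unique(labels, return_counts=True)
    # proportional per-class budgets, at least one point per class
    alloc = np.maximum(1, np.round(m * counts / counts.sum()).astype(int))
    chosen = []
    for c, k in zip(classes, alloc):
        members = np.flatnonzero(labels == c)
        # deterministic: take the k highest-scoring members of class c
        top = members[np.argsort(scores[members])[::-1][:k]]
        chosen.append(top)
    return np.concatenate(chosen)
```

Guaranteeing at least one point per class keeps minority classes represented, which is one plausible reason class-wise allocation helps metrics like F1 more than global sensitivity sampling does.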