🤖 AI Summary
Conventional resampling methods for class-imbalanced classification suffer from inherent limitations: oversampling introduces noise and boundary ambiguity, while undersampling discards informative majority-class samples, leading to information loss and underfitting.
Method: This paper proposes an intelligent majority-class sample selection mechanism guided by model loss improvement. Its core innovation is a gradient-driven, differentiable bilevel optimization framework: the upper level maximizes generalization performance, while the lower level optimizes a differentiable loss-improvement metric, enabling end-to-end, deterministic undersampling.
Contribution/Results: By directly selecting discriminative majority-class instances, without synthesizing noisy minority samples, the method preserves data fidelity and decision-boundary clarity. Evaluated on multiple benchmark datasets, it achieves up to a 10% absolute improvement in F1 score over state-of-the-art methods, significantly enhancing minority-class detection while maintaining majority-class accuracy.
📝 Abstract
Data rebalancing techniques, including oversampling and undersampling, are a common approach to addressing the challenges of imbalanced data. To tackle unresolved problems with both, we propose a new undersampling approach that (i) avoids the noise and class overlap caused by synthetic data and (ii) avoids the underfitting caused by random undersampling. Instead of undersampling majority data randomly, our method undersamples datapoints based on their ability to improve model loss. Using improved model loss as a proxy for classification performance, our technique assesses each datapoint's impact on loss and rejects those unable to improve it. In so doing, our approach rejects majority datapoints that are redundant with datapoints already accepted and thereby finds an optimal subset of majority training data for classification. The accept/reject component of our algorithm is motivated by a bilevel optimization problem uniquely formulated to identify the optimal training set we seek. Experimental results show that our proposed technique achieves F1 scores up to 10% higher than state-of-the-art methods.
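The accept/reject idea can be illustrated with a minimal greedy sketch: each majority-class candidate is kept only if adding it to the current training subset lowers a held-out loss. This is an illustrative simplification, not the paper's bilevel formulation; the function name `undersample_by_loss`, the use of logistic regression, and the validation-loss criterion are all assumptions for the example.

```python
# Hedged sketch: greedy loss-improvement undersampling of the majority class.
# Assumes scikit-learn and NumPy; not the paper's exact bilevel algorithm.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def undersample_by_loss(X_maj, X_min, maj_label, min_label, X_val, y_val, seed=0):
    """Accept each majority point only if it improves validation loss."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X_maj))   # visit candidates in random order
    accepted = []                          # indices of kept majority points
    best_loss = np.inf
    for i in order:
        idx = accepted + [i]
        # Train on all minority points plus the tentatively accepted majority subset.
        X = np.vstack([X_min, X_maj[idx]])
        y = np.concatenate([np.full(len(X_min), min_label),
                            np.full(len(idx), maj_label)])
        clf = LogisticRegression().fit(X, y)
        loss = log_loss(y_val, clf.predict_proba(X_val),
                        labels=[maj_label, min_label])
        if loss < best_loss:               # accept: the candidate improved loss
            best_loss = loss
            accepted.append(i)             # otherwise reject as redundant
    return np.asarray(accepted)
```

In practice the paper's approach replaces this brute-force greedy loop with a differentiable bilevel objective, but the sketch captures the core proxy: a majority point earns its place only by improving the model's loss.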