🤖 AI Summary
Conventional resampling methods for class-imbalanced classification suffer from inherent limitations: oversampling introduces noise and boundary ambiguity, while undersampling discards informative majority-class samples, leading to information loss and underfitting.
Method: This paper proposes an intelligent majority-class sample selection mechanism guided by model loss improvement. Its core innovation is a gradient-driven, differentiable bilevel optimization framework: the upper level maximizes generalization performance, while the lower level optimizes a differentiable loss-improvement metric, enabling end-to-end, deterministic undersampling.
Contribution/Results: By directly selecting discriminative majority-class instances, without synthesizing noisy minority samples, the method preserves data fidelity and decision-boundary clarity. Evaluated on multiple benchmark datasets, it achieves up to a 10% absolute improvement in F1 score over state-of-the-art methods, significantly enhancing minority-class detection while maintaining majority-class accuracy.
📝 Abstract
Data rebalancing techniques, including oversampling and undersampling, are a common approach to addressing the challenges of imbalanced data. To tackle unresolved problems with both, we propose a new undersampling approach that (i) avoids the noise and class overlap caused by synthetic data and (ii) avoids the underfitting caused by random undersampling. Instead of undersampling majority data randomly, our method undersamples datapoints based on their ability to improve model loss. Using improved model loss as a proxy for classification performance, our technique assesses each datapoint's impact on loss and rejects those unable to improve it. In so doing, our approach rejects majority datapoints that are redundant with datapoints already accepted and thereby finds an optimal subset of majority training data for classification. The accept/reject component of our algorithm is motivated by a bilevel optimization problem uniquely formulated to identify the optimal training set we seek. Experimental results show that our proposed technique achieves F1 scores up to 10% higher than state-of-the-art methods.
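The accept/reject idea can be illustrated with a minimal greedy sketch: each majority-class candidate is kept only if adding it to the current training subset lowers a held-out loss. This is an illustrative simplification, not the paper's bilevel formulation; the function name `undersample_by_loss`, the use of logistic regression, and the validation-loss criterion are all assumptions for the example.

```python
# Hedged sketch: greedy loss-improvement undersampling of the majority class.
# Assumes scikit-learn and NumPy; not the paper's exact bilevel algorithm.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def undersample_by_loss(X_maj, X_min, maj_label, min_label, X_val, y_val, seed=0):
    """Accept each majority point only if it improves validation loss."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X_maj))   # visit candidates in random order
    accepted = []                          # indices of kept majority points
    best_loss = np.inf
    for i in order:
        idx = accepted + [i]
        # Train on all minority points plus the tentatively accepted majority subset.
        X = np.vstack([X_min, X_maj[idx]])
        y = np.concatenate([np.full(len(X_min), min_label),
                            np.full(len(idx), maj_label)])
        clf = LogisticRegression().fit(X, y)
        loss = log_loss(y_val, clf.predict_proba(X_val),
                        labels=[maj_label, min_label])
        if loss < best_loss:               # accept: the candidate improved loss
            best_loss = loss
            accepted.append(i)             # otherwise reject as redundant
    return np.asarray(accepted)
```

In practice the paper's approach replaces this brute-force greedy loop with a differentiable bilevel objective, but the sketch captures the core proxy: a majority point earns its place only by improving the model's loss.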