Do we need rebalancing strategies? A theoretical and empirical study around SMOTE and its variants

📅 2024-02-06
📈 Citations: 2
✹ Influential: 0
📄 PDF
đŸ€– AI Summary
This work investigates the necessity and efficacy of rebalancing strategies, particularly SMOTE and its variants, for imbalanced tabular data, through both theoretical analysis and empirical evaluation. The authors derive, for the first time, non-asymptotic upper bounds on the density induced by SMOTE, proving that under its default parameter it asymptotically reduces to duplicating the original minority samples, and that its density vanishes near the boundary of the support of the minority class distribution. Guided by this theory, they propose two new SMOTE variants. A comprehensive evaluation across 13 benchmark datasets, 10 rebalancing methods (including deep generative and diffusion-based approaches), and strong baselines such as LightGBM shows that competitive predictive performance is often achievable without any rebalancing in realistic scenarios; however, as class imbalance intensifies, the proposed variants significantly outperform standard SMOTE and state-of-the-art alternatives.

📝 Abstract
Synthetic Minority Oversampling Technique (SMOTE) is a common rebalancing strategy for handling imbalanced tabular data sets. However, few works analyze SMOTE theoretically. In this paper, we derive several non-asymptotic upper bounds on the SMOTE density. From these results, we prove that SMOTE (with its default parameter) tends to copy the original minority samples asymptotically. We confirm and illustrate this first theoretical behavior empirically on a real-world data set. Furthermore, we prove that the SMOTE density vanishes near the boundary of the support of the minority class distribution. We then adapt SMOTE based on our theoretical findings to introduce two new variants. These strategies are compared on 13 tabular data sets with 10 state-of-the-art rebalancing procedures, including deep generative and diffusion models. One of our key findings is that, for most data sets, applying no rebalancing strategy is competitive in terms of predictive performance, be it with LightGBM, tuned random forests, or logistic regression. However, when the imbalance ratio is artificially increased, one of our two modifications of SMOTE leads to promising predictive performance compared to SMOTE and other state-of-the-art strategies.
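The mechanism the paper analyzes, linear interpolation between a minority sample and one of its k nearest minority-class neighbors, can be sketched as follows. This is a minimal illustrative implementation of the standard SMOTE step, not the authors' code; the function name `smote_sample` and its parameters are assumptions for the example.

```python
import numpy as np

def smote_sample(X_min, k=5, n_new=100, rng=None):
    """Minimal SMOTE sketch: draw synthetic minority points by
    interpolating between a random minority sample and one of its
    k nearest neighbors within the minority class."""
    rng = np.random.default_rng(rng)
    n, d = X_min.shape
    # pairwise distances within the minority class
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)          # exclude self-matches
    knn = np.argsort(dist, axis=1)[:, :k]   # k nearest neighbors per sample
    new = np.empty((n_new, d))
    for t in range(n_new):
        i = rng.integers(n)                 # random minority sample
        j = knn[i, rng.integers(k)]         # one of its k neighbors, uniformly
        lam = rng.random()                  # interpolation weight in [0, 1]
        new[t] = X_min[i] + lam * (X_min[j] - X_min[i])
    return new
```

Because every synthetic point is a convex combination of two existing minority samples, no point can fall outside the convex hull of the minority class, which is the geometric root of the boundary behavior the paper proves.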
Problem

Research questions and friction points this paper is trying to address.

Theoretical analysis of SMOTE's density and asymptotic behavior
Evaluation of SMOTE variants vs. state-of-the-art rebalancing methods
Impact of imbalance ratio on rebalancing strategy effectiveness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Derives non-asymptotic upper bounds on SMOTE density
Introduces two new SMOTE variants based on theory
Benchmarks strategies against deep generative and diffusion-based rebalancing methods
Authors
Abdoulaye Sakho (Artefact Research Center, Paris, France; LPSM, Sorbonne Université, Paris, France)
Emmanuel Malherbe (Artefact Research Center, Paris, France)
Erwan Scornet (Professor, Sorbonne Université; Statistics, Machine Learning)