AI Summary
To address degraded predictive performance in sparse regions of the target distribution in regression tasks, this paper proposes SMOGAN, a two-stage oversampling framework: an existing oversampler first generates initial synthetic samples for sparse target regions, and a distribution-aware generative adversarial network (DistGAN) then refines them. DistGAN, described as the first GAN architecture designed specifically for imbalanced regression, acts as a filtering layer over the initial samples and is trained with an adversarial loss augmented by a maximum mean discrepancy (MMD) objective, which aligns the refined samples with the real joint feature-target distribution. Conventional linear interpolation and noise-addition strategies can misrepresent nonlinear feature-target relationships; the DistGAN refinement step is designed to reduce that mismatch. Experiments on 23 imbalanced regression datasets show that SMOGAN consistently outperforms the same oversampler used without the DistGAN filtering layer.
Abstract
Imbalanced regression refers to prediction tasks where the target variable is skewed. This skewness hinders machine learning models, especially neural networks, which concentrate on dense regions and therefore perform poorly on underrepresented (minority) samples. Despite the importance of this problem, only a few methods have been proposed for imbalanced regression. Many of the available solutions for imbalanced regression adapt techniques from the class imbalance domain, such as linear interpolation and the addition of Gaussian noise, to create synthetic data in sparse regions. However, in many cases, the underlying distribution of the data is complex and non-linear. Consequently, these approaches generate synthetic samples that do not accurately represent the true feature-target relationship. To overcome these limitations, we propose SMOGAN, a two-step oversampling framework for imbalanced regression. In Stage 1, an existing oversampler generates initial synthetic samples in sparse target regions. In Stage 2, we introduce DistGAN, a distribution-aware GAN that serves as SMOGAN's filtering layer and refines these samples via adversarial loss augmented with a Maximum Mean Discrepancy objective, aligning them with the true joint feature-target distribution. Extensive experiments on 23 imbalanced datasets show that SMOGAN consistently outperforms the default oversampling method without the DistGAN filtering layer.
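The Maximum Mean Discrepancy term mentioned in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the Gaussian (RBF) kernel, the fixed bandwidth, the biased estimator, and the helper names `rbf_kernel`, `mmd2`, and `joint` are all assumptions chosen for illustration, and the abstract does not say how the MMD term is weighted against the adversarial loss. The sketch only shows how MMD compares real and synthetic samples in the joint feature-target space.

```python
import numpy as np


def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between the rows of a and the rows of b."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative round-off
    return np.exp(-sq_dists / (2.0 * bandwidth**2))


def mmd2(real, synth, bandwidth=1.0):
    """Biased estimate of the squared MMD between two sample sets."""
    k_rr = rbf_kernel(real, real, bandwidth)
    k_ss = rbf_kernel(synth, synth, bandwidth)
    k_rs = rbf_kernel(real, synth, bandwidth)
    return k_rr.mean() + k_ss.mean() - 2.0 * k_rs.mean()


def joint(features, targets):
    """Stack features and target into one matrix so MMD sees the joint distribution."""
    return np.column_stack([features, targets])


# Toy data (hypothetical): a nonlinear feature-target relationship y = x0^2 + noise.
rng = np.random.default_rng(0)
x_real = rng.normal(size=(200, 2))
y_real = x_real[:, 0] ** 2 + 0.1 * rng.normal(size=200)
real = joint(x_real, y_real)

# Synthetic batch drawn from the same process: should be close to the real data.
x_same = rng.normal(size=(200, 2))
y_same = x_same[:, 0] ** 2 + 0.1 * rng.normal(size=200)
mmd_same = mmd2(real, joint(x_same, y_same))

# Synthetic batch with the wrong feature-target relationship (sign flipped).
y_wrong = -(x_same[:, 0] ** 2)
mmd_wrong = mmd2(real, joint(x_same, y_wrong))

print(f"MMD^2, same relationship:  {mmd_same:.4f}")
print(f"MMD^2, wrong relationship: {mmd_wrong:.4f}")
```

In this toy setup the batch with the flipped relationship yields a larger MMD estimate than the batch drawn from the real process, even though both share the same features. A refinement step that penalizes this quantity is therefore pushed toward samples whose features and targets are jointly plausible, which is the role the abstract assigns to the MMD objective.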