SMOGAN: Synthetic Minority Oversampling with GAN Refinement for Imbalanced Regression

📅 2025-04-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Regression models tend to predict poorly in sparse regions of the target distribution. To address this, the paper proposes SMOGAN, a two-stage oversampling framework: an existing oversampler first generates initial synthetic samples for the sparse regions, and a distribution-aware generative adversarial network (DistGAN) then refines them. DistGAN, presented as the first GAN architecture designed specifically for imbalanced regression, acts as SMOGAN's filtering layer: it is trained with an adversarial loss augmented by a maximum mean discrepancy (MMD) term so that the refined samples align with the real joint feature-target distribution. Conventional strategies such as linear interpolation or noise addition can misrepresent nonlinear feature-target relationships, and the refinement stage is intended to correct that bias. Experiments on 23 imbalanced regression datasets show that SMOGAN consistently outperforms the same oversampler used without the DistGAN filtering layer.

📝 Abstract

Imbalanced regression refers to prediction tasks where the target variable is skewed. This skewness hinders machine learning models, especially neural networks, which concentrate on dense regions and therefore perform poorly on underrepresented (minority) samples. Despite the importance of this problem, only a few methods have been proposed for imbalanced regression. Many of the available solutions for imbalanced regression adapt techniques from the class imbalance domain, such as linear interpolation and the addition of Gaussian noise, to create synthetic data in sparse regions. However, in many cases, the underlying distribution of the data is complex and non-linear. Consequently, these approaches generate synthetic samples that do not accurately represent the true feature-target relationship. To overcome these limitations, we propose SMOGAN, a two-step oversampling framework for imbalanced regression. In Stage 1, an existing oversampler generates initial synthetic samples in sparse target regions. In Stage 2, we introduce DistGAN, a distribution-aware GAN that serves as SMOGAN's filtering layer and refines these samples via adversarial loss augmented with a Maximum Mean Discrepancy objective, aligning them with the true joint feature-target distribution. Extensive experiments on 23 imbalanced datasets show that SMOGAN consistently outperforms the default oversampling method without the DistGAN filtering layer.
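As an illustration of the kind of Stage 1 oversampler the abstract refers to, the following is a minimal sketch of SMOTE-style linear interpolation applied to joint feature-target rows. It is not the authors' implementation: the function name `interpolate_rare`, the quantile-based definition of "rare", and all parameter defaults are assumptions made for this example.

```python
import numpy as np

def interpolate_rare(X, y, rare_quantile=0.9, n_new=50, k=3, seed=0):
    """Hypothetical Stage 1 stand-in: interpolate between rare samples.

    Assumption: a sample is "rare" when its target lies at or above the
    given quantile. The paper's own rarity criterion is not stated here.
    """
    rng = np.random.default_rng(seed)
    rare = y >= np.quantile(y, rare_quantile)
    # Stack features and target so both are interpolated together.
    Z = np.column_stack([X[rare], y[rare]])
    if len(Z) < 2:
        raise ValueError("need at least two rare samples to interpolate")
    k = min(k, len(Z) - 1)

    # Pairwise Euclidean distances among rare rows; exclude self-matches.
    d = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    neighbours = np.argsort(d, axis=1)[:, :k]

    # Pick a base row, one of its k nearest neighbours, and a mixing weight.
    base = rng.integers(0, len(Z), size=n_new)
    pick = neighbours[base, rng.integers(0, k, size=n_new)]
    t = rng.random((n_new, 1))
    Z_new = Z[base] + t * (Z[pick] - Z[base])
    return Z_new[:, :-1], Z_new[:, -1]
```

Because every synthetic row lies on a straight line between two real rows, this sketch also shows the limitation the abstract describes: when the feature-target relationship is curved, points on that line can fall off the true relationship, which is what a refinement stage is meant to correct.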

Problem

Research questions and friction points this paper is trying to address.

Addressing imbalanced regression with skewed target variables

Improving synthetic data generation for sparse regions

Enhancing model performance on underrepresented samples

Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-step oversampling framework for imbalanced regression

Uses DistGAN for distribution-aware synthetic sample refinement

Combines adversarial loss with Maximum Mean Discrepancy objective
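To make the Maximum Mean Discrepancy term concrete, here is a minimal NumPy sketch of a squared-MMD estimate between real and synthetic rows, where each row is a joint feature-target vector. The Gaussian (RBF) kernel, the single fixed `bandwidth`, and the biased estimator are assumptions chosen for brevity; the summary does not state which kernel or estimator the paper uses.

```python
import numpy as np

def rbf_kernel(A, B, bandwidth):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq, 0.0) / (2 * bandwidth**2))

def mmd2(real, synth, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between two samples."""
    return (rbf_kernel(real, real, bandwidth).mean()
            + rbf_kernel(synth, synth, bandwidth).mean()
            - 2 * rbf_kernel(real, synth, bandwidth).mean())

# Usage: compare a well-matched and a shifted synthetic sample.
rng = np.random.default_rng(0)
real = rng.normal(size=(100, 4))             # rows are [features..., target]
good = rng.normal(size=(100, 4))             # same distribution as `real`
shifted = rng.normal(loc=3.0, size=(100, 4)) # clearly different distribution
print(mmd2(real, good), mmd2(real, shifted))
```

The estimate is near zero when the two samples come from the same distribution and grows as they diverge. In a generator objective, a term like this would typically be added to the adversarial loss with a weighting coefficient; that weighting is not specified in this summary.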