Regression Augmentation With Data-Driven Segmentation

📅 2025-08-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

In regression tasks, a skewed target distribution creates imbalance: models fit dense regions well but underrepresent sparse regions that contain few samples. To address this, we propose a fully data-driven augmentation framework. Methodologically, we introduce a Mahalanobis-Gaussian Mixture Model (Mahalanobis-GMM) that adaptively identifies sparse regions in the joint feature-target space without relying on manually defined thresholds. We further combine deterministic nearest-neighbor matching with generative adversarial network (GAN)-based synthesis to augment rare samples. Experiments across 32 benchmark regression datasets show that our approach consistently outperforms state-of-the-art data augmentation methods. In particular, it improves generalization and prediction robustness for minority samples (those residing in low-density target regions) while maintaining accuracy across the full distribution.

📝 Abstract

Imbalanced regression arises when the target distribution is skewed, causing models to focus on dense regions and struggle with underrepresented (minority) samples. Despite its relevance across many applications, few methods have been designed specifically for this challenge. Existing approaches often rely on fixed, ad hoc thresholds to label samples as rare or common, overlooking the continuous complexity of the joint feature-target space and failing to represent the true underlying rare regions. To address these limitations, we propose a fully data-driven GAN-based augmentation framework that uses Mahalanobis-Gaussian Mixture Modeling (GMM) to automatically identify minority samples and employs deterministic nearest-neighbour matching to enrich sparse regions. Rather than relying on preset thresholds, our method lets the data determine which observations are truly rare. Evaluation on 32 benchmark imbalanced regression datasets demonstrates that our approach consistently outperforms state-of-the-art data augmentation methods.
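To make the minority-identification idea concrete, here is a minimal sketch, not the authors' implementation. It computes each sample's Mahalanobis distance in the joint feature-target space, fits a two-component Gaussian mixture to those distances, and flags the samples assigned to the far component. The abstract does not give the exact construction, so the two-component choice, the EM initialisation, the toy data, and the names `mahalanobis_distances`, `fit_two_gaussians_1d`, and `flag_minority` are all assumptions made for illustration.

```python
import numpy as np

def mahalanobis_distances(Z):
    """Distance of each row of Z from the sample mean under the sample covariance."""
    mu = Z.mean(axis=0)
    cov = np.cov(Z, rowvar=False)
    cov_inv = np.linalg.pinv(cov)  # pseudo-inverse tolerates near-singular covariance
    diff = Z - mu
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))

def fit_two_gaussians_1d(d, n_iter=200):
    """EM for a two-component 1-D Gaussian mixture (assumed setup, not from the paper)."""
    # Start one component at the median and one at the maximum so the
    # second component begins in the far tail of the distances.
    means = np.array([np.median(d), d.max()])
    variances = np.array([d.var(), d.var()]) + 1e-6
    weights = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibilities, computed in log space for stability.
        log_pdf = -0.5 * ((d[:, None] - means) ** 2 / variances
                          + np.log(2 * np.pi * variances))
        log_resp = np.log(weights) + log_pdf
        log_resp -= log_resp.max(axis=1, keepdims=True)
        resp = np.exp(log_resp)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update weights, means, and variances.
        nk = resp.sum(axis=0) + 1e-12
        weights = nk / d.size
        means = (resp * d[:, None]).sum(axis=0) / nk
        variances = (resp * (d[:, None] - means) ** 2).sum(axis=0) / nk + 1e-6
    return resp, means

def flag_minority(X, y):
    """Flag samples whose joint (X, y) distance falls in the far mixture component."""
    Z = np.column_stack([X, y])  # joint feature-target space
    d = mahalanobis_distances(Z)
    resp, means = fit_two_gaussians_1d(d)
    far = int(np.argmax(means))  # component with the larger mean distance
    return resp[:, far] > 0.5, d

# Toy data (hypothetical): a dense cluster plus a few samples with large targets.
rng = np.random.default_rng(0)
X_common = rng.normal(0.0, 1.0, size=(600, 2))
X_rare = rng.normal(8.0, 0.3, size=(6, 2))
X = np.vstack([X_common, X_rare])
y = X.sum(axis=1) + rng.normal(0.0, 0.1, size=len(X))
is_minority, dist = flag_minority(X, y)
print("flagged as minority:", int(is_minority.sum()), "of", len(y))
```

No rarity cutoff is set by hand here: the split between common and rare comes from the fitted mixture, which is the property the abstract emphasises.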

Problem

Research questions and friction points this paper is trying to address.

Addresses imbalanced regression with skewed target distributions

Identifies minority samples without fixed ad hoc thresholds

Enhances sparse regions using data-driven GAN-based augmentation

Innovation

Methods, ideas, or system contributions that make the work stand out.

GAN-based augmentation for imbalanced regression

Mahalanobis-GMM identifies minority samples automatically

Deterministic nearest-neighbour matching enriches sparse regions
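The matching step can be sketched as follows. The source does not specify the matching rule, so this is one plausible reading under stated assumptions: candidate synthetic rows (standing in for GAN output) are compared with real minority rows by Euclidean distance, and each real row keeps its single closest candidate. The function names `match_nearest_candidates` and `augment`, the Euclidean metric, and the sample arrays are hypothetical.

```python
import numpy as np

def match_nearest_candidates(minority, candidates):
    """For each real minority row, return the index of its closest candidate row."""
    # Pairwise squared Euclidean distances, shape (n_minority, n_candidates).
    diff = minority[:, None, :] - candidates[None, :, :]
    sq_dist = (diff ** 2).sum(axis=2)
    # argmin is deterministic: ties resolve to the lowest candidate index.
    return sq_dist.argmin(axis=1)

def augment(minority, candidates):
    """Append the matched candidates to the real minority rows."""
    idx = match_nearest_candidates(minority, candidates)
    selected = candidates[np.unique(idx)]  # drop repeats when rows share a match
    return np.vstack([minority, selected])

# Rows are joint (feature, feature, target) vectors; values are made up.
minority = np.array([[0.0, 0.0, 0.0],
                     [10.0, 10.0, 10.0]])
candidates = np.array([[0.1, 0.0, 0.0],    # near the first minority row
                       [5.0, 5.0, 5.0],    # far from both, never selected
                       [9.0, 10.0, 10.0],  # near the second minority row
                       [0.2, 0.0, 0.0]])   # near the first, but not the nearest
matched = match_nearest_candidates(minority, candidates)
augmented = augment(minority, candidates)
print(matched, augmented.shape)
```

Because selection uses `argmin` rather than random sampling, the same inputs always produce the same augmented set, and candidates that land far from every real minority row are never added.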
