Oversampling and Downsampling with Core-Boundary Awareness: A Data Quality-Driven Approach

📅 2025-09-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge in imbalanced classification where models struggle to distinguish boundary-critical samples from core-redundant ones, this paper proposes a core-boundary-aware data resampling framework. The method identifies decision-boundary-sensitive regions and high-density core regions, then applies boundary-enhancing oversampling to the former and core-compressing undersampling to the latter. According to the authors, the approach can be extended to text, multimodal, and self-supervised learning settings. Empirically, boundary oversampling improves the F1 score by up to 10% on 96% of the benchmark datasets, and core-aware reduction compresses datasets by up to 90% while preserving accuracy, which the abstract describes as making the reduced data "10 times more powerful" than the original dataset. Its core innovation lies in shifting sample importance assessment from class-frequency-driven heuristics to boundary-sensitivity-driven evaluation.

📝 Abstract
The effectiveness of machine learning models, particularly in imbalanced classification tasks, is often hindered by the failure to differentiate between critical instances near the decision boundary and redundant samples concentrated in the core of the data distribution. In this paper, we propose a method to systematically identify and differentiate between these two types of data. Through extensive experiments on multiple benchmark datasets, we show that the boundary data oversampling method improves the F1 score by up to 10% on 96% of the datasets, whereas our core-aware reduction method compresses datasets up to 90% while preserving their accuracy, making it 10 times more powerful than the original dataset. Beyond imbalanced classification, our method has broader implications for efficient model training, particularly in computationally expensive domains such as Large Language Model (LLM) training. By prioritizing high-quality, decision-relevant data, our approach can be extended to text, multimodal, and self-supervised learning scenarios, offering a pathway to faster convergence, improved generalization, and significant computational savings. This work paves the way for future research in data-efficient learning, where intelligent sampling replaces brute-force expansion, driving the next generation of AI advancements. Our code is available as a Python package at https://pypi.org/project/adaptive-resampling/.

Problem

Research questions and friction points this paper is trying to address.

Addresses imbalanced classification by differentiating boundary and core data instances
Proposes oversampling boundary data and reducing core data to enhance model performance
Aims to improve computational efficiency in expensive training regimes, such as LLM training, through quality-driven sampling
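
The distinction between boundary and core instances can be made concrete with a small sketch. The listing does not describe how the paper scores instances, so the code below is only an illustrative assumption and is not the paper's algorithm or the API of the adaptive-resampling package: it uses a hypothetical score, the fraction of a point's k nearest neighbours that carry a different label. The function name `boundary_scores` and the parameter `k` are made up for this example, and pure standard-library Python is used for portability.

```python
import math


def boundary_scores(points, labels, k=3):
    """Fraction of each point's k nearest neighbours with a different label.

    Hypothetical proxy (an assumption, not the paper's definition):
    a score above 0 suggests the point lies near the decision boundary,
    a score of 0 suggests it lies in the core of its own class.
    """
    scores = []
    for i, p in enumerate(points):
        # Brute-force distances from point i to every other point, O(n^2).
        neighbours = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )[:k]
        differing = sum(1 for _, j in neighbours if labels[j] != labels[i])
        scores.append(differing / k)
    return scores


# Toy 1-D data: class 0 on the left, class 1 on the right.
points = [(0.0,), (0.1,), (0.2,), (0.9,), (1.1,), (1.8,), (1.9,), (2.0,)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
scores = boundary_scores(points, labels, k=3)

for p, y, s in zip(points, labels, scores):
    kind = "boundary" if s > 0 else "core"
    print(f"x={p[0]:.1f} label={y} score={s:.2f} -> {kind}")
```

In this toy set only the two points that face the other class (x=0.9 and x=1.1) receive a non-zero score; the remaining points sit inside their own class and score 0.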

Innovation

Methods, ideas, or system contributions that make the work stand out.

Core-boundary-aware oversampling for imbalanced classification
Core-aware reduction compresses datasets while preserving accuracy
Prioritizes high-quality data for efficient model training
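
The two contributions listed above, oversampling near the boundary and reducing the core, can be sketched together. This is a minimal illustration under stated assumptions, not the authors' method: each point is assumed to come with a precomputed boundary score in [0, 1] (for example, a nearest-neighbour label-disagreement fraction), and the function `resample`, the `threshold` rule, the `keep_core` fraction, and the SMOTE-style interpolation are all hypothetical choices made for this example.

```python
import random


def resample(points, labels, scores, minority, threshold=0.3,
             keep_core=0.5, seed=0):
    """Keep boundary points, add synthetic minority points, thin the core.

    Assumed rule: score >= threshold means "boundary", otherwise "core".
    Core points are kept with probability keep_core.
    """
    rng = random.Random(seed)
    minority_pts = [p for p, y in zip(points, labels) if y == minority]
    new_points, new_labels = [], []
    for p, y, s in zip(points, labels, scores):
        if s >= threshold:
            # Boundary point: always keep the original.
            new_points.append(p)
            new_labels.append(y)
            others = [m for m in minority_pts if m != p]
            if y == minority and others:
                # SMOTE-style synthetic point on the segment towards
                # another randomly chosen minority point.
                q = rng.choice(others)
                t = rng.random()
                new_points.append(tuple(a + t * (b - a) for a, b in zip(p, q)))
                new_labels.append(y)
        elif rng.random() < keep_core:
            # Core point: kept only with probability keep_core.
            new_points.append(p)
            new_labels.append(y)
    return new_points, new_labels


# Toy 2-D data with illustrative, hand-assigned scores.
points = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.1), (1.0, 1.0), (1.2, 1.0), (2.0, 2.0)]
labels = [0, 0, 0, 0, 1, 1]
scores = [0.0, 0.0, 0.0, 1 / 3, 2 / 3, 0.0]

new_points, new_labels = resample(points, labels, scores, minority=1)
print(len(points), "->", len(new_points), "points;", new_labels)
```

Setting `keep_core=0.0` drops every core point, while `keep_core=1.0` keeps the full dataset and only adds the synthetic boundary samples, so the two parameters let the compression and the oversampling be tuned separately.
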
Samir Brahim Belhaouari
Hamad Bin Khalifa University
Mathematics of Machine Learning, Scientific Machine Learning, Computational Science
Yunis Carreon Kahalan
Hamad Bin Khalifa University, Doha, Qatar.
Humaira Shaffique
Hamad Bin Khalifa University, Doha, Qatar.
Ismael Belhaouari
Maastricht University, Maastricht, Netherlands.
Ashhadul Islam
KTH Royal Institute of Technology, Stockholm, Sweden.