Oversampling and Downsampling with Core-Boundary Awareness: A Data Quality-Driven Approach

📅 2025-09-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the challenge in imbalanced classification where models struggle to distinguish boundary-critical samples from core-redundant ones, this paper proposes a core-boundary-aware data resampling framework. Methodologically, it models the geometry of the data distribution to identify decision-boundary-sensitive regions and high-density core regions, then applies boundary-enhancing oversampling to the former and core-compressing undersampling to the latter. The authors suggest the framework can be extended to text, multimodal, and self-supervised learning settings. Empirically, boundary oversampling improves the F1 score by up to 10% on 96% of the benchmark datasets, and core-aware reduction compresses datasets by up to 90% while preserving accuracy, which the authors describe as making the reduced data 10 times more powerful than the original and which lowers computational overhead. Its core innovation lies in shifting sample importance assessment from class-frequency-driven heuristics to geometry-aware, boundary-sensitivity-driven evaluation.

📝 Abstract

The effectiveness of machine learning models, particularly in imbalanced classification tasks, is often hindered by the failure to differentiate between critical instances near the decision boundary and redundant samples concentrated in the core of the data distribution. In this paper, we propose a method to systematically identify and differentiate between these two types of data. Through extensive experiments on multiple benchmark datasets, we show that the boundary data oversampling method improves the F1 score by up to 10% on 96% of the datasets, whereas our core-aware reduction method compresses datasets by up to 90% while preserving their accuracy, making the reduced data 10 times more powerful than the original dataset. Beyond imbalanced classification, our method has broader implications for efficient model training, particularly in computationally expensive domains such as Large Language Model (LLM) training. By prioritizing high-quality, decision-relevant data, our approach can be extended to text, multimodal, and self-supervised learning scenarios, offering a pathway to faster convergence, improved generalization, and significant computational savings. This work paves the way for future research in data-efficient learning, where intelligent sampling replaces brute-force expansion. Our code is available as a Python package at https://pypi.org/project/adaptive-resampling/ .
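The abstract does not spell out how boundary and core instances are told apart, so the following is only a minimal sketch of the general idea, not the authors' algorithm and not the API of the adaptive-resampling package. It assumes a neighbor-disagreement rule (a point is "boundary" when enough of its k nearest neighbors carry a different label) and a SMOTE-style interpolation step; the function names, `k`, and `threshold` are all hypothetical.

```python
import math
import random


def knn_indices(points, i, k):
    """Indices of the k nearest neighbors of points[i] (Euclidean, brute force)."""
    order = sorted(
        (j for j in range(len(points)) if j != i),
        key=lambda j: math.dist(points[i], points[j]),
    )
    return order[:k]


def boundary_mask(points, labels, k=3, threshold=0.3):
    """Mark a point as boundary when at least `threshold` of its k nearest
    neighbors have a different label; every other point counts as core.
    This rule is an illustrative assumption, not the paper's criterion."""
    mask = []
    for i in range(len(points)):
        neighbors = knn_indices(points, i, k)
        disagree = sum(labels[j] != labels[i] for j in neighbors) / len(neighbors)
        mask.append(disagree >= threshold)
    return mask


def oversample_boundary(points, labels, minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority points by interpolating between a
    boundary minority point and a randomly chosen minority point."""
    rng = random.Random(seed)
    mask = boundary_mask(points, labels, k)
    minority_pts = [p for p, y in zip(points, labels) if y == minority]
    seeds = [p for p, y, b in zip(points, labels, mask) if y == minority and b]
    if not seeds or len(minority_pts) < 2:
        return []  # nothing to interpolate from
    synthetic = []
    for _ in range(n_new):
        a = rng.choice(seeds)
        b = rng.choice(minority_pts)
        t = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic
```

Calling `oversample_boundary(points, labels, minority=1, n_new=5)` returns five new points that can be appended to the training set with the minority label. Interpolating only from boundary seeds concentrates the synthetic samples near the region where the classes meet, which is the behavior the abstract attributes to boundary oversampling.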

Problem

Research questions and friction points this paper is trying to address.

Addresses imbalanced classification by differentiating boundary and core data instances

Proposes oversampling boundary data and reducing core data to enhance model performance

Aims to improve computational efficiency in expensive training settings, such as LLM training, through quality-driven sampling

Innovation

Methods, ideas, or system contributions that make the work stand out.

Core-boundary aware oversampling for imbalanced classification

Core-aware reduction compresses datasets while preserving accuracy

Prioritizes high-quality data for efficient model training
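The core-aware reduction idea above can be illustrated with a short sketch. This is an assumption-laden toy, not the authors' method: it treats a point as boundary when any of its k nearest neighbors has a different label, keeps all boundary points, and keeps only a random fraction of the remaining core points. The names `is_boundary`, `reduce_core`, and `keep_ratio` are hypothetical.

```python
import math
import random


def is_boundary(points, labels, i, k=3):
    """True when any of the k nearest neighbors of points[i] has another label.
    Illustrative rule only; the paper's criterion may differ."""
    order = sorted(
        (j for j in range(len(points)) if j != i),
        key=lambda j: math.dist(points[i], points[j]),
    )
    return any(labels[j] != labels[i] for j in order[:k])


def reduce_core(points, labels, keep_ratio=0.1, k=3, seed=0):
    """Keep every boundary point and a random `keep_ratio` share of core points."""
    rng = random.Random(seed)
    boundary, core = [], []
    for i in range(len(points)):
        (boundary if is_boundary(points, labels, i, k) else core).append(i)
    # Round up so a non-empty core always contributes at least one point.
    n_keep = math.ceil(keep_ratio * len(core)) if core else 0
    kept = sorted(boundary + rng.sample(core, n_keep))
    return [points[i] for i in kept], [labels[i] for i in kept]
```

Note that `keep_ratio` applies to core points only, so the overall compression depends on how many points are classified as boundary. A reduction of about 90% overall, as reported in the abstract, would require most of the dataset to be core.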
