Addressing Imbalance in Multi-Label Data via Label-Specific Distance-based Oversampling

📅 2026-06-04

🤖 AI Summary
This work addresses label imbalance in multi-label classification, which biases models toward frequent labels and degrades overall performance. To mitigate this issue, the authors propose an oversampling method based on label-specific distances. The approach introduces a label-aware distance metric that constructs a weighted feature subspace tailored to each label, so that label-consistent nearest neighbors can be selected for synthetic sample generation. This strategy is intended to improve both the label consistency and the boundary representativeness of the synthesized instances. Experiments on multiple multi-label datasets indicate that the proposed method outperforms existing oversampling techniques under various base classifiers.
📝 Abstract
Complex, imbalanced label distributions pose a crucial challenge to multi-label classification, as most classifiers are biased towards the majority class and high-frequency labels. Oversampling is an efficient and flexible solution that augments instances to provide a more balanced training dataset for multi-label classifiers. Most existing oversampling methods create synthetic instances heuristically, relying on neighborhood information retrieved using Euclidean distance over the entire feature space. However, they fail to consider that features vary in semantic relevance to different labels, which leads to label inconsistency among proximate neighbors and, in turn, to label confusion and overfitting to synthetic instances. To overcome this issue, we propose a novel sampling approach called Label-Specific Distance-based Multi-Label Oversampling (LSDMLO) that creates more useful and well-labeled synthetic instances to address the imbalance in multi-label datasets. LSDMLO derives a label-specific distance to identify label-consistent neighbors in a weighted, label-pertinent feature space, which facilitates selecting seed instances that express more label correlations in boundary areas and generating synthetic instances aligned with the label distribution of the original data. Comprehensive experiments verify that LSDMLO outperforms state-of-the-art multi-label sampling approaches under various base classifiers.
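The idea of a label-specific distance can be illustrated with a minimal sketch. This is not the LSDMLO algorithm itself: the abstract does not specify how feature weights are derived or how seeds are chosen, so the weighting rule (standardized gap between class-conditional means), the restriction of neighbors to instances carrying the target label, the label-assignment rule, and all function names below are assumptions made purely for illustration.

```python
import numpy as np

def label_specific_weights(X, y):
    """Hypothetical weighting: a feature gets more weight for this label when
    its mean differs more between positive and negative instances."""
    gap = np.abs(X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0))
    w = gap / (X.std(axis=0) + 1e-12)
    total = w.sum()
    # Fall back to uniform weights if no feature separates the classes.
    return w / total if total > 0 else np.full(X.shape[1], 1.0 / X.shape[1])

def weighted_distances(X, x, w):
    """Weighted Euclidean distance from x to every row of X."""
    return np.sqrt((((X - x) ** 2) * w).sum(axis=1))

def oversample_label(X, Y, label, k=3, n_new=5, rng=None):
    """Create up to n_new synthetic instances for one label (illustrative)."""
    rng = np.random.default_rng(rng)
    y = Y[:, label]
    w = label_specific_weights(X, y)
    seeds = np.flatnonzero(y == 1)
    X_new, Y_new = [], []
    for _ in range(n_new):
        s = rng.choice(seeds)
        d = weighted_distances(X, X[s], w)
        d[s] = np.inf        # exclude the seed itself
        d[y == 0] = np.inf   # assumption: keep only neighbors carrying the label
        nbrs = np.argsort(d)[:k]
        nbrs = nbrs[np.isfinite(d[nbrs])]
        if nbrs.size == 0:
            continue
        n = rng.choice(nbrs)
        lam = rng.random()
        # SMOTE-style interpolation between the seed and the chosen neighbor.
        X_new.append(X[s] + lam * (X[n] - X[s]))
        # Assumption: copy the label set of the closer parent.
        Y_new.append(Y[s] if lam < 0.5 else Y[n])
    return np.array(X_new), np.array(Y_new)
```

In this sketch, a feature that barely separates positives from negatives for a given label contributes little to that label's distance, so a noisy but large-scale feature cannot dominate neighbor selection the way it would under plain Euclidean distance.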
Problem

Research questions and friction points this paper is trying to address.

multi-label classification
class imbalance
oversampling
label inconsistency
feature relevance
Innovation

Methods, ideas, or system contributions that make the work stand out.

label-specific distance
multi-label oversampling
feature weighting
label consistency
imbalanced data
Bin Liu
Key Laboratory of Data Engineering and Visual Computing, Chongqing University of Posts and Telecommunications, China
Jun Wu
Key Laboratory of Data Engineering and Visual Computing, Chongqing University of Posts and Telecommunications, China; School of Computer Science and Technology, Chongqing University of Posts and Telecommunications, China
Haoyu Peng
Key Laboratory of Data Engineering and Visual Computing, Chongqing University of Posts and Telecommunications, China; School of Computer Science and Technology, Chongqing University of Posts and Telecommunications, China
Ao Zhou
Nanjing University
MLLMs, LLMs, Data Mining
Jin Wang
Yunnan University
Sentiment Analysis, Natural Language Processing
QiaoSong Chen
Key Laboratory of Data Engineering and Visual Computing, Chongqing University of Posts and Telecommunications, China
Grigorios Tsoumakas
Aristotle University of Thessaloniki
Machine Learning, Data Mining, Knowledge Discovery, Natural Language Processing