Extrapolated Markov Chain Oversampling Method for Imbalanced Text Classification

📅 2025-09-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address poor minority-class recognition in text classification under severe class imbalance, this paper proposes Markov Chain-based Cross-Class Oversampling (MCOS). MCOS leverages transition probabilities estimated partly from majority-class word sequences to model the linguistic structure of the minority class, enabling extrapolative generation of semantically coherent and distributionally consistent synthetic texts while allowing the minority-class feature space to expand beyond its observed vocabulary. Its core contribution is the integration of Markov chain modeling with cross-class knowledge transfer, enabling structure-aware synthetic sample generation. Experiments on multiple real-world datasets show that MCOS outperforms mainstream oversampling methods, including SMOTE, ADASYN, and BERT-based approaches, in both macro-F1 score and minority-class accuracy, with robust improvements even under extreme imbalance ratios (e.g., 1:100).

📝 Abstract
Text classification is the task of automatically assigning correct labels to text documents from a predefined set of categories. In real-life (text) classification tasks, observations and misclassification costs are often unevenly distributed between the classes, known as the problem of imbalanced data. Synthetic oversampling is a popular approach to imbalanced classification. The idea is to generate synthetic observations in the minority class to balance the classes in the training set. Many general-purpose oversampling methods can be applied to text data; however, imbalanced text data poses a number of distinctive difficulties that stem from the unique nature of text compared to other domains. One such factor is that when the sample size of text increases, the sample vocabulary (i.e., feature space) is likely to grow as well. We introduce a novel Markov chain-based text oversampling method. The transition probabilities are estimated from the minority class but also partly from the majority class, thus allowing the minority feature space to expand in oversampling. We evaluate our approach against prominent oversampling methods and show that it produces highly competitive results in several real-data examples, especially when the imbalance is severe.
Problem

Research questions and friction points this paper is trying to address.

Addresses the challenge of imbalanced text classification
Proposes Markov chain oversampling to expand the minority class
Mitigates uncontrolled feature-space growth in synthetic text generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Markov chain-based oversampling designed for text data
Expands the minority feature space using majority-class transitions
Highly competitive results under severe imbalance
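The core mechanism described above, estimating word-transition probabilities mostly from the minority class but partly from the majority class, then sampling synthetic documents from the blended chain, can be illustrated with a minimal sketch. This is not the authors' implementation: the bigram model, the 0.8/0.2 mixing weight, and all function names are illustrative assumptions.

```python
import random
from collections import defaultdict

def estimate_transitions(docs):
    """Count word-to-word (bigram) transitions over whitespace-tokenized docs."""
    counts = defaultdict(lambda: defaultdict(int))
    for doc in docs:
        tokens = doc.split()
        for a, b in zip(tokens, tokens[1:]):
            counts[a][b] += 1
    return counts

def mix_transitions(minority, majority, weight=0.8):
    """Blend minority-class transition probabilities with majority-class ones,
    so generation can extrapolate to words unseen in the minority class.
    The weight value is an illustrative choice, not taken from the paper."""
    mixed = defaultdict(dict)
    for w in set(minority) | set(majority):
        total_min = sum(minority[w].values()) or 1
        total_maj = sum(majority[w].values()) or 1
        for nxt in set(minority[w]) | set(majority[w]):
            p = (weight * minority[w].get(nxt, 0) / total_min
                 + (1 - weight) * majority[w].get(nxt, 0) / total_maj)
            if p > 0:
                mixed[w][nxt] = p
    return mixed

def generate(mixed, start, length=20, seed=None):
    """Sample a synthetic document by walking the blended Markov chain."""
    rng = random.Random(seed)
    word, out = start, [start]
    for _ in range(length - 1):
        nexts = mixed.get(word)
        if not nexts:
            break
        candidates, probs = zip(*nexts.items())
        word = rng.choices(candidates, weights=probs, k=1)[0]
        out.append(word)
    return " ".join(out)
```

A synthetic minority document could then be drawn with `generate(mix_transitions(minority_counts, majority_counts), start_word)`; because majority-class transitions contribute probability mass, the walk can reach words that never occur in minority training texts, which is the feature-space expansion the abstract describes.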
Aleksi Avela
Aalto University, School of Science, Department of Mathematics and Systems Analysis
Pauliina Ilmonen
Aalto University, School of Science, Department of Mathematics and Systems Analysis
FDA, ICA, ICS, Extreme value theory, Epidemiology