AI Summary
This study addresses the estimation bias and degraded classification performance of semi-supervised Gaussian mixture models when labels are missing at random (MAR). The authors propose an approach that jointly models the data-generating process and the label-missingness mechanism. Specifically, the probability that a label is missing is modeled as a function of classification uncertainty, quantified via margin confidence. An Aranda-Ordaz link function is introduced to flexibly capture the asymmetric relationship between this uncertainty and the missingness probability. Parameter estimation and label imputation are carried out through an Expectation/Conditional Maximization (ECM) algorithm. Experimental results demonstrate that, with high proportions of labels missing under MAR, the proposed method significantly improves classification accuracy and robustness, alleviating the systematic bias induced by ignoring the missingness mechanism.
Abstract
This paper presents a semi-supervised learning framework for Gaussian mixture modelling under a Missing at Random (MAR) mechanism. The method explicitly parameterizes the missingness mechanism by modelling the probability of missingness as a function of classification uncertainty. To quantify classification uncertainty, we introduce margin confidence and incorporate the Aranda-Ordaz (AO) link function to flexibly capture the asymmetric relationship between uncertainty and the probability of missingness. Based on this formulation, we develop an efficient Expectation/Conditional Maximization (ECM) algorithm that jointly estimates all parameters of both the Gaussian mixture model (GMM) and the missingness mechanism, and subsequently imputes the missing labels with a Bayesian classifier derived from the fitted mixture model. This method alleviates the bias induced by ignoring the missingness mechanism while enhancing the robustness of semi-supervised learning. The resulting uncertainty-aware framework delivers reliable classification performance in realistic MAR scenarios with substantial proportions of missing labels.
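To make the missingness model concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes margin confidence is the gap between the two largest posterior class probabilities (the usual definition; the paper's exact form may differ), and that the linear predictor is `xi0 + xi1 * margin`, which is an illustrative assumption. The inverse link is the standard Aranda-Ordaz asymmetric family, p = 1 - (1 + alpha * exp(eta))^(-1/alpha), which reduces to the logistic link at alpha = 1 and to the complementary log-log link as alpha approaches 0. All function and parameter names (`gmm_posteriors`, `xi0`, `xi1`, `alpha`) are hypothetical.

```python
import numpy as np

def gmm_posteriors(X, weights, means, covs):
    """Posterior class probabilities tau[i, k] under a fitted Gaussian mixture."""
    n, d = X.shape
    log_joint = np.empty((n, len(weights)))
    for k, (w, mu, cov) in enumerate(zip(weights, means, covs)):
        diff = X - mu
        _, logdet = np.linalg.slogdet(cov)
        maha = np.einsum("ij,ij->i", diff @ np.linalg.inv(cov), diff)
        log_joint[:, k] = np.log(w) - 0.5 * (d * np.log(2 * np.pi) + logdet + maha)
    # Subtract the row maximum before exponentiating for numerical stability.
    log_joint -= log_joint.max(axis=1, keepdims=True)
    tau = np.exp(log_joint)
    return tau / tau.sum(axis=1, keepdims=True)

def margin_confidence(tau):
    """Assumed definition: largest posterior minus second-largest posterior."""
    top2 = np.sort(tau, axis=1)[:, -2:]
    return top2[:, 1] - top2[:, 0]

def aranda_ordaz_inverse(eta, alpha):
    """Inverse Aranda-Ordaz asymmetric link, restricted here to alpha >= 0."""
    if alpha < 0:
        raise ValueError("this sketch only handles alpha >= 0")
    eta = np.asarray(eta, dtype=float)
    if alpha == 0:
        # Limiting case: complementary log-log link.
        return 1.0 - np.exp(-np.exp(eta))
    return 1.0 - np.power(1.0 + alpha * np.exp(eta), -1.0 / alpha)

def missing_probability(tau, xi0, xi1, alpha):
    """Probability that a label is missing, driven by margin confidence.

    The linear predictor xi0 + xi1 * margin is an illustrative assumption.
    """
    return aranda_ordaz_inverse(xi0 + xi1 * margin_confidence(tau), alpha)

# Two well-separated classes; the first point lies on the decision boundary.
weights = np.array([0.5, 0.5])
means = np.array([[-2.0, 0.0], [2.0, 0.0]])
covs = [np.eye(2), np.eye(2)]
X = np.array([[0.0, 0.0], [3.0, 0.0]])

tau = gmm_posteriors(X, weights, means, covs)
p_missing = missing_probability(tau, xi0=0.0, xi1=-3.0, alpha=1.0)
```

With a negative `xi1`, the ambiguous point on the boundary receives a higher missingness probability than the confidently classified one, which is the uncertainty-driven pattern the abstract describes. In the full method these missingness parameters would be estimated jointly with the mixture parameters by ECM rather than fixed by hand as they are here.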