🤖 AI Summary
This work addresses the fundamental question of why unlabeled data improve generalization in semi-supervised learning (SSL), focusing on the challenge of modeling complex causal relationships between features and labels in realistic settings. We propose the first SSL framework applicable to general causal graph structures: leveraging a causal generative model, it infers latent causal mechanisms from unlabeled data and synthesizes high-fidelity pseudo-labels—thereby enhancing discriminative model performance without relying on strong distributional or structural assumptions. Our approach integrates causal graph identifiability, counterfactual reasoning, and consistency regularization, supporting diverse causal structures including confounding, mediation, and backdoor paths. Extensive experiments on synthetic data and multiple real-world benchmarks—including medical and image domains—demonstrate consistent superiority over state-of-the-art SSL and causal learning methods, achieving average accuracy gains of 3.2–7.8 percentage points.
📝 Abstract
Semi-supervised learning (SSL) aims to train a machine learning model using both labelled and unlabelled data. While the unlabelled data have been used in various ways to improve the prediction accuracy, the reason why unlabelled data could help is not fully understood. One interesting and promising direction is to understand SSL from a causal perspective. In light of the independent causal mechanisms principle, the unlabelled data can be helpful when the label causes the features but not vice versa. However, the causal relations between the features and labels can be complex in real world applications. In this paper, we propose a SSL framework that works with general causal models in which the variables have flexible causal relations. More specifically, we explore the causal graph structures and design corresponding causal generative models which can be learned with the help of unlabelled data. The learned causal generative model can generate synthetic labelled data for training a more accurate predictive model. We verify the effectiveness of our proposed method by empirical studies on both simulated and real data.