Dependable Exploitation of High-Dimensional Unlabeled Data in an Assumption-Lean Framework

📅 2026-03-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses high-dimensional semi-supervised regression, where existing methods can fail because they rely on a correctly specified conditional mean function. We propose a debiased estimator that safely leverages unlabeled data to improve estimation efficiency even when the conditional mean is misspecified or cannot be consistently estimated. The approach is robust to this kind of model misspecification and extends to general M-estimation. Theoretical analysis and empirical experiments indicate that the proposed estimator is at least as efficient as the purely supervised benchmark across the settings considered, and that it yields substantial efficiency gains when the conditional mean can be estimated reasonably well, so that inference is both reliable and efficient.
📝 Abstract
Semi-supervised learning has attracted significant attention due to the proliferation of applications featuring limited labeled data but abundant unlabeled data. In this paper, we examine the statistical inference problem in an assumption-lean framework which involves a high-dimensional regression parameter, defined as the minimizer of the least-squares loss, within the context of semi-supervised learning. We investigate when and how unlabeled data can enhance the estimation efficiency of a regression parameter functional. First, we demonstrate that a straightforward debiased estimator can only be more efficient than its supervised counterpart if the unknown conditional mean function can be consistently estimated at an appropriate rate. Otherwise, incorporating unlabeled data can actually be counterproductive. To address this vulnerability, we propose a novel estimator guaranteed to be at least as efficient as the supervised baseline, even when the conditional mean function is misspecified. This ensures the dependable use of unlabeled data for statistical inference. Finally, we extend our approach to the general M-estimation framework, and demonstrate the effectiveness of our methodology through comprehensive simulation studies and a real data application.
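The contrast the abstract draws between a straightforward estimator and a safe one can be illustrated in a much simpler setting. The sketch below is not the paper's high-dimensional estimator. It is a one-dimensional control-variate analogue for estimating a mean E[Y], and every name in it (`naive_ss_mean`, `safe_ss_mean`, `simulate`, the working model `f(x) = -3x`, the sample sizes) is a hypothetical choice made for illustration. The naive version trusts a deliberately misspecified working model fully, while the safe version weights the unlabeled-data correction by an estimated variance-minimizing coefficient, which in large samples makes it no less efficient than the labeled-only mean.

```python
import numpy as np


def supervised_mean(y_lab):
    """Labeled-only baseline: the sample mean of Y."""
    return float(np.mean(y_lab))


def naive_ss_mean(y_lab, f_lab, f_unl):
    """Correction with weight 1: helpful only if f predicts Y well."""
    return float(np.mean(y_lab) - (np.mean(f_lab) - np.mean(f_unl)))


def safe_ss_mean(y_lab, f_lab, f_unl):
    """Correction with an estimated variance-minimizing weight.

    For independent labeled (size n) and unlabeled (size N) samples,
    Var(w) = Var(Y)/n - 2 w Cov(Y, f)/n + w^2 Var(f) (1/n + 1/N),
    which is minimized at w = Cov(Y, f) / (Var(f) * (1 + n/N)).
    """
    n, N = len(y_lab), len(f_unl)
    var_f = np.var(f_lab, ddof=1)
    if var_f == 0.0:
        # A constant working model carries no information; fall back.
        return supervised_mean(y_lab)
    cov_yf = np.cov(y_lab, f_lab, ddof=1)[0, 1]
    w = cov_yf / (var_f * (1.0 + n / N))
    return float(np.mean(y_lab) - w * (np.mean(f_lab) - np.mean(f_unl)))


def simulate(reps=2000, n=100, N=1000, seed=0):
    """Monte Carlo mean squared error of the three estimators.

    Assumed toy model: X ~ N(0, 1), Y = X + noise, so E[Y] = 0.
    The working model f(x) = -3x is deliberately misspecified.
    """
    rng = np.random.default_rng(seed)
    sq_err = {"supervised": [], "naive": [], "safe": []}
    for _ in range(reps):
        x_lab = rng.normal(size=n)
        y_lab = x_lab + rng.normal(size=n)
        x_unl = rng.normal(size=N)
        f_lab, f_unl = -3.0 * x_lab, -3.0 * x_unl
        # The true mean is 0, so the squared estimate is the squared error.
        sq_err["supervised"].append(supervised_mean(y_lab) ** 2)
        sq_err["naive"].append(naive_ss_mean(y_lab, f_lab, f_unl) ** 2)
        sq_err["safe"].append(safe_ss_mean(y_lab, f_lab, f_unl) ** 2)
    return {name: float(np.mean(vals)) for name, vals in sq_err.items()}
```

Calling `simulate()` returns a dictionary of Monte Carlo mean squared errors. Under this toy setup the naive correction should do worse than the labeled-only mean, while the weighted version should not, which mirrors the qualitative claim in the abstract without reproducing its method.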
Problem

Research questions and friction points this paper is trying to address.

semi-supervised learning
high-dimensional data
unlabeled data
statistical inference
assumption-lean framework
Innovation

Methods, ideas, or system contributions that make the work stand out.

semi-supervised learning
assumption-lean framework
debiased estimator
statistical efficiency
M-estimation
Chao Ying
University of Wisconsin-Madison
Siyi Deng
Cornell University
Yang Ning
Cornell University
Jiwei Zhao
University of Wisconsin-Madison
Statistics, Machine Learning, Data Science, Biostatistics, Biomedical Data Science
Heping Zhang
Yale University