🤖 AI Summary
This work addresses the difficulty that traditional clustering methods do not quantify uncertainty in cluster label assignments and offer no finite-sample statistical coverage guarantees. The authors propose a weighted conformal clustering approach that frames clustering as a conditional label-distribution shift problem. Weights correct the mismatch between the pseudo-labels produced by a data-dependent clustering algorithm and the latent true labels, which yields confidence sets for cluster labels; an oracle version of the procedure attains finite-sample marginal coverage. The study combines conformal inference with a weighting mechanism for clustering, introduces a computationally tractable augmented calibration strategy, and derives an explicit bound on the coverage loss incurred when the weights are estimated. Empirical results indicate that, relative to the recently proposed split conformal clustering procedure, the method produces more informative (smaller) confidence sets, especially in nonlinear and high-dimensional settings.
📝 Abstract
Clustering is a central tool for discovering latent structure in unlabeled data; yet modern clustering pipelines often end with a hard assignment of each observation to a cluster without rigorous measures of assignment uncertainty. We propose a novel weighted conformal approach for constructing valid confidence sets for cluster labels. The key difficulty is that the labels available for calibration are not observed ground-truth labels, but synthetic labels produced by a data-dependent clustering algorithm. Our method develops a conformal inference algorithm that corrects the resulting mismatch with the latent target labels through weights by formulating conformal clustering as a conditional label-distribution shift problem. We first derive an oracle procedure that attains finite-sample marginal coverage and then develop a computationally tractable and implementable version using estimated conditional label probabilities and novel augmented calibration. We show that the coverage of the estimated-weight procedure depends on the estimator, giving an explicit bound on the loss relative to the nominal level. Empirical studies demonstrate that the proposed weighted approach offers improvements over the recently proposed split conformal clustering procedure in terms of informative confidence set size, especially in nonlinear and high-dimensional clustering applications.
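To make the weighting idea concrete, the following is a minimal sketch of generic weighted split conformal prediction with label-dependent weights, in the spirit of the label-shift correction described above. It is not the authors' procedure: the function names, the per-label weight vector `label_weights`, and the toy scores are all assumptions made for illustration, and the augmented calibration step and the weight estimator are not shown.

```python
import numpy as np


def weighted_quantile(scores, weights, test_weight, level):
    """Quantile at `level` of the weighted empirical distribution of `scores`,
    with an extra point mass `test_weight` placed at +infinity."""
    order = np.argsort(scores)
    s = np.asarray(scores, dtype=float)[order]
    w = np.asarray(weights, dtype=float)[order]
    # Normalise by the total mass, including the test point's mass at +inf.
    cum = np.cumsum(w) / (w.sum() + test_weight)
    # First sorted score whose cumulative mass reaches `level`.
    idx = np.searchsorted(cum, level, side="left")
    return s[idx] if idx < len(s) else np.inf


def weighted_conformal_set(cal_scores, cal_labels, test_scores,
                           label_weights, alpha=0.1):
    """Return the candidate labels kept in the confidence set.

    cal_scores[i]  : nonconformity score of calibration point i at its pseudo-label
    cal_labels[i]  : pseudo-label of calibration point i (integer in 0..K-1)
    test_scores[k] : nonconformity score of the test point for candidate label k
    label_weights  : hypothetical weight per label (assumed given or estimated)
    """
    label_weights = np.asarray(label_weights, dtype=float)
    w_cal = label_weights[np.asarray(cal_labels, dtype=int)]
    kept = []
    for k, s_k in enumerate(test_scores):
        # The threshold depends on the candidate label through its weight.
        q = weighted_quantile(cal_scores, w_cal, label_weights[k], 1 - alpha)
        if s_k <= q:
            kept.append(k)
    return kept


# Toy data (made up): nine calibration scores, two clusters.
cal_scores = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
cal_labels = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1])
test_scores = np.array([0.5, 0.85])

unweighted = weighted_conformal_set(cal_scores, cal_labels, test_scores,
                                    [1.0, 1.0], alpha=0.25)
reweighted = weighted_conformal_set(cal_scores, cal_labels, test_scores,
                                    [1.0, 3.0], alpha=0.25)
print(unweighted, reweighted)
```

With uniform weights the sketch reduces to ordinary split conformal prediction. Up-weighting a label raises the share of the calibration mass carried by that label's scores, which can change the threshold and hence the resulting set.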