🤖 AI Summary
Multivariate latent-variable regression models—such as Partial Least Squares (PLS), Principal Component Regression (PCR), and their kernelized variants—lack intrinsic uncertainty quantification, limiting their credibility in scientific modeling. This work introduces conformal inference to this class of models for the first time, proposing an input-dependent, data-adaptive prediction interval calibration framework that delivers finite-sample statistical guarantees without distributional assumptions. The method is theoretically rigorous, computationally efficient, and interpretable, and is applicable to complex regression tasks including near-infrared (NIR) spectroscopy and hyperspectral remote sensing. Empirical evaluation on synthetic data and real-world plant trait prediction demonstrates that the constructed 95% prediction intervals consistently achieve coverage rates close to the nominal level, significantly enhancing model reliability and robustness across diverse scenarios.
📝 Abstract
Uncertainty quantification is essential for scientific analysis, as it allows for the evaluation and interpretation of variability and reliability in complex systems and datasets. In their original form, multivariate statistical regression models (partial least-squares regression, PLS; principal component regression, PCR) along with their kernelized versions (kernel partial least-squares regression, K-PLS; kernel principal component regression, K-PCR) do not incorporate uncertainty quantification as part of their output. In this study, we propose a method inspired by conformal inference to estimate and calibrate the uncertainty of multivariate statistical models. The result of this method is a point prediction accompanied by prediction intervals that depend on the input data. We tested the proposed method on both traditional and kernelized versions of PLS and PCR. The method is demonstrated using synthetic data, as well as laboratory near-infrared (NIR) and airborne hyperspectral regression models for estimating functional plant traits. The method successfully identified the uncertain regions in the simulated data and matched the magnitude of the uncertainty. In real-case scenarios, the optimised model was neither overconfident nor underconfident when estimating from test data: for example, for a 95% prediction interval, 95% of the true observations fell within the prediction intervals.
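The calibration idea described in the abstract can be sketched with split conformal prediction around a latent-variable regressor. The sketch below is a minimal, self-contained illustration (not the paper's exact method): it fits a plain PCR model on synthetic low-rank data, computes absolute residuals on a held-out calibration set, and uses their finite-sample quantile as a fixed interval half-width. The paper's input-dependent intervals would replace this constant width with an adaptive nonconformity score; all data shapes and noise levels here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic latent-variable data (illustrative assumption):
# X lives near a k-dimensional subspace, y depends on the latent factors.
n, p, k = 600, 50, 5
W = rng.normal(size=(k, p))
L = rng.normal(size=(n, k))
X = L @ W + 0.1 * rng.normal(size=(n, p))
y = L @ rng.normal(size=k) + 0.5 * rng.normal(size=n)

# Three-way split: fit / calibrate / test
X_tr, y_tr = X[:300], y[:300]
X_cal, y_cal = X[300:450], y[300:450]
X_te, y_te = X[450:], y[450:]

def pcr_fit(X, y, k):
    """Principal component regression: project onto top-k PCs, then least squares."""
    mu_x, mu_y = X.mean(axis=0), y.mean()
    Xc = X - mu_x
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:k].T                     # top-k principal directions
    T = Xc @ V                       # latent scores
    coef = np.linalg.lstsq(T, y - mu_y, rcond=None)[0]
    return mu_x, mu_y, V, coef

def pcr_predict(model, X):
    mu_x, mu_y, V, coef = model
    return mu_y + (X - mu_x) @ V @ coef

model = pcr_fit(X_tr, y_tr, k)

# Split conformal calibration: the ceil((n+1)(1-alpha))-th smallest
# absolute residual gives a finite-sample-valid interval half-width.
alpha = 0.05
resid = np.sort(np.abs(y_cal - pcr_predict(model, X_cal)))
n_cal = len(resid)
idx = int(np.ceil((n_cal + 1) * (1 - alpha))) - 1
q = resid[min(idx, n_cal - 1)]

# Prediction intervals on the test set and their empirical coverage
pred = pcr_predict(model, X_te)
lo, hi = pred - q, pred + q
coverage = float(np.mean((y_te >= lo) & (y_te <= hi)))
print(f"empirical coverage at 95% nominal level: {coverage:.3f}")
```

With a calibration set of 150 points, the empirical test coverage fluctuates around the nominal 95%, which mirrors the coverage behaviour the abstract reports; the guarantee is marginal (averaged over inputs), which is why the paper argues for input-dependent interval widths.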