AI Summary
This paper addresses the failure of variable selection in Bayesian multivariate linear regression in low-information settings: weak signals, small sample sizes, and high correlation among predictors and among responses. It demonstrates that jointly estimating the regression coefficients and the off-diagonal elements of the error covariance matrix can worsen estimation and degrade predictive performance. To mitigate this, the paper proposes a two-step strategy: first, estimate the mean structure (i.e., the regression coefficients) under a diagonal error covariance assumption; second, estimate the residual dependence separately. Simulation studies and an empirical analysis of NIR spectroscopy data indicate that the two-step procedure improves variable selection accuracy, coefficient estimation, and out-of-sample prediction in low-information regimes. The key contribution is identifying the "overfitting risk" inherent in full error covariance modeling and showing that decoupling mean and covariance estimation offers a favorable trade-off between robustness and statistical efficiency.
Abstract
We consider the problem of variable selection in Bayesian multivariate linear regression models, involving multiple response and predictor variables, under multivariate normal errors. In the absence of a known covariance structure, specifying a model with a non-diagonal covariance matrix is appealing. Modeling dependency in the random errors through a non-diagonal covariance matrix is generally expected to lead to improved estimation of the regression coefficients. In this article, we highlight an interesting exception: modeling the dependency in errors can significantly worsen both estimation and prediction. We demonstrate that Bayesian multi-outcome regression models using several popular variable selection priors can suffer from poor estimation properties in low-information settings--such as scenarios with weak signals, high correlation among predictors and responses, and small sample sizes. In such cases, the simultaneous estimation of all unknown parameters in the model becomes difficult when using a non-diagonal covariance matrix. Through simulation studies and a dataset with measurements from NIR spectroscopy, we illustrate that a two-step procedure--estimating the mean and the covariance matrix separately--can provide more accurate estimates in such cases. Thus, a potential solution to avoid the problem altogether is to routinely perform an additional analysis with a diagonal covariance matrix, even if the errors are expected to be correlated.
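The two-step idea can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a simple conjugate normal (ridge-type) prior in place of the variable selection priors studied in the paper, and the function name `two_step_fit`, the parameter `prior_precision`, and the simulated data are all hypothetical. Step 1 computes the posterior mean of the coefficients as if the error covariance were diagonal, which reduces to a separate ridge fit per response. Step 2 estimates the error covariance from the resulting residuals.

```python
import numpy as np


def two_step_fit(X, Y, prior_precision=1.0):
    """Illustrative two-step fit (an assumption, not the paper's exact method).

    Step 1: coefficient posterior mean under a diagonal error covariance,
            using a N(0, sigma_k^2 / prior_precision * I) prior per response.
    Step 2: error covariance estimated separately from the step-1 residuals.
    """
    n, p = X.shape
    # With a diagonal error covariance the responses decouple, so each column
    # of Y gets the same ridge-type solve: (X'X + lambda * I)^{-1} X'y_k.
    gram = X.T @ X + prior_precision * np.eye(p)
    B_hat = np.linalg.solve(gram, X.T @ Y)  # shape (p, q)

    # Step 2: model residual dependence only after the mean is fixed.
    residuals = Y - X @ B_hat
    Sigma_hat = residuals.T @ residuals / n  # shape (q, q)
    return B_hat, Sigma_hat


# Hypothetical low-information setting: few samples, correlated errors.
rng = np.random.default_rng(0)
n, p, q = 30, 5, 3
X = rng.standard_normal((n, p))
B_true = np.zeros((p, q))
B_true[0, :] = 0.5  # only the first predictor carries signal
Sigma_true = 0.8 * np.ones((q, q)) + 0.2 * np.eye(q)
errors = rng.standard_normal((n, q)) @ np.linalg.cholesky(Sigma_true).T
Y = X @ B_true + errors

B_hat, Sigma_hat = two_step_fit(X, Y, prior_precision=1.0)
print(B_hat.shape, Sigma_hat.shape)
```

The point of the sketch is only the ordering: the coefficient estimate never depends on the off-diagonal covariance entries, so errors in estimating those entries cannot propagate into the mean structure.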