🤖 AI Summary
This paper addresses the lack of theoretical foundations for variable importance assessment in machine learning. We systematically analyze the behavior of two widely used methods—Permute-and-Predict (PaP) and Leave-One-Covariate-Out (LOCO)—under linear regression. First, we derive closed-form expressions for both measures under general covariance structures. We show that PaP importance depends solely on the product of a feature’s true regression coefficient and its marginal standard deviation, rendering it invariant to multicollinearity; in contrast, LOCO importance is attenuated by covariate correlations due to its reliance on conditional variance estimation. Using square-root transformations, theoretical analysis, and Monte Carlo simulations, we quantify the effects of true coefficients, dimensionality, and covariance structure, and extend key insights to nonlinear models such as random forests. Our results establish the first unified theoretical framework for interpreting variable importance and provide practical diagnostic guidelines for method selection.
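To make the PaP closed form concrete, here is a minimal simulation sketch (our own illustration, not the paper's code). With independent standard-normal predictors, computing PaP as the square root of the validation-MSE increase after permuting a feature—one reading of the square-root transformation mentioned above—should recover roughly $\beta_i \sqrt{2\operatorname{Var}(x_i)} = \beta_i\sqrt{2}$. The sample size and coefficient vector are arbitrary choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta, sigma = 100_000, np.array([2.0, 1.0, 0.5]), 1.0

# Independent standard-normal predictors, so Var(x_i) = 1 for every feature.
X_tr = rng.normal(size=(n, 3))
y_tr = X_tr @ beta + sigma * rng.normal(size=n)
X_v = rng.normal(size=(n, 3))
y_v = X_v @ beta + sigma * rng.normal(size=n)

# Ordinary least squares fit on the training sample (no intercept needed:
# everything is zero-mean by construction).
coef, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)

mse = lambda e: float(np.mean(e**2))
base = mse(y_v - X_v @ coef)

# Permute-and-Predict: shuffle one validation column at a time and take the
# square root of the resulting MSE increase.
pap = np.empty(3)
for i in range(3):
    Xp = X_v.copy()
    Xp[:, i] = rng.permutation(Xp[:, i])  # break x_i's link to y
    pap[i] = np.sqrt(mse(y_v - Xp @ coef) - base)

print(pap)                 # close to beta * sqrt(2) since Var(x_i) = 1
print(beta * np.sqrt(2))
```

With uncorrelated features the permuted-column prediction error is roughly $\beta_i(x_i - x_i') + \varepsilon$, whose variance exceeds the baseline by $2\beta_i^2\operatorname{Var}(x_i)$, which is why the square-root scale lands on $\beta_i\sqrt{2\operatorname{Var}(x_i)}$.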
📝 Abstract
In many machine learning problems, understanding variable importance is a central concern. Two common approaches are Permute-and-Predict (PaP), which randomly permutes a feature in a validation set, and Leave-One-Covariate-Out (LOCO), which retrains models after permuting a training feature. Both methods deem a variable important if predictions with the original data substantially outperform those with permutations. In linear regression, empirical studies have linked PaP to regression coefficients and LOCO to $t$-statistics, but a formal theory has been lacking. We derive closed-form expressions for both measures, expressed using square-root transformations. PaP is shown to be proportional to the coefficient and predictor variability: $\text{PaP}_i = \beta_i \sqrt{2\operatorname{Var}(\mathbf{x}^v_i)}$, while LOCO is proportional to the coefficient but dampened by collinearity (captured by $\Delta$): $\text{LOCO}_i = \beta_i (1 - \Delta)\sqrt{1 + c}$. These derivations explain why PaP is largely unaffected by multicollinearity, whereas LOCO is highly sensitive to it. Monte Carlo simulations confirm these findings across varying levels of collinearity. Although derived for linear regression, we also show that these results provide reasonable approximations for models like Random Forests. Overall, this work establishes a theoretical basis for two widely used importance measures, helping analysts understand how they are affected by the true coefficients, dimension, and covariance structure. This work bridges empirical evidence and theory, enhancing the interpretability and application of variable importance measures.
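The contrast between the two measures under collinearity can be checked directly with a small simulation. The sketch below is our own illustration under stated assumptions: LOCO is taken here to mean refitting without the covariate and measuring the square-root-scale MSE increase (the small-sample term $c$ is assumed negligible at this sample size), and the correlation level $\rho$ and coefficients are arbitrary demo values.

```python
import numpy as np

rng = np.random.default_rng(1)
n, beta, rho = 200_000, np.array([2.0, 1.0]), 0.9

# Two unit-variance predictors with correlation rho, via a Cholesky factor.
L = np.linalg.cholesky(np.array([[1.0, rho], [rho, 1.0]]))
X_tr = rng.normal(size=(n, 2)) @ L.T
X_v = rng.normal(size=(n, 2)) @ L.T
y_tr = X_tr @ beta + rng.normal(size=n)
y_v = X_v @ beta + rng.normal(size=n)

mse = lambda e: float(np.mean(e**2))
coef, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)
base = mse(y_v - X_v @ coef)

# PaP for x_1: permute it in the validation set only, keep the fitted model.
Xp = X_v.copy()
Xp[:, 0] = rng.permutation(Xp[:, 0])
pap_1 = np.sqrt(mse(y_v - Xp @ coef) - base)

# LOCO for x_1: refit the model without it and compare validation MSEs.
coef_r, *_ = np.linalg.lstsq(X_tr[:, 1:], y_tr, rcond=None)
loco_1 = np.sqrt(mse(y_v - X_v[:, 1:] @ coef_r) - base)

print(pap_1)   # stays near beta_1 * sqrt(2), despite rho = 0.9
print(loco_1)  # attenuated well below beta_1 = 2
```

Permuting $x_1$ leaves its marginal variance intact, so PaP keeps its full value; refitting without $x_1$ lets the correlated $x_2$ absorb most of its signal, so only the conditional-variance portion of $x_1$'s contribution shows up in LOCO—exactly the $(1-\Delta)$ dampening described in the abstract.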