AI Summary
This study addresses a bias in random forest variable importance scores caused by correlations among predictors, which often leads to collinear predictors being underestimated or masked. To mitigate this, the authors propose two computationally efficient strategies based on correlations conditional on the response variable: the first treats each variable individually and separates the variable of interest from the covariates correlated with it; the second clusters variables using their pairwise conditional correlations. Both approaches account for the correlation structure among predictors when importance is computed. Experiments indicate that both yield sensible corrections to variable importance, which supports more reliable model interpretation.
Abstract
Variable importance produced by Random Forests (RF) is widely used in statistical data analysis and plays an important role in tasks such as model interpretation, model selection and diagnosis, and cost-bounded learning. However, the calculation of variable importance in RF does not take into account the correlations among variables, so a variable that is correlated with many others tends to receive a lower importance index or to be completely masked (i.e., to receive an importance index near zero) by other strongly correlated variables. To prevent unwanted influence from correlated variables when calculating variable importance, we propose to group variables by their conditional correlations (conditional on the response variable). We explore two computationally efficient options: one treats each variable individually and separates the variable of interest from all variables correlated with it, while the other uses clustering to group variables according to their pairwise conditional correlations. Our experiments show that both lead to sensible corrections to the importance of variables.
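As a rough illustration of the two grouping options, the sketch below uses the first-order partial correlation given the response as a stand-in for "conditional correlation"; the abstract does not state the exact estimator, so this choice is an assumption. The function names, the 0.5 threshold, the connected-components grouping used in place of a specific clustering algorithm, and the synthetic data are all hypothetical and are not taken from the paper. The sketch covers only the grouping step, not the corrected importance calculation.

```python
import math
import random


def pearson(a, b):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def conditional_corr(xi, xj, y):
    """Partial correlation of xi and xj given y (assumed estimator)."""
    r_ij, r_iy, r_jy = pearson(xi, xj), pearson(xi, y), pearson(xj, y)
    return (r_ij - r_iy * r_jy) / math.sqrt((1 - r_iy ** 2) * (1 - r_jy ** 2))


def correlated_group(columns, y, target, threshold=0.5):
    """Option 1 (sketch): variables conditionally correlated with `target`."""
    return sorted(
        name for name in columns
        if name != target
        and abs(conditional_corr(columns[target], columns[name], y)) >= threshold
    )


def cluster_variables(columns, y, threshold=0.5):
    """Option 2 (sketch): connected components of the thresholded
    pairwise conditional-correlation graph, via union-find."""
    names = list(columns)
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path compression
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(conditional_corr(columns[a], columns[b], y)) >= threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical synthetic data: x2 is a near-copy of x1, x3 is unrelated noise.
rng = random.Random(0)
x1 = [rng.gauss(0, 1) for _ in range(500)]
x2 = [v + rng.gauss(0, 0.1) for v in x1]
x3 = [rng.gauss(0, 1) for _ in range(500)]
y = [v + rng.gauss(0, 1) for v in x1]
columns = {"x1": x1, "x2": x2, "x3": x3}

print(correlated_group(columns, y, "x1"))
print(cluster_variables(columns, y))
```

In a full pipeline, the variables returned for a given variable of interest would be handled together (for example, set aside or permuted jointly) when its importance is computed; how the paper does this is not specified in the abstract.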