Conditional variable importance for random forests

11 July 2008 | Carolin Strobl, Anne-Laure Boulesteix, Thomas Kneib, Thomas Augustin and Achim Zeileis
This article introduces a new method for assessing variable importance in random forests that corrects the bias of traditional measures in favor of correlated predictor variables. Random forests are widely used in high-dimensional data analysis because they can handle complex interactions and correlated predictors. However, conventional variable importance measures, such as permutation importance and Gini importance, tend to overestimate the importance of correlated variables. The authors identify two mechanisms responsible for this bias: (1) a preference for correlated predictors during tree construction and (2) an additional advantage for correlated variables that arises from the unconditional permutation scheme used to compute variable importance.
To address this issue, the authors propose a conditional permutation scheme that conditions on relevant covariates and therefore better reflects the true impact of each predictor variable. The approach is motivated by the need to distinguish between marginal and conditional effects of variables, which is crucial in non-experimental studies where predictor variables cannot be manipulated independently. The scheme is implemented by partitioning the feature space based on the fitted random forest model and permuting a predictor's values only within the resulting groups of observations, so that its association with the conditioning covariates is largely preserved.
The method is evaluated through simulations and applied to a real-world dataset on peptide-binding, where it identifies truly influential predictors more reliably than traditional methods, especially when predictors are correlated. It also remains more reliable when the number of randomly selected splitting variables (mtry) is varied. The conditional permutation importance is implemented in the R package 'party' and is freely available for use.
The study highlights the importance of considering conditional effects in variable importance assessment and provides a more accurate and reliable method for evaluating predictor importance in random forests.
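The difference between the unconditional and the conditional permutation scheme can be illustrated with a minimal Python sketch. It is not the authors' implementation (that is the R package 'party'), and several parts are assumptions made for illustration: the toy data, the hand-written `predict` function standing in for a fitted forest, the fixed bins on `x1` standing in for the partition that the paper derives from the fitted random forest, and the helper names `permute_within_groups` and `permutation_importance`.

```python
import math
import random
from collections import defaultdict


def permute_within_groups(values, groups, rng):
    """Shuffle values only among observations that share a group label."""
    index_by_group = defaultdict(list)
    for i, g in enumerate(groups):
        index_by_group[g].append(i)
    permuted = list(values)
    for idx in index_by_group.values():
        shuffled = [values[i] for i in idx]
        rng.shuffle(shuffled)
        for i, v in zip(idx, shuffled):
            permuted[i] = v
    return permuted


def mse(predict, rows, y):
    """Mean squared error of predict() over the rows."""
    return sum((predict(r) - t) ** 2 for r, t in zip(rows, y)) / len(y)


def permutation_importance(predict, rows, y, j, groups=None, seed=0):
    """Increase in MSE after permuting column j (within groups, if given)."""
    rng = random.Random(seed)
    if groups is None:
        groups = [0] * len(rows)  # one group: ordinary unconditional permutation
    permuted = permute_within_groups([r[j] for r in rows], groups, rng)
    permuted_rows = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, permuted)]
    return mse(predict, permuted_rows, y) - mse(predict, rows, y)


# Toy data (hypothetical): x2 is a noisy copy of x1, and y depends on x1 only.
data_rng = random.Random(42)
rows = []
for _ in range(2000):
    x1 = data_rng.gauss(0.0, 1.0)
    rows.append([x1, x1 + data_rng.gauss(0.0, 0.1)])
y = [r[0] for r in rows]


def predict(row):
    # Stand-in for a fitted forest that spreads weight over both predictors.
    return 0.5 * row[0] + 0.5 * row[1]


# Stand-in partition (assumption): bins of width 0.5 on the correlated covariate x1.
x1_bins = [math.floor(r[0] / 0.5) for r in rows]

unconditional = permutation_importance(predict, rows, y, j=1)
conditional = permutation_importance(predict, rows, y, j=1, groups=x1_bins)
print(f"unconditional: {unconditional:.3f}, conditional: {conditional:.3f}")
```

In this toy setting `x2` has no effect on `y` once `x1` is known. Permuting it across all observations breaks its link to `x1` and produces a large increase in error, whereas permuting it only within bins of `x1` keeps that link approximately intact and yields a much smaller importance value.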