2019 November 15; 134: 93–101 | Jaime Lynn Speiser, Michael E. Miller, Janet Tooze, Edward Ip
This study evaluates random forest variable selection methods for classification modeling using 311 publicly available datasets. The authors compare prediction error rates, number of variables selected, computation times, and area under the receiver operating characteristic curve (AUC) across dataset types (binary outcomes, many predictors, and imbalanced outcomes) and method families (standard versus conditional random forest, test-based versus performance-based). For most datasets, the best-performing methods are Jiang's method and the method implemented in the *VSURF* package. For datasets with many predictors, the *varSelRF* and *Boruta* methods are recommended for their computational efficiency. The study offers a comprehensive assessment of variable selection techniques in random forest classification, with guidance on choosing an appropriate method based on dataset characteristics.
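To illustrate the general idea of a performance-based approach (not any specific method from the study, which were implemented in R packages such as VSURF, varSelRF, and Boruta), the sketch below shows a simple backward-elimination loop in Python with scikit-learn: the least important variable is dropped as long as cross-validated accuracy does not degrade. The tolerance, tree count, and stopping rule are arbitrary assumptions for demonstration.

```python
# Hypothetical sketch of performance-based variable selection with a
# random forest; NOT the procedure used by any method in the study.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def select_variables(X, y, min_features=2, tol=0.01, seed=0):
    """Iteratively drop the least important variable while
    cross-validated accuracy stays within `tol` of the best seen."""
    keep = list(range(X.shape[1]))
    rf = RandomForestClassifier(n_estimators=100, random_state=seed)
    best_score = cross_val_score(rf, X[:, keep], y, cv=5).mean()
    while len(keep) > min_features:
        rf.fit(X[:, keep], y)
        # Index of the least important remaining variable
        worst = keep[int(np.argmin(rf.feature_importances_))]
        trial = [j for j in keep if j != worst]
        score = cross_val_score(rf, X[:, trial], y, cv=5).mean()
        if score + tol < best_score:  # performance degraded: stop
            break
        keep, best_score = trial, max(best_score, score)
    return keep

# Synthetic example: 10 predictors, only 3 informative
X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=3, random_state=1)
selected = select_variables(X, y)
print(f"{len(selected)} of {X.shape[1]} variables kept")
```

As the study notes, performance-based loops like this repeatedly refit the forest, which is why such methods tend to be slower than test-based alternatives on datasets with many predictors.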