2010 | Robin Genuer, Jean-Michel Poggi, Christine Tuleau-Malot
This paper proposes using random forests to address two classical issues in variable selection: identifying the important variables for interpretation, and designing a good prediction model. Its main contributions are insights into the behavior of random forest variable importance indices and a two-step selection strategy: the variables are first ranked by their random forest importance scores, then introduced through a stepwise ascending procedure. The first step is common to both objectives, while the second depends on whether the goal is interpretation or prediction. The paper examines how sensitive variable importance is to the sample size, the number of variables, and the method's parameters, and how it behaves in the presence of highly correlated variables. The procedure is applied to simulated and real datasets, including the Prostate and ozone datasets, and its application to high-dimensional data is also discussed. The reported results indicate that the method performs well for both interpretation and prediction, and that the importance scores are a reliable basis for variable selection. The paper concludes with directions for future research, including further theoretical analysis of random forests and possible improvements to variable importance measures.
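The two-step idea summarized above (rank, then introduce variables stepwise) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the `min_gain` threshold rule, and the toy numbers are all assumptions made for illustration; `error_of` is a hypothetical stand-in for whatever error estimate one would use in practice, such as the out-of-bag error of a random forest fitted on the candidate variables.

```python
from typing import Callable, Dict, List, Sequence


def rank_by_importance(importance: Dict[str, float]) -> List[str]:
    """Step 1: order variables from most to least important."""
    return sorted(importance, key=importance.get, reverse=True)


def stepwise_ascending_selection(
    ranked: Sequence[str],
    error_of: Callable[[List[str]], float],
    min_gain: float = 0.0,
) -> List[str]:
    """Step 2: introduce variables in ranked order, keeping a variable only
    if it lowers the estimated error by more than `min_gain` (assumed rule)."""
    selected: List[str] = []
    best_error = float("inf")
    for name in ranked:
        candidate = selected + [name]
        err = error_of(candidate)
        if best_error - err > min_gain:
            selected, best_error = candidate, err
    return selected


# Toy illustration with made-up numbers (not results from the paper).
toy_importance = {"x1": 0.40, "x2": 0.05, "x3": 0.25, "x4": 0.01}
useful = {"x1": 0.30, "x3": 0.15}  # hypothetical error reduction per variable


def toy_error(variables: List[str]) -> float:
    # Hypothetical stand-in for a model's estimated prediction error.
    return 0.50 - sum(useful.get(v, 0.0) for v in variables)


ranked = rank_by_importance(toy_importance)
selected = stepwise_ascending_selection(ranked, toy_error, min_gain=0.01)
print(ranked)    # ['x1', 'x3', 'x2', 'x4']
print(selected)  # ['x1', 'x3']
```

In this toy run, `x2` and `x4` are examined in rank order but rejected because adding them does not reduce the error. In a real setting, both the importance scores and the error estimate would come from fitted random forests and would vary from run to run.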