September 2010 | Miron B. Kursa, Witold R. Rudnicki
The article introduces the R package Boruta, which implements a novel feature selection algorithm for finding all relevant variables in a dataset. The algorithm is a wrapper around the Random Forest classification algorithm: it iteratively removes features that a statistical test shows to be less relevant than random probes. The method is particularly useful for understanding the mechanisms behind the subject of interest, because it identifies every attribute that is relevant for classification rather than only a non-redundant subset. The article discusses why the "all-relevant" problem is difficult and explains how Boruta addresses it by combining a wrapper algorithm with statistical testing. The package is available on CRAN, provides a convenient user interface, and is demonstrated on a real-world dataset (ozone) and an artificial one (Madelon), showing its effectiveness in reducing the number of attributes and improving classification accuracy.
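
The idea of comparing attributes against random probes can be illustrated with a small, self-contained sketch. It is not the Boruta implementation: it is written in Python rather than R, it replaces the Random Forest importance with a plain absolute-correlation score, and it uses a fixed number of runs without removing rejected attributes between them. All function names, parameters (`n_runs`, `alpha`), and the toy data are assumptions made for illustration only.

```python
import random
from math import comb


def abs_correlation(xs, ys):
    """Stand-in importance score (NOT Random Forest importance):
    absolute Pearson correlation between an attribute and the target."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return abs(cov) / (var_x * var_y) ** 0.5


def binomial_tail(hits, trials, p=0.5):
    """P(X >= hits) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, k) * p ** k * (1 - p) ** (trials - k)
        for k in range(hits, trials + 1)
    )


def shadow_probe_selection(columns, target, n_runs=30, alpha=0.01, seed=0):
    """Label each attribute 'confirmed', 'rejected', or 'tentative'.

    Hypothetical helper: in every run, each attribute gets a shuffled
    "shadow" copy (a random probe). An attribute scores a hit when its
    importance exceeds the best shadow importance of that run.
    """
    rng = random.Random(seed)
    hits = {name: 0 for name in columns}
    for _ in range(n_runs):
        shadow_scores = []
        for values in columns.values():
            shuffled = values[:]      # copy, so the real column is untouched
            rng.shuffle(shuffled)     # permuting breaks any link to the target
            shadow_scores.append(abs_correlation(shuffled, target))
        threshold = max(shadow_scores)
        for name, values in columns.items():
            if abs_correlation(values, target) > threshold:
                hits[name] += 1

    decisions = {}
    for name, h in hits.items():
        if binomial_tail(h, n_runs) < alpha:
            decisions[name] = "confirmed"   # beats the probes too often for chance
        elif binomial_tail(n_runs - h, n_runs) < alpha:
            # With p = 0.5, P(X <= h) equals P(X >= n_runs - h) by symmetry.
            decisions[name] = "rejected"
        else:
            decisions[name] = "tentative"
    return decisions


# Toy data: one informative attribute, a redundant near-copy, two noise columns.
data_rng = random.Random(42)
n_samples = 200
signal = [data_rng.gauss(0, 1) for _ in range(n_samples)]
columns = {
    "signal": signal,
    "signal_copy": [s + data_rng.gauss(0, 0.1) for s in signal],
    "noise_a": [data_rng.gauss(0, 1) for _ in range(n_samples)],
    "noise_b": [data_rng.gauss(0, 1) for _ in range(n_samples)],
}
target = [s + data_rng.gauss(0, 0.5) for s in signal]

decisions = shadow_probe_selection(columns, target)
print(decisions)
```

In this toy setup both `signal` and its redundant near-copy should be confirmed, which is the point of an all-relevant method: a redundant attribute is still reported as relevant instead of being discarded in favour of a minimal subset.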