Learning When Training Data are Costly: The Effect of Class Distribution on Tree Induction

Submitted 12/02; published 10/03 | Gary M. Weiss, Foster Provost
This article explores the impact of class distribution on the performance of classification trees, particularly in scenarios where training data are costly. The authors analyze 26 datasets to determine the best class distribution for learning. They find that the naturally occurring class distribution generally performs well when classifiers are evaluated by undifferentiated error rate (0/1 loss), while a balanced distribution performs well when they are evaluated by the area under the ROC curve (AUC). To address the practical challenge of limited training data, the authors introduce a "budget-sensitive" progressive sampling algorithm that selects training examples class by class. Empirical results show that this algorithm yields classifiers with good classification performance.
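The article describes the budget-sensitive progressive sampling algorithm only at a high level here. A minimal sketch of such a loop, under assumptions of our own (all function names, the candidate distributions, and the geometric growth schedule are illustrative, not the paper's exact procedure), might look like this:

```python
def budgeted_progressive_sample(procure_pos, procure_neg, evaluate, budget,
                                candidate_fracs=(0.25, 0.5, 0.75),
                                growth=2.0, n0=8):
    """Illustrative sketch of a budget-sensitive progressive sampling loop.

    procure_pos(k) / procure_neg(k) fetch k new minority / majority examples
    (each procurement is assumed to have a cost, hence the budget).
    evaluate(frac, pos, neg) should train a classifier using a class
    distribution with minority fraction `frac` drawn from the procured
    examples and return a score such as AUC on held-out data.
    """
    pos, neg = [], []
    best_frac = 0.5                       # start from a balanced distribution
    n = n0
    while len(pos) + len(neg) < budget:
        n = min(int(n * growth), budget)  # geometric sampling schedule
        want_pos = int(round(best_frac * n))
        # procure only the examples still missing for the target distribution,
        # so everything bought so far stays usable in later rounds
        pos.extend(procure_pos(max(0, want_pos - len(pos))))
        neg.extend(procure_neg(max(0, (n - want_pos) - len(neg))))
        # re-estimate the best class distribution from the data so far
        best_frac = max(candidate_fracs,
                        key=lambda f: evaluate(f, pos, neg))
    return pos, neg, best_frac
```

The key design point, as in the paper, is that the budget is spent incrementally: each round steers new procurement toward the class distribution that currently looks best, rather than fixing the distribution before any data are seen.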
The study provides guidelines for selecting an appropriate class distribution and highlights the importance of adjusting a classifier's posterior probability estimates to account for differences between the training and test class distributions. The findings are relevant wherever training data are costly to procure or transform, as in fraud detection.
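The paper applies a correction of this kind to the probability estimates at the tree's leaves; a minimal sketch of the underlying prior-shift adjustment (function name and interface are our own, not the paper's) could be:

```python
def correct_posterior(p, train_prior, test_prior):
    """Rescale a posterior estimate p = P(y=1 | x) learned under class prior
    train_prior so that it reflects a different deployment prior test_prior.

    Standard prior-shift correction: multiply the posterior odds by the
    ratio of the class-prior odds, assuming P(x | y) is unchanged between
    training and test. Assumes 0 < p < 1 and priors strictly in (0, 1).
    """
    w = (test_prior / (1.0 - test_prior)) / (train_prior / (1.0 - train_prior))
    odds = w * p / (1.0 - p)
    return odds / (1.0 + odds)
```

For example, a score of 0.5 from a model trained on a balanced (50/50) sample corresponds to a corrected probability of 0.1 when the deployment prior for the positive class is 10%; when training and test priors match, the estimate is unchanged.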