2003 | Nitesh V. Chawla, Aleksandar Lazarevic, Lawrence O. Hall, and Kevin W. Bowyer
SMOTEBoost is an approach for learning from imbalanced data sets that combines the SMOTE algorithm with the boosting procedure. Unlike standard boosting, which assigns equal weight to all misclassified examples, SMOTEBoost creates synthetic examples from the rare or minority class, thereby indirectly changing the updating weights and compensating for skewed distributions. Applied to several highly and moderately imbalanced data sets, SMOTEBoost improves prediction performance on the minority class and yields better overall F-values.
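The round structure described above can be sketched as follows. This is a deliberately simplified illustration, not the paper's algorithm: the actual SMOTEBoost builds on AdaBoost.M2, whose pseudo-loss bookkeeping is omitted here, and the `smote` and `weak_learner` arguments are hypothetical placeholders. The key idea it shows is that synthetic minority examples are injected before each round's training and then discarded, so only the original examples' weights are updated.

```python
def smoteboost_rounds(X, y, T, smote, weak_learner):
    """Simplified sketch of a SMOTEBoost-style loop (assumes the weak
    learner makes at least one error each round, so err stays in (0, 1))."""
    n = len(X)
    weights = [1.0 / n] * n
    ensemble = []
    for t in range(T):
        # Inject synthetic minority examples for this round only.
        X_syn, y_syn = smote(X, y)
        w_syn = [1.0 / n] * len(X_syn)
        h = weak_learner(X + X_syn, y + y_syn, weights + w_syn)
        # Weight update runs over the ORIGINAL examples only; the
        # synthetic ones are discarded after training the hypothesis.
        err = sum(w for w, xi, yi in zip(weights, X, y) if h(xi) != yi)
        beta = err / (1.0 - err)
        weights = [w * (beta if h(xi) == yi else 1.0)
                   for w, xi, yi in zip(weights, X, y)]
        z = sum(weights)
        weights = [w / z for w in weights]
        ensemble.append((h, beta))
    return ensemble
```

Because correctly classified examples are down-weighted by `beta`, misclassified (often minority) examples gain relative weight each round, and the injected synthetic examples push the weak learner toward the minority region to begin with.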
Rare events occur very infrequently, yet classifying them is a common problem in many domains, such as detecting fraudulent transactions, network intrusion detection, web mining, direct marketing, and medical diagnostics. In these scenarios the majority class typically represents 98-99% of the population, so a trivial classifier that labels everything with the majority class achieves high accuracy. For domains with imbalanced and/or skewed distributions, classification accuracy is therefore not a sufficient performance measure; metrics such as precision, recall, and the F-value are used instead to evaluate performance on the minority class.
A confusion matrix is typically used to evaluate a machine learning algorithm on rare class problems. Taking class "c" as the minority class of interest and "nc" as the conjunction of all other classes, there are four possible outcomes when detecting class "c". Recall, precision, and the F-value are all defined from these counts; because the F-value incorporates both precision and recall, it serves as a single measure of the "goodness" of a learning algorithm on the minority class.
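The standard definitions from the confusion matrix can be written down directly. A minimal sketch, assuming the four counts for class "c" (true positives, false negatives, false positives) are already tallied; the function name and `beta` parameter (which trades off recall against precision, with `beta=1` weighting them equally) are illustrative:

```python
def minority_metrics(tp, fn, fp, beta=1.0):
    """Precision, recall, and F-value for the minority class,
    computed from confusion-matrix counts."""
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_value = ((1 + beta**2) * precision * recall
               / (beta**2 * precision + recall))
    return precision, recall, f_value
```

For example, with 40 true positives, 10 false negatives, and 20 false positives, recall is 0.8 and precision is about 0.667; accuracy could still look excellent on such data even when these minority-class numbers are poor, which is exactly why the F-value is preferred here.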
It is well known in machine learning that combining classifiers can be an effective technique for improving prediction accuracy. Boosting is one of the most popular combining techniques; it uses adaptive sampling of instances to generate a highly accurate ensemble of classifiers. Recent research has focused on embedding cost-sensitivity in the boosting algorithm, and SMOTEBoost is a novel approach that improves the prediction of the minority class within boosting.
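The SMOTE half of the combination creates new minority examples by interpolating between an existing minority example and one of its nearest minority-class neighbors. A minimal sketch under stated assumptions: points are plain lists of floats, the k-nearest-neighbor search is brute force, and the function name and parameters are illustrative rather than taken from any library:

```python
import random

def smote(minority, k=2, n_synthetic=4, seed=0):
    """Naive SMOTE sketch: repeatedly pick a random minority point,
    find its k nearest minority neighbors (brute force), and generate
    a synthetic point on the segment between the point and one neighbor."""
    rng = random.Random(seed)

    def dist2(a, b):
        # Squared Euclidean distance; square root is unnecessary for ranking.
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    synthetic = []
    for _ in range(n_synthetic):
        x = rng.choice(minority)
        neighbors = sorted((p for p in minority if p is not x),
                           key=lambda p: dist2(x, p))[:k]
        nn = rng.choice(neighbors)
        gap = rng.random()  # random position along the segment
        synthetic.append([xi + gap * (ni - xi) for xi, ni in zip(x, nn)])
    return synthetic
```

Because each synthetic point lies between two real minority examples, the generated examples stay inside the minority region rather than being exact duplicates, which is what lets them shift the boosting weights toward that class.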