SMOTE: Synthetic Minority Over-sampling Technique

Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, W. Philip Kegelmeyer | Submitted 09/01; published 06/02
The paper introduces the Synthetic Minority Over-sampling Technique (SMOTE), a method for addressing imbalanced datasets in machine learning. Imbalanced datasets, in which one class vastly outnumbers the other, are common in applications such as fraud detection and medical diagnosis. The authors propose combining under-sampling of the majority class with over-sampling of the minority class to improve classifier performance.

Rather than duplicating existing minority examples, SMOTE creates synthetic ones: for each minority class sample, new examples are generated along the line segments joining it to its k nearest minority class neighbors. This leads the learner to form larger, less specific decision regions for the minority class, improving generalization on that class.

The paper evaluates SMOTE on a variety of datasets with three classifiers (C4.5, Ripper, and Naive Bayes) and compares it against under-sampling, plain over-sampling with replacement, and adjusting loss ratios or class priors. Measured by the Area Under the ROC Curve (AUC) and the ROC convex hull strategy, SMOTE combined with under-sampling generally dominates these alternatives in ROC space. The paper closes with future work, including extensions of SMOTE to handle mixed (continuous and nominal) feature types and potential applications in information retrieval.
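To make the sampling step concrete, here is a minimal NumPy sketch of the synthetic-example generation described above. The function name `smote_sample` and its parameters are illustrative, not from the paper, and the sketch assumes the minority class has at least k + 1 samples.

```python
import numpy as np

def smote_sample(X_min, n_synthetic, k=5, rng=None):
    """Generate synthetic minority samples by interpolating between each
    minority point and one of its k nearest minority-class neighbors.

    Illustrative sketch; assumes X_min has at least k + 1 rows.
    """
    rng = np.random.default_rng(rng)
    n, d = X_min.shape

    # Pairwise distances within the minority class only.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)            # exclude each point itself
    neighbors = np.argsort(dists, axis=1)[:, :k]  # k nearest minority neighbors

    synthetic = np.empty((n_synthetic, d))
    for i in range(n_synthetic):
        j = rng.integers(n)                    # pick a random minority sample
        nn = neighbors[j, rng.integers(k)]     # pick one of its k neighbors
        gap = rng.random()                     # random point along the segment
        synthetic[i] = X_min[j] + gap * (X_min[nn] - X_min[j])
    return synthetic

# Example usage: oversample 20 minority points in 2-D up to 3x their count.
X_min = np.random.default_rng(0).normal(size=(20, 2))
X_new = smote_sample(X_min, n_synthetic=40, k=5, rng=0)
```

For real use, the imbalanced-learn library provides a maintained implementation (`imblearn.over_sampling.SMOTE`, applied via `fit_resample(X, y)`), which postdates the paper and also covers the nominal-feature variants it proposes.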