2004 | Rehan Akbani, Stephen Kwek, and Nathalie Japkowicz
This paper discusses the challenges of applying Support Vector Machines (SVMs) to imbalanced datasets, where negative instances far outnumber positive ones. The authors examine why common remedies such as undersampling the majority class are often ineffective, and propose a new algorithm, SMOTE with Different Costs (SDC), which combines the Synthetic Minority Over-sampling Technique (SMOTE) with class-specific error costs to improve SVM performance. They compare SDC against regular SVM, undersampling, and SMOTE alone on several UCI datasets. SDC outperforms all the other methods on the g-means metric (the geometric mean of sensitivity and specificity), demonstrating its effectiveness on imbalanced data. The paper also highlights a key limitation of undersampling: it discards potentially valuable information carried by the majority class.
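To make the oversampling half of SDC concrete, here is a minimal, hypothetical sketch of SMOTE's interpolation step: each synthetic point is placed on the line segment between a minority sample and one of its k nearest minority-class neighbours. The `smote` function and its parameter names are my own illustration, not the paper's code; in full SDC this oversampling would be paired with class-specific misclassification penalties (different C values for positive and negative errors) when training the SVM.

```python
import math
import random

def smote(minority, n_synthetic, k=3, seed=0):
    """Generate synthetic minority samples a la SMOTE: interpolate between
    a randomly chosen minority point and one of its k nearest minority
    neighbours. `minority` is a list of equal-length coordinate tuples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        x = minority[i]
        # k nearest minority-class neighbours of x (excluding x itself)
        neighbours = sorted(
            (p for j, p in enumerate(minority) if j != i),
            key=lambda p: math.dist(x, p),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

# Four minority points at the corners of the unit square; all synthetic
# points land inside that square because they are convex combinations.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote(minority, n_synthetic=8)
```

Because the new points are interpolations rather than copies, they densify the minority region instead of merely re-weighting existing samples, which is what lets the SVM's margin shift without discarding majority-class data.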