Applying Support Vector Machines to Imbalanced Datasets


2004 | Rehan Akbani, Stephen Kwek, and Nathalie Japkowicz
This paper examines the challenges of applying Support Vector Machines (SVMs) to imbalanced datasets, where negative instances far outnumber positive ones. The authors argue that common remedies such as undersampling the majority class are not well suited to SVMs, and they propose an algorithm that combines a variant of the SMOTE oversampling algorithm with a different-error-costs approach to improve performance on imbalanced data.

SVMs are effective in many applications but degrade on imbalanced data because they rely on maximizing the margin between classes: on a heavily skewed dataset, the simplest consistent hypothesis often classifies every instance as negative, yielding poor performance on the minority class. The authors identify three main causes of this performance loss: positive instances lying farther from the ideal boundary, the weakness of soft margins on skewed data, and an imbalanced ratio of positive to negative support vectors.

The paper evaluates several remedies, including undersampling, oversampling, and different error costs. It shows that undersampling discards potentially useful information, while oversampling improves performance. The proposed algorithm, SMOTE with Different Costs (SDC), combines SMOTE oversampling of the minority class with separate misclassification costs for the two classes to better handle imbalanced data. Compared against the other methods on multiple UCI datasets, SDC outperforms all of them on the g-means metric, the geometric mean of sensitivity and specificity. The authors attribute the gains to improvements in both the distance and the orientation of the separating hyperplane, highlighting the importance of accounting for both when learning from imbalanced data.
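The "different error costs" component is commonly formulated as a cost-sensitive soft-margin objective in which the slack variables of the two classes receive separate penalties \(C^{+}\) and \(C^{-}\) (the formulation due to Veropoulos et al., on which such approaches build); assigning a larger \(C^{+}\) makes errors on positive (minority) instances more expensive:

```latex
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \;
\frac{1}{2}\lVert \mathbf{w} \rVert^{2}
+ C^{+} \sum_{i:\, y_i = +1} \xi_i
+ C^{-} \sum_{i:\, y_i = -1} \xi_i
\quad \text{s.t.} \quad
y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1 - \xi_i,\;\; \xi_i \ge 0.
```

A common heuristic, assumed here for illustration, is to set the ratio \(C^{+}/C^{-}\) equal to the majority-to-minority class ratio, so that the aggregate penalty contributed by each class is balanced.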
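To make the SMOTE component concrete, the sketch below generates synthetic minority instances by interpolating between minority points. This is a simplified illustration, not the paper's exact variant: the original SMOTE interpolates toward one of the k nearest minority neighbours, whereas here any other minority point stands in for a neighbour.

```python
import random

def smote_oversample(minority, n_synthetic, rng=None):
    """Generate n_synthetic points by linear interpolation between a
    minority instance and another randomly chosen minority instance.

    Simplified sketch of SMOTE: the real algorithm picks one of the
    k nearest minority neighbours; here any other minority point is
    used as the interpolation partner.
    """
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_synthetic):
        a = rng.choice(minority)
        b = rng.choice([p for p in minority if p is not a])
        gap = rng.random()  # position along the segment from a to b
        synthetic.append(tuple(x + gap * (y - x) for x, y in zip(a, b)))
    return synthetic
```

Because each synthetic point lies on a segment between two existing minority points, oversampling this way densifies the minority region rather than merely replicating instances, which is what lets it shift the learned boundary away from the minority class.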
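The g-means metric used for evaluation can be computed directly from a confusion matrix. A minimal implementation, assuming labels of +1 for the positive (minority) class and -1 for the negative class:

```python
import math

def g_means(y_true, y_pred):
    """Geometric mean of sensitivity (recall on the positive class)
    and specificity (recall on the negative class).

    Labels are assumed to be +1 (positive/minority) and -1 (negative).
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == -1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == -1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == 1)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)
```

Note why this metric suits imbalanced data: the trivial all-negative classifier has sensitivity 0, so its g-means is 0 regardless of how high its accuracy looks on a skewed test set.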