FEBRUARY 2009 | Yuchun Tang, Member, IEEE, Yan-Qing Zhang, Member, IEEE, Nitesh V. Chawla, Member, IEEE, and Sven Krasser, Member, IEEE
This correspondence presents a study of support vector machines (SVMs) for highly imbalanced classification. Traditional classification algorithms perform poorly on highly imbalanced data sets, which motivates sampling strategies and cost-sensitive learning. The authors adapt SVMs with cost-sensitive learning and with over- and undersampling, and compare these SVM-based strategies against state-of-the-art approaches on a variety of data sets using the G-mean, AUC-ROC, F-measure, and AUC-PR metrics.

The novel granular SVM with repetitive undersampling algorithm (GSVM-RU) performs best in terms of both effectiveness and efficiency. It is effective because repetitive undersampling minimizes information loss while maximizing data cleaning: each undersampling round extracts the most informative majority-class samples and eliminates redundant or noisy ones. It is efficient because the resulting models have fewer support vectors, which speeds up SVM prediction. The study also introduces a new "combine" aggregation operation and evaluates GSVM-RU against other SVM modeling techniques; GSVM-RU outperforms or matches the previous best algorithms on most data sets. The "discard" operation yields optimal performance on most data sets, while the "combine" operation is better for some metrics.

The authors also investigate cost-sensitive learning (SVM-WEIGHT) and oversampling with SMOTE (SVM-SMOTE). GSVM-RU is more efficient than both and performs well across metrics. The study concludes that GSVM-RU is a promising approach for highly imbalanced classification.
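The metrics the study uses are chosen precisely because plain accuracy is misleading under imbalance: G-mean is the geometric mean of sensitivity and specificity, and the F-measure is the harmonic mean of precision and recall. A minimal sketch of both in plain Python (the function name and toy counts are illustrative, not from the paper):

```python
import math

def gmean_fmeasure(tp, fn, fp, tn):
    """Compute G-mean and F-measure from confusion-matrix counts.

    G-mean = sqrt(sensitivity * specificity); F-measure is the
    harmonic mean of precision and recall. Both stay low when the
    minority class is ignored, unlike overall accuracy.
    """
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    gmean = math.sqrt(sensitivity * specificity)
    fmeasure = (2 * precision * sensitivity / (precision + sensitivity)
                if precision + sensitivity else 0.0)
    return gmean, fmeasure

# A classifier that labels everything negative scores 99% accuracy on
# a 1:99 imbalanced set, yet its G-mean and F-measure are both 0.
g, f = gmean_fmeasure(tp=0, fn=1, fp=0, tn=99)
print(g, f)  # 0.0 0.0
```

This is why the paper reports G-mean and AUC-type metrics rather than accuracy when comparing the SVM variants.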
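SVM-WEIGHT implements cost-sensitive learning by scaling the misclassification penalty of the minority class, typically by the imbalance ratio, so the optimizer cannot profitably ignore rare positives. A minimal sketch, assuming a linear SVM trained by subgradient descent on the class-weighted hinge loss (the training loop and toy data are illustrative; the paper uses standard SVM solvers):

```python
import numpy as np

def weighted_linear_svm(X, y, cost, lam=0.01, lr=0.1, epochs=300):
    """Train a linear SVM on the class-weighted hinge loss.

    y is in {-1, +1}; cost[label] scales the hinge penalty for that
    class, so cost[+1] > cost[-1] makes minority errors more expensive.
    """
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for epoch in range(epochs):
        step = lr / (1 + epoch)  # decaying step size
        for i in range(n):
            margin = y[i] * (X[i] @ w + b)
            # subgradient of lam/2*||w||^2 + cost * hinge(margin)
            if margin < 1:
                w -= step * (lam * w - cost[y[i]] * y[i] * X[i])
                b += step * cost[y[i]] * y[i]
            else:
                w -= step * lam * w
    return w, b

# Toy imbalanced set: 2 minority positives vs. 8 majority negatives.
X = np.array([[2.0, 2.0], [3.0, 3.0],
              [-1.0, -1.0], [-2.0, -1.0], [-1.0, -2.0], [-3.0, -2.0],
              [-2.0, -3.0], [-1.0, -3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1, 1, -1, -1, -1, -1, -1, -1, -1, -1])
costs = {1: 4.0, -1: 1.0}  # weight positives by the 8:2 imbalance ratio
w, b = weighted_linear_svm(X, y, costs)
pred = np.sign(X @ w + b)
print(int((pred == y).sum()), "of", len(y), "classified correctly")
```

Setting `cost[+1]` to the imbalance ratio is the common heuristic the SVM-WEIGHT strategy rests on; real solvers expose the same idea as per-class `C` values.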
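The GSVM-RU idea can be sketched concretely: each round trains an SVM, extracts the negative (majority-class) support vectors as an informative "granule", removes them from the majority pool, and repeats; the final model is built from all positive samples plus the extracted granules ("combine" keeps every granule, "discard" drops some). A rough sketch using scikit-learn as a stand-in SVM implementation (the toy data, fixed round count, and function name are illustrative assumptions, not the authors' code):

```python
import numpy as np
from sklearn.svm import SVC

def gsvm_ru_combine(X_pos, X_neg, rounds=2, kernel="linear"):
    """Repetitive undersampling with the 'combine' aggregation:
    each round extracts the negative support vectors (the most
    informative majority samples) and removes them from the pool;
    the final SVM is trained on all positives plus every granule."""
    remaining = X_neg.copy()
    granules = []
    for _ in range(rounds):
        X = np.vstack([X_pos, remaining])
        y = np.array([1] * len(X_pos) + [0] * len(remaining))
        clf = SVC(kernel=kernel).fit(X, y)
        # support_ holds indices into X; keep only majority-class ones
        neg_sv = [i - len(X_pos) for i in clf.support_ if i >= len(X_pos)]
        if not neg_sv:
            break
        granules.append(remaining[neg_sv])
        remaining = np.delete(remaining, neg_sv, axis=0)
    X_final = np.vstack([X_pos] + granules)
    y_final = np.array([1] * len(X_pos)
                       + [0] * sum(len(g) for g in granules))
    return SVC(kernel=kernel).fit(X_final, y_final), len(X_final)

# 3 minority positives vs. 16 majority negatives
rng = np.random.default_rng(0)
X_pos = np.array([[5.0, 5.0], [6.0, 5.0], [5.0, 6.0]])
X_neg = rng.normal(0.0, 0.7, size=(16, 2))
model, n_train = gsvm_ru_combine(X_pos, X_neg)
print(n_train, "training samples kept out of", len(X_pos) + len(X_neg))
print(model.predict(X_pos))  # minority samples still recognized
```

Because only support-vector negatives survive, the final training set (and hence the final model's support-vector count) shrinks, which is the source of GSVM-RU's prediction-time efficiency noted in the study.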