Support Vector Machines for Spam Categorization

SEPTEMBER 1999 | Harris Drucker, Senior Member, IEEE, Donghui Wu, Student Member, IEEE, and Vladimir N. Vapnik
This paper compares support vector machines (SVMs) with three other classification algorithms—Ripper, Rocchio, and boosting decision trees—on the task of classifying e-mail as spam or nonspam. The study uses two data sets: one restricted to the 1000 best features and one with over 7000 features.

The paper discusses the main design choices: feature representation, number of features, performance criteria, training and classification speed, and the choice of learning algorithm. Features are individual words, represented as term frequency (TF), TF-IDF weights, or binary indicators. The number of features is a critical factor, and the optimal number depends on the learning algorithm; for SVMs, using all features rather than a selected subset was preferable, as it did not degrade performance. Performance criteria include recall, precision, error rate, false alarm rate, and miss rate.

On both data sets, boosting trees and SVMs achieved acceptable accuracy and speed. Boosting algorithms, which combine many weak learners, had lower error rates but worse error dispersion; SVMs with binary features performed best in terms of error dispersion. The paper also weighs the advantages and disadvantages of each algorithm, including training time, classification speed, and the ability to handle imbalanced data.
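To illustrate the three feature representations the paper compares, here is a minimal Python sketch. The toy corpus and function names are hypothetical; this is not the authors' preprocessing pipeline, only the standard definitions of binary, TF, and TF-IDF word features.

```python
import math

# Hypothetical toy corpus; the paper's data sets used real e-mail messages.
corpus = [
    "free money free offer",
    "meeting schedule for monday",
    "free offer claim money now",
]

# Vocabulary: every distinct word across the corpus, in a fixed order.
vocab = sorted({w for doc in corpus for w in doc.split()})

def binary_features(doc):
    """1 if the word occurs in the message, 0 otherwise."""
    words = set(doc.split())
    return [1 if w in words else 0 for w in vocab]

def tf_features(doc):
    """Raw term frequency: number of occurrences of each vocabulary word."""
    words = doc.split()
    return [words.count(w) for w in vocab]

def tfidf_features(doc):
    """TF weighted by inverse document frequency over the corpus."""
    n_docs = len(corpus)
    idf = [math.log(n_docs / sum(1 for d in corpus if w in d.split()))
           for w in vocab]
    return [t * i for t, i in zip(tf_features(doc), idf)]
```

Binary features discard counts entirely, which is the representation the paper found worked best with SVMs.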
Training time for boosting decision trees was found to be inordinately long, while SVMs trained much faster. The paper concludes that SVMs are superior in terms of training time and error dispersion, making them the better choice for spam classification.
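As a rough illustration of the classifier at the core of the paper, the following sketch trains a linear SVM by subgradient descent on the regularized hinge loss. The learning rate, regularization constant, and toy data are arbitrary assumptions for the example, not the authors' solver or setup.

```python
import random

def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1, seed=0):
    """Fit (w, b) minimizing hinge loss + L2 penalty by subgradient descent.

    X: list of feature vectors (e.g. binary word indicators).
    y: labels in {-1, +1} (+1 = spam here).
    """
    rng = random.Random(seed)
    d = len(X[0])
    w = [0.0] * d
    b = 0.0
    idx = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(idx)
        for i in idx:
            margin = y[i] * (sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            if margin < 1:  # point inside the margin: hinge term is active
                w = [wj + lr * (y[i] * xj - lam * wj)
                     for wj, xj in zip(w, X[i])]
                b += lr * y[i]
            else:           # only the regularizer contributes
                w = [wj - lr * lam * wj for wj in w]
    return w, b

def predict(w, b, x):
    """Classify by the sign of the decision function."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1
```

Usage on a tiny two-feature example (feature 0 marks a spammy word): the learned hyperplane separates the two classes.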
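The five performance criteria the study reports can all be computed from the four confusion-matrix counts. A minimal sketch, with the convention (assumed here) that label 1 means spam and 0 means legitimate mail:

```python
def spam_metrics(y_true, y_pred):
    """Compute the performance criteria used in the paper.

    Labels: 1 = spam, 0 = nonspam (legitimate mail).
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return {
        "recall": tp / (tp + fn),            # fraction of spam caught
        "precision": tp / (tp + fp),         # fraction of flagged mail that is spam
        "error_rate": (fp + fn) / len(y_true),
        "false_alarm_rate": fp / (fp + tn),  # legitimate mail flagged as spam
        "miss_rate": fn / (fn + tp),         # spam that slips through
    }
```

Note that miss rate is 1 minus recall; the false alarm rate matters most in practice, since flagging legitimate mail is costlier than missing spam.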