A re-examination of text categorization methods


1999 | Yiming Yang and Xin Liu
This paper presents a controlled study, with statistical significance tests, of five text categorization methods: Support Vector Machines (SVM), k-Nearest Neighbor (kNN), a neural network (NNet), Linear Least-Squares Fit (LLSF), and Naive Bayes (NB). The study focuses on the robustness of these methods in handling skewed category distributions and on their performance as a function of training-set category frequency. Results show that SVM, kNN, and LLSF significantly outperform NNet and NB when the number of positive training instances per category is small (fewer than ten), while all methods perform comparably when categories are sufficiently common (over 300 instances).

The study uses the Reuters-21578 corpus, which contains 90 categories with widely varying frequencies. Performance is evaluated with standard metrics such as recall, precision, and the F1 measure, using both micro-averaging and macro-averaging. The results indicate that SVM, kNN, and LLSF outperform NB and NNet in most cases, particularly on rare categories; performance varies with category frequency, with the methods converging on common categories while NB and NNet struggle with rare ones.

The study also applies several statistical significance tests, including sign tests, t-tests, and rank-based tests, to compare the classifiers. These tests reveal that SVM, kNN, and LLSF are significantly better than NB and NNet, especially under skewed category distributions.
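To make the sign-test idea concrete, the sketch below runs a two-sided exact sign test over paired per-category F1 scores, in the spirit of a macro-level comparison between two classifiers. The F1 values, the classifier names A and B, and the helper sign_test are hypothetical illustrations, not the paper's data or its exact procedure.

# Illustrative sketch (hypothetical data): a two-sided exact sign test
# comparing two classifiers A and B on paired per-category F1 scores.
# Ties are discarded; under the null hypothesis, A beats B on each
# remaining pair with probability 0.5.
from math import comb

def sign_test(pairs):
    """Return (#wins for A, #non-tied pairs, two-sided p-value)."""
    wins = sum(1 for a, b in pairs if a > b)
    losses = sum(1 for a, b in pairs if a < b)
    n = wins + losses
    k = max(wins, losses)
    # Tail probability of a win/loss split at least this extreme.
    p = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return wins, n, min(1.0, 2 * p)

# Hypothetical per-category F1 pairs (classifier A, classifier B).
paired_f1 = [(0.82, 0.75), (0.40, 0.10), (0.55, 0.50), (0.91, 0.93),
             (0.30, 0.05), (0.66, 0.60), (0.20, 0.00), (0.77, 0.70)]

wins, n, p_value = sign_test(paired_f1)
print(f"A wins {wins}/{n} non-tied comparisons, two-sided p = {p_value:.3f}")

Because each category counts once regardless of how many test documents it has, a test of this kind stresses performance on rare categories, which is exactly where the paper finds the largest differences between methods.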
The results highlight the importance of considering both micro- and macro-level performance measures when evaluating text categorization methods. The study concludes that significance tests should be used jointly for cross-method comparisons to provide a more comprehensive analysis of classifier performance.
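As a rough illustration of the micro- vs. macro-averaging distinction emphasized above, the following sketch computes both averages from per-category contingency counts; the category names and counts are invented for illustration and are not taken from the paper.

# Illustrative sketch (invented counts): micro- vs. macro-averaged F1
# from per-category contingency counts (tp = true positives,
# fp = false positives, fn = false negatives).

def f1(tp, fp, fn):
    """F1 = harmonic mean of precision and recall, with zero guards."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: one very common category, two rare ones.
counts = {
    "earn":  {"tp": 950, "fp": 60, "fn": 40},   # common category
    "tin":   {"tp": 2,   "fp": 3,  "fn": 8},    # rare category
    "cocoa": {"tp": 1,   "fp": 4,  "fn": 9},    # rare category
}

# Micro-averaged F1: pool all counts first, then compute F1 once.
tp = sum(c["tp"] for c in counts.values())
fp = sum(c["fp"] for c in counts.values())
fn = sum(c["fn"] for c in counts.values())
micro_f1 = f1(tp, fp, fn)

# Macro-averaged F1: average the per-category F1 scores.
macro_f1 = sum(f1(**c) for c in counts.values()) / len(counts)

print(f"micro-F1 = {micro_f1:.3f}")   # dominated by the common category
print(f"macro-F1 = {macro_f1:.3f}")   # pulled down by the rare categories

Pooling counts (micro-averaging) lets common categories dominate, whereas averaging per-category F1 scores (macro-averaging) weights rare categories equally, which is why the two views can rank classifiers differently on a skewed corpus such as Reuters-21578.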