Hierarchical Classification of Web Content

2000 | Susan Dumais, Hao Chen
This paper explores the use of hierarchical structure for classifying a large, heterogeneous collection of web content. The hierarchical structure is initially used to train distinct second-level classifiers. In the hierarchical case, a model is learned to distinguish a second-level category from the other categories within the same top level; in the flat, non-hierarchical case, a model distinguishes a second-level category from all other second-level categories. Scoring rules can further exploit the hierarchy by considering only second-level categories whose top-level category exceeds a threshold. We use support vector machine (SVM) classifiers, which have been shown to be efficient and effective for text classification but had not previously been explored in the context of hierarchical classification. We found small accuracy advantages for hierarchical models over flat models. For the hierarchical approach, a sequential Boolean decision rule and a multiplicative decision rule gave the same accuracy. Since the sequential approach is much more efficient, requiring only 14%-16% of the comparisons used by the other approaches, we find it to be a good choice for classifying text into large hierarchical structures.

Decomposing a classification problem hierarchically allows for efficiencies in both learning and representation. Each sub-problem is smaller than the original problem, and it is sometimes possible to use a much smaller feature set for each. The hierarchical structure can also be used to set the negative set for discriminative training and, at classification time, to combine information from different levels. In addition, there is some evidence that decomposing the problem can lead to more accurate specialized classifiers. Intuitively, many potentially good features are not useful discriminators in non-hierarchical representations. In a hierarchical model, the word "computer" would be very discriminating at the first level; at the second level, more specialized words could be used as features within the top-level "Computer" category, and the same features could be reused at the second level for two different top-level classes. Informal failure analyses of classification errors for non-hierarchical models support this intuition: many of the errors involve related categories, so category-specific features should improve accuracy.

Several researchers have recently investigated the use of hierarchies for text classification, with promising results. Our work differs from earlier work in two important respects. First, we test the approach on a large collection of very heterogeneous web content, which we believe is increasingly characteristic of information organization problems. Second, we use a learning model, the support vector machine (SVM), that had not previously been explored in the context of hierarchical classification. SVMs have been found to be more accurate for text classification than popular approaches such as naive Bayes, neural nets, and Rocchio. We use a reduced-dimension, binary-feature version of the SVM model that is very efficient for both initial learning and real-time classification, making it applicable to large, dynamic collections. We found small advantages in the F1 accuracy score for the hierarchical models compared with a baseline flat, non-hierarchical model.
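To make the difference between the two training setups concrete, the following is a minimal sketch (not the authors' code), assuming a two-level taxonomy with labels given as (top-level, second-level) pairs and using scikit-learn's LinearSVC as a stand-in for the paper's reduced-dimension binary-feature SVMs; the function and parameter names are illustrative only.

# Sketch: how the negative training set differs between the flat and
# hierarchical setups for one second-level classifier.
from sklearn.svm import LinearSVC

def train_second_level(X, labels, target, mode="hierarchical"):
    """Train one binary SVM for the second-level category `target`.

    X:      document-by-feature matrix (one row per document)
    labels: list of (top_level, second_level) pairs, one per row of X
    target: (top_level, second_level) pair defining the positive class
    mode:   "hierarchical" -> negatives come only from the same top-level
            category; "flat" -> negatives come from all other categories
    """
    top = target[0]
    if mode == "hierarchical":
        rows = [i for i, (t, _) in enumerate(labels) if t == top]
    else:
        rows = list(range(len(labels)))
    y = [1 if labels[i] == target else 0 for i in rows]
    return LinearSVC().fit(X[rows], y)

In the hierarchical setup each second-level classifier is trained on only the documents under its own top-level category, which is part of why learning each sub-problem is cheaper than learning the flat problem.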
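The two ways of combining scores at classification time can be sketched in the same spirit. Assume per-category probabilities P1 (top level, keyed by category) and P2 (second level, keyed by (top, second) pairs), obtained for example by mapping SVM outputs to probabilities; the threshold values and function names below are illustrative assumptions, not numbers from the paper.

def classify_sequential(P1, P2, top_threshold=0.5, second_threshold=0.5):
    """Sequential Boolean rule: evaluate a second-level category only if its
    top-level parent already exceeds the top-level threshold. Skipping the
    remaining second-level models is what reduces the number of comparisons."""
    passed = {c1 for c1, p in P1.items() if p >= top_threshold}
    return {(c1, c2) for (c1, c2), p in P2.items()
            if c1 in passed and p >= second_threshold}

def classify_multiplicative(P1, P2, threshold=0.25):
    """Multiplicative rule: score every second-level category by the product
    of its own probability and its parent's, then threshold the product."""
    return {(c1, c2) for (c1, c2), p in P2.items()
            if P1[c1] * p >= threshold}

The sequential rule only scores the children of top-level categories that pass the first threshold, which is consistent with it requiring a small fraction of the comparisons while, per the paper, matching the multiplicative rule's accuracy.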