Hierarchical Classification of Web Content

2000 | Susan Dumais, Hao Chen
This paper explores the use of hierarchical structure for classifying a large, heterogeneous collection of web content. The hierarchical structure is initially used to train distinct second-level classifiers. In the hierarchical case, a model is learned to distinguish a second-level category from the other categories within the same top level; in the flat, non-hierarchical case, a model distinguishes a second-level category from all other second-level categories. Scoring rules can further exploit the hierarchy by considering only second-level categories whose top-level category exceeds a threshold. We use support vector machine (SVM) classifiers, which have been shown to be efficient and effective for text classification but had not previously been explored in the context of hierarchical classification. We found small accuracy advantages for hierarchical models over flat models. For the hierarchical approach, a sequential Boolean decision rule and a multiplicative decision rule gave the same accuracy. Since the sequential approach is much more efficient, requiring only 14%-16% of the comparisons used by the other approaches, we find it to be a good choice for classifying text into large hierarchical structures.

Decomposing a classification problem hierarchically allows for efficiencies in both learning and representation. Each sub-problem is smaller than the original problem, and it is sometimes possible to use a much smaller feature set for each. The hierarchical structure can also be used to set the negative set for discriminative training and, at classification time, to combine information from different levels. In addition, there is some evidence that decomposing the problem can lead to more accurate specialized classifiers. Intuitively, many potentially good features are not useful discriminators in non-hierarchical representations. In a hierarchical model, the word "computer" would be very discriminating at the first level; at the second level, more specialized words could be used as features within the top-level "Computer" category, and the same features could be reused at the second level for two different top-level classes. Informal failure analyses of classification errors for non-hierarchical models support this intuition: many of the errors involve related categories, so category-specific features should improve accuracy.

Several researchers have recently investigated the use of hierarchies for text classification, with promising results. Our work differs from earlier work in two important respects. First, we test the approach on a large collection of very heterogeneous web content, which we believe is increasingly characteristic of information organization problems. Second, we use a learning model, the support vector machine (SVM), that had not previously been explored in the context of hierarchical classification. SVMs have been found to be more accurate for text classification than popular approaches such as naive Bayes, neural nets, and Rocchio. We use a reduced-dimension, binary-feature version of the SVM model that is very efficient for both initial learning and real-time classification, making it applicable to large, dynamic collections. We found small advantages in the F1 accuracy score for the hierarchical models compared with a baseline flat, non-hierarchical model.
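To make the difference between the two training setups concrete, the following is a minimal sketch (not the authors' code), assuming a two-level taxonomy with labels given as (top-level, second-level) pairs and using scikit-learn's LinearSVC as a stand-in for the paper's reduced-dimension binary-feature SVMs; the function and parameter names are illustrative only.

# Sketch: how the negative training set differs between the flat and
# hierarchical setups for one second-level classifier.
from sklearn.svm import LinearSVC

def train_second_level(X, labels, target, mode="hierarchical"):
    """Train one binary SVM for the second-level category `target`.

    X:      document-by-feature matrix (one row per document)
    labels: list of (top_level, second_level) pairs, one per row of X
    target: (top_level, second_level) pair defining the positive class
    mode:   "hierarchical" -> negatives come only from the same top-level
            category; "flat" -> negatives come from all other categories
    """
    top = target[0]
    if mode == "hierarchical":
        rows = [i for i, (t, _) in enumerate(labels) if t == top]
    else:
        rows = list(range(len(labels)))
    y = [1 if labels[i] == target else 0 for i in rows]
    return LinearSVC().fit(X[rows], y)

In the hierarchical setup each second-level classifier is trained on only the documents under its own top-level category, which is part of why learning each sub-problem is cheaper than learning the flat problem.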
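The two ways of combining scores at classification time can be sketched in the same spirit. Assume per-category probabilities P1 (top level, keyed by category) and P2 (second level, keyed by (top, second) pairs), obtained for example by mapping SVM outputs to probabilities; the threshold values and function names below are illustrative assumptions, not numbers from the paper.

def classify_sequential(P1, P2, top_threshold=0.5, second_threshold=0.5):
    """Sequential Boolean rule: evaluate a second-level category only if its
    top-level parent already exceeds the top-level threshold. Skipping the
    remaining second-level models is what reduces the number of comparisons."""
    passed = {c1 for c1, p in P1.items() if p >= top_threshold}
    return {(c1, c2) for (c1, c2), p in P2.items()
            if c1 in passed and p >= second_threshold}

def classify_multiplicative(P1, P2, threshold=0.25):
    """Multiplicative rule: score every second-level category by the product
    of its own probability and its parent's, then threshold the product."""
    return {(c1, c2) for (c1, c2), p in P2.items()
            if P1[c1] * p >= threshold}

The sequential rule only scores the children of top-level categories that pass the first threshold, which is consistent with it requiring a small fraction of the comparisons while, per the paper, matching the multiplicative rule's accuracy.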