Enhanced hypertext categorization using hyperlinks

1998 | Soumen Chakrabarti, Byron Dom, Piotr Indyk
This paper presents a method for categorizing hypertext more accurately by exploiting hyperlinks. The underlying goal is to automatically extract metadata that enables structured search over topic taxonomies, circumvents keyword ambiguity, and improves search quality. The authors propose robust statistical models and a relaxation labeling technique that classify hypertext using link information in a small neighborhood around each document. The technique adapts gracefully to the fraction of neighboring documents whose topics are known.

The authors experimented with pre-classified samples from Yahoo! and the US Patent Database. In previous work, a text classifier misclassified 13% of the Reuters benchmark documents, comparable to the best results ever obtained. The same classifier misclassified 36% of the patents, indicating that classifying hypertext can be harder than classifying plain text. Using terms from neighboring documents increased the error to 38%, whereas the hypertext classifier reduced it to 21%. Results on the Yahoo! sample were more dramatic: the text classifier showed 68% error, and the hypertext classifier reduced this to 21%.

The paper discusses the challenges of hypertext classification, including the noisy nature of link information and the need for a more general model that considers both local and non-local features. The authors propose a model in which a document's topic determines both its text and its propensity to link to documents on related topics. The model is implemented with an iterative algorithm that first guesses the classes from text alone and then updates them over successive iterations. The paper also examines two ways of using hyperlinks in classification: adding terms from neighboring documents, and using the class labels of pre-classified neighbors. The authors show that the class labels of pre-classified neighbors provide a highly distilled and adequate representation of the neighborhood.
The paper also discusses the use of relaxation labeling to classify hypertext documents when only some or none of the neighbor classes are known. The authors conclude that their method achieves significantly improved accuracy at a moderate computational overhead. The method is effective on both corpora: classification error falls from 36% to 21% on the patent corpus and from 68% to 21% on the Yahoo! corpus. Accuracy varies gracefully as the fraction of documents with pre-specified classes changes, and the relaxation scheme enhances accuracy even when no document in the neighborhood has a pre-specified class.
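
The iterative idea summarized above (guess classes from text alone, then repeatedly revise them using the current beliefs about linked documents) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function name `relax_labels`, the class-compatibility matrix `link_probs`, the treatment of links as undirected, and the update rule that treats neighbors as independent are all assumptions made for illustration.

```python
import numpy as np

def relax_labels(text_probs, edges, link_probs, known=None, iters=10):
    """Illustrative relaxation-style labeling (hypothetical helper, not from the paper).

    text_probs : (n_docs, n_classes) text-only class probabilities per document.
    edges      : list of (i, j) pairs; links are treated as undirected (assumption).
    link_probs : (n_classes, n_classes); link_probs[a, b] is the assumed probability
                 that a neighbor of a class-a document belongs to class b.
    known      : optional dict {doc_index: class_index} of pre-classified documents.
    """
    text_probs = np.asarray(text_probs, dtype=float)
    link_probs = np.asarray(link_probs, dtype=float)
    known = known or {}
    n_docs, _ = text_probs.shape

    # Build an adjacency list from the edge pairs.
    neighbors = [[] for _ in range(n_docs)]
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)

    def clamp(p):
        # Pre-classified documents keep probability 1 on their given class.
        for doc, cls in known.items():
            p[doc] = 0.0
            p[doc, cls] = 1.0
        return p

    # Step 1: initial guess from text alone.
    probs = clamp(text_probs / text_probs.sum(axis=1, keepdims=True))

    # Step 2: iteratively combine text evidence with neighbor beliefs.
    for _ in range(iters):
        updated = np.empty_like(probs)
        for i in range(n_docs):
            log_score = np.log(text_probs[i] + 1e-12)
            for j in neighbors[i]:
                # For each candidate class a of doc i: sum_b P(b | a) * belief(j is b).
                log_score += np.log(link_probs @ probs[j] + 1e-12)
            score = np.exp(log_score - log_score.max())
            updated[i] = score / score.sum()
        probs = clamp(updated)
    return probs
```

In this sketch, a document whose text is ambiguous is pulled toward the class of its linked neighbors, and that revised belief then propagates to documents further along the link chain. Passing `known=None` corresponds to the case where no neighbor has a pre-specified class; supplying some entries in `known` corresponds to the partially labeled case.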