Unsupervised Models for Named Entity Classification

Michael Collins and Yoram Singer
This paper presents unsupervised methods for named entity classification and shows that unlabeled data can significantly reduce the need for supervision: only 7 seed rules are needed to achieve high accuracy. The approach exploits redundancy in the data, since for many named entities either the spelling of the name or the context in which it appears is enough to determine its type. Two algorithms are introduced. The first is based on decision list learning; it is similar to the algorithm of Yarowsky (1995), with modifications motivated by Blum and Mitchell (1998). The second, CoBoost, is a boosting-based algorithm that extends AdaBoost to the co-training framework.
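To make the idea of seed rules over spelling features concrete, here is a minimal Python sketch. The three rules, the feature names (`full-string=`, `contains(...)`), and the helper functions are illustrative assumptions chosen for this example; they are not a listing of the authors' actual rule set or code.

```python
from typing import Optional, Set

# Hypothetical seed rules: each maps one spelling feature to an entity type.
SEED_RULES = [
    ("full-string=New York", "location"),
    ("contains(Mr.)", "person"),
    ("contains(Incorporated)", "organization"),
]


def spelling_features(name: str) -> Set[str]:
    """Extract simple spelling features from the entity string itself."""
    feats = {f"full-string={name}"}
    feats.update(f"contains({token})" for token in name.split())
    return feats


def apply_seed_rules(name: str) -> Optional[str]:
    """Return the label of the first matching seed rule, or None if none fires."""
    feats = spelling_features(name)
    for feature, label in SEED_RULES:
        if feature in feats:
            return label
    return None


print(apply_seed_rules("Mr. Cooper"))    # person
print(apply_seed_rules("Maury Cooper"))  # None
```

Most names match no seed rule, as the second call shows. That gap is what the unlabeled data is meant to close: the context around the few seed-labeled names supplies evidence for labeling the rest.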
The decision list algorithm builds its rules iteratively, alternating between spelling and contextual features: rules learned from one feature type label examples, and those labels are then used to learn rules from the other type. CoBoost trains two classifiers in parallel, one per feature type, to minimize their disagreement on unlabeled examples. Both methods achieve over 91% accuracy using only 7 seed rules and 90,000 unlabeled examples. The paper also discusses the theoretical foundations of these methods, including how unlabeled data can be used to improve agreement between classifiers, and the challenges of applying these techniques to multiclass problems. In the evaluation, CoBoost reaches high accuracy and high agreement between its two classifiers. Overall, the study shows that unsupervised learning can reduce the reliance on labeled data in named entity classification while maintaining high performance.
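The alternation between spelling and contextual rules can be sketched as follows. This is a simplified illustration, not the authors' implementation: the smoothed scoring formula, the `alpha` and `threshold` values, the fixed number of rounds, and the toy examples are all assumptions made for the sake of a small runnable example.

```python
from collections import Counter

LABELS = ("person", "organization", "location")


def label_with(rules, feats):
    """Return the label of the highest-scoring rule that matches, or None."""
    matches = [(score, label) for (feat, label), score in rules.items() if feat in feats]
    return max(matches)[1] if matches else None


def induce_rules(examples, labels, view, alpha=0.1, threshold=0.8):
    """Learn (feature, label) -> smoothed precision from the labeled examples."""
    pair_counts, feat_counts = Counter(), Counter()
    for example, label in zip(examples, labels):
        if label is None:
            continue  # still unlabeled: contributes no evidence
        for feat in example[view]:
            pair_counts[(feat, label)] += 1
            feat_counts[feat] += 1
    rules = {}
    for (feat, label), count in pair_counts.items():
        score = (count + alpha) / (feat_counts[feat] + len(LABELS) * alpha)
        if score >= threshold:
            rules[(feat, label)] = score
    return rules


def cotrain(examples, seed_rules, rounds=3):
    """Alternate between the two views, starting from spelling seed rules."""
    spelling_rules, context_rules = dict(seed_rules), {}
    for _ in range(rounds):
        # Spelling rules label the data; learn context rules from those labels.
        labels = [label_with(spelling_rules, ex["spelling"]) for ex in examples]
        context_rules = induce_rules(examples, labels, "context")
        # Context rules label the data; learn spelling rules, keeping the seeds.
        labels = [label_with(context_rules, ex["context"]) for ex in examples]
        spelling_rules = {**induce_rules(examples, labels, "spelling"), **seed_rules}
    return spelling_rules, context_rules


# Toy unlabeled data: each example has a spelling view and a context view.
examples = [
    {"spelling": {"contains(Mr.)"}, "context": {"context=president"}},
    {"spelling": {"full-string=Maury Cooper"}, "context": {"context=president"}},
    {"spelling": {"contains(Incorporated)"}, "context": {"context=company"}},
    {"spelling": {"full-string=Acme"}, "context": {"context=company"}},
]
seeds = {("contains(Mr.)", "person"): 1.0, ("contains(Incorporated)", "organization"): 1.0}

spelling_rules, context_rules = cotrain(examples, seeds)
print(label_with(spelling_rules, {"full-string=Acme"}))  # organization
```

In this toy run, "Acme" matches no seed rule, but it shares the context `context=company` with a seed-labeled organization, so a spelling rule for it is induced through the context view. The low threshold is used only so that rules can form from single occurrences in four examples; on realistic data a stricter cutoff would be more sensible.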
Understanding Unsupervised Models for Named Entity Classification