July 27–31, 2011 | Alan Ritter, Sam Clark, Mausam and Oren Etzioni
This paper addresses Named Entity Recognition (NER) in tweets, which are noisy, informal, and contain a wide variety of entity types. The authors propose a system, T-NER, that leverages in-domain, out-of-domain, and unlabeled data to improve performance. T-NER applies LabeledLDA with Freebase dictionaries as a source of distant supervision, achieving a 25% increase in F1 score over co-training on ten common entity types. The system also incorporates features from part-of-speech tagging, shallow parsing, and capitalization classification. Experiments evaluate each of these techniques and show significant improvements over state-of-the-art news-trained NER systems. The tools and datasets used in the study are publicly available.
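The role of the Freebase dictionaries as distant supervision can be illustrated with a small sketch. This is not the authors' implementation and does not run LabeledLDA itself; it only shows one plausible way dictionary membership could narrow the set of candidate types for an entity mention before a topic model chooses among them. The dictionary contents, the type names, the `candidate_types` helper, and the choice to leave unmatched mentions unconstrained are all assumptions made for illustration.

```python
from typing import Dict, List, Set

# Toy stand-ins for Freebase type dictionaries (hypothetical entries,
# not taken from the paper or from Freebase).
TYPE_DICTIONARIES: Dict[str, Set[str]] = {
    "PERSON": {"paris hilton", "jordan"},
    "GEO-LOC": {"paris", "jordan", "seattle"},
    "COMPANY": {"amazon", "apple"},
}

ALL_TYPES: List[str] = sorted(TYPE_DICTIONARIES)


def candidate_types(mention: str) -> List[str]:
    """Return the types whose dictionary contains the mention.

    Assumption: a mention found in no dictionary is left unconstrained,
    so every type remains a candidate.
    """
    key = mention.strip().lower()
    matches = [t for t in ALL_TYPES if key in TYPE_DICTIONARIES[t]]
    return matches if matches else list(ALL_TYPES)


if __name__ == "__main__":
    for mention in ["Jordan", "Amazon", "lol"]:
        print(mention, candidate_types(mention))
```

An ambiguous string such as "Jordan" keeps more than one candidate type, which is why dictionary lookup alone is not enough and a model over the surrounding context is still needed to pick among the candidates.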