July 27–31, 2011 | Alan Ritter, Sam Clark, Mausam and Oren Etzioni
This paper presents an experimental study of Named Entity Recognition (NER) in tweets, a domain where standard NLP tools struggle. The authors propose a new system, T-NER, which significantly outperforms off-the-shelf tools such as the Stanford NER system. T-NER exploits the redundancy inherent in tweets and applies LabeledLDA to Freebase dictionaries as a source of distant supervision, yielding a 25% increase in F1 score over a co-training baseline. The system is built by re-engineering the NLP pipeline stage by stage: part-of-speech tagging, then shallow parsing (chunking), then named-entity recognition.
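The staged pipeline can be sketched as a chain of functions, each consuming the previous stage's output. This is a minimal illustration only: the toy taggers below are invented stand-ins, not the paper's trained models (T-POS, T-CHUNK, etc.).

```python
# Minimal sketch of a staged tweet-NLP pipeline in the spirit of T-NER:
# POS tagging -> chunking -> entity segmentation. The heuristics here are
# hypothetical stand-ins for the paper's trained sequence models.

def pos_tag(tokens):
    """Toy POS tagger: guess capitalized tokens are proper nouns (NNP)."""
    return [(t, "NNP" if t[:1].isupper() else "NN") for t in tokens]

def chunk(tagged):
    """Toy chunker: collapse runs of proper nouns into single NP chunks."""
    chunks, current = [], []
    for tok, tag in tagged:
        if tag == "NNP":
            current.append(tok)
        else:
            if current:
                chunks.append((" ".join(current), "NP"))
                current = []
            chunks.append((tok, "O"))
    if current:
        chunks.append((" ".join(current), "NP"))
    return chunks

def segment_entities(chunks):
    """Toy segmenter: propose every NP chunk as an entity mention."""
    return [text for text, label in chunks if label == "NP"]

tokens = "Yess Yess its official Nintendo announced today".split()
print(segment_entities(chunk(pos_tag(tokens))))  # ['Yess Yess', 'Nintendo']
```

Note the false positive "Yess Yess": capitalization in tweets is noisy, which is exactly the problem the paper's capitalization classifier is designed to detect.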
The study shows that standard NLP tools perform poorly on tweets because of their informal, noisy language. T-NER improves performance by combining in-domain, out-of-domain, and unlabeled data. The system also includes a novel capitalization classifier, T-CAP, which predicts whether a tweet's capitalization is informative.
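The idea behind such a capitalization check can be sketched as below. This is a hedged illustration: the actual T-CAP is a trained classifier, whereas the feature set and threshold rule here are invented purely to show the intuition (capitalization is only a reliable cue when it is neither absent nor overused).

```python
# Hypothetical sketch of a capitalization-informativeness check in the
# spirit of T-CAP. The real classifier is learned from data; the features
# and thresholds below are invented for illustration.

def cap_features(tokens):
    """Extract simple capitalization statistics from a tokenized tweet."""
    words = [t for t in tokens if t.isalpha()]
    if not words:
        return {"frac_cap": 0.0, "all_lower": True}
    caps = sum(1 for w in words if w[:1].isupper())
    return {
        "frac_cap": caps / len(words),  # share of capitalized words
        "all_lower": caps == 0,         # tweet typed in all lowercase
    }

def capitalization_informative(tokens, lo=0.1, hi=0.9):
    """Trust capitalization only when it is neither absent (all
    lowercase) nor overused (nearly every word capitalized)."""
    f = cap_features(tokens)
    return not f["all_lower"] and lo <= f["frac_cap"] <= hi
```

A downstream NER model could then condition its capitalization features on this prediction, ignoring case cues in tweets flagged as uninformative.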
The paper also introduces a new approach to distant supervision based on topic models, which lets the system handle the large number of infrequent and distinctive entity types found in tweets. On named-entity segmentation, T-NER achieves a 52% increase in F1 score over the Stanford NER system. Evaluated on a dataset of 2,400 annotated tweets, the system shows significant improvements on both the segmentation and classification tasks.
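The core of this distant-supervision idea is that an ambiguous entity string is constrained to the set of types a Freebase-style dictionary lists for it, and its contexts across the corpus disambiguate among them. LabeledLDA does this with learned per-entity topic distributions; the sketch below is a deliberately crude stand-in, and the tiny dictionaries and context weights are invented for illustration.

```python
# Sketch of dictionary-constrained type inference in the spirit of
# LabeledLDA over Freebase dictionaries. All data below is hypothetical;
# the paper learns context statistics from unlabeled tweets.

# Hypothetical dictionaries: entity string -> admissible types.
CANDIDATE_TYPES = {
    "china": {"COUNTRY", "DISHWARE"},
    "seattle": {"CITY"},
}

# Hypothetical per-type context-word weights (stand-in for learned topics).
TYPE_CONTEXTS = {
    "COUNTRY": {"visit": 2, "trade": 3, "government": 3},
    "DISHWARE": {"plate": 3, "antique": 2, "set": 2},
    "CITY": {"visit": 2, "downtown": 3},
}

def infer_type(entity, context_words):
    """Score only the entity's admissible types against its contexts."""
    candidates = CANDIDATE_TYPES.get(entity.lower(), set())
    def score(t):
        return sum(TYPE_CONTEXTS[t].get(w, 0) for w in context_words)
    return max(candidates, key=score) if candidates else None

print(infer_type("China", ["trade", "government"]))  # COUNTRY
print(infer_type("China", ["antique", "plate"]))     # DISHWARE
```

Restricting inference to dictionary-listed types is what makes the supervision "distant": no tweet is ever hand-labeled, yet ambiguous strings still receive context-appropriate types.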
The study highlights the importance of using domain-specific data and techniques for NER on tweets, which differ significantly from traditional news corpora. The authors also discuss related work and compare their approach with existing methods, demonstrating the effectiveness of their system in handling the unique challenges of NER on social media text. The tools developed in this study are available for use by the research community.