BERTweet: A pre-trained language model for English Tweets

5 Oct 2020 | Dat Quoc Nguyen, Thanh Vu, Anh Tuan Nguyen
BERTweet is the first public large-scale pre-trained language model specifically designed for English Tweets. It uses the BERT_base architecture and is trained with the RoBERTa pre-training procedure on an 80GB corpus of 850 million English Tweets, addressing the distinctive characteristics of Tweets such as their short length and informal grammar. BERTweet outperforms strong baselines, including RoBERTa_base and XLM-R_base, on three Tweet NLP tasks: part-of-speech tagging, named-entity recognition, and text classification. Experiments confirm its effectiveness, showing that it improves upon previous state-of-the-art models and provides a strong baseline for future Tweet analytic tasks. BERTweet is released under the MIT License to facilitate research and applications on Tweet data.
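Although this summary does not include usage details, the sketch below shows how one might load BERTweet with the Hugging Face transformers library to extract Tweet features. The model id "vinai/bertweet-base", the normalization tokenizer option, and the example Tweet are assumptions for illustration, not taken from this summary.

```python
# Minimal sketch: extracting BERTweet features for a Tweet with the
# Hugging Face transformers library. The model id "vinai/bertweet-base"
# and the normalization flag are assumptions, not stated in this summary.
import torch
from transformers import AutoModel, AutoTokenizer

# normalization=True asks the tokenizer to apply Tweet-specific
# preprocessing (e.g. mapping user mentions and URLs to special tokens).
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", normalization=True)
model = AutoModel.from_pretrained("vinai/bertweet-base")

tweet = "BERTweet gives strong results on Tweet NLP tasks via @USER :smile:"
inputs = tokenizer(tweet, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state holds one contextual embedding per (sub)token; such
# features can feed POS-tagging, NER, or text-classification heads.
print(outputs.last_hidden_state.shape)  # (1, num_tokens, 768)
```

For the downstream tasks above, one would typically fine-tune the model with a task-specific head on labeled Tweet data rather than use the raw embeddings directly.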