BERTweet: A pre-trained language model for English Tweets


5 Oct 2020 | Dat Quoc Nguyen, Thanh Vu, Anh Tuan Nguyen
BERTweet is the first large-scale pre-trained language model for English tweets. It uses the same architecture as BERT_base and is trained with the RoBERTa pre-training procedure. BERTweet outperforms strong baselines such as RoBERTa_base and XLM-R_base on three tweet NLP tasks: part-of-speech (POS) tagging, named-entity recognition (NER), and text classification. The model is released under the MIT License and is available at https://github.com/VinAIResearch/BERTweet.

BERTweet is pre-trained on an 80GB corpus of 850M English tweets, each containing between 10 and 64 word tokens. The corpus consists of general tweets from 2012 to 2019 and tweets related to the COVID-19 pandemic from 2020. Tweets are segmented into subword units with fastBPE, and the model is pre-trained for 40 epochs using the fairseq library.

Experiments show that BERTweet outperforms RoBERTa_base and XLM-R_base on all three tasks. It also improves on previous state-of-the-art results, with a 14% improvement in novel and emerging entity recognition on the WNUT17 dataset and 5% and 4% improvements in text classification on the SemEval2017-Task4A and SemEval2018-Task3A datasets, respectively. These results confirm the effectiveness of a large-scale, domain-specific pre-trained language model for English tweets.
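Since the pre-trained weights are publicly released, they can be loaded through the Hugging Face transformers library. The sketch below is a usage illustration rather than part of the paper; the checkpoint id vinai/bertweet-base and the pre-normalized example tweet follow the BERTweet repository's documentation, and the exact API details may vary across transformers versions.

```python
# Minimal sketch: loading the released BERTweet checkpoint with Hugging Face transformers.
# Assumes `transformers` and `torch` are installed; "vinai/bertweet-base" is the
# checkpoint id published in the BERTweet repository.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False)
model = AutoModel.from_pretrained("vinai/bertweet-base")

# Example tweet from the repository documentation (already soft-normalized:
# user mentions -> @USER, URLs -> HTTPURL, emoji -> text strings).
tweet = "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry:"

input_ids = torch.tensor([tokenizer.encode(tweet)])
with torch.no_grad():
    features = model(input_ids)

# Contextual embeddings for each subword token, shape (1, sequence_length, 768).
print(features.last_hidden_state.shape)
```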
The study also evaluates the impact of lexical normalization strategies on model performance and finds that the "soft" normalization strategy generally performs better than the "hard" one (a sketch of a soft normalization pass is given below). BERTweet is also compared with larger models such as RoBERTa_large and XLM-R_large, which have significantly larger configurations but do not outperform BERTweet on all tasks. The paper concludes that BERTweet is a strong baseline for future research and applications in tweet analysis, and mentions plans to release a "large" version of BERTweet that may perform better than RoBERTa_large and XLM-R_large on all three evaluation tasks. The model is publicly available for use in future research and applications.
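For illustration, a minimal sketch of a "soft" normalization pass is given below. It assumes the conventions described in the paper: user mentions are converted to @USER, URLs to HTTPURL, and emotion icons are translated into text strings via the emoji package. The regular expressions and the helper name soft_normalize are illustrative assumptions, not the authors' implementation.

```python
# Rough sketch of a "soft" tweet-normalization pass (illustrative, not the authors' code).
import re
import emoji

MENTION_RE = re.compile(r"@\w+")
URL_RE = re.compile(r"https?://\S+|www\.\S+")

def soft_normalize(tweet: str) -> str:
    tweet = MENTION_RE.sub("@USER", tweet)   # anonymize user mentions
    tweet = URL_RE.sub("HTTPURL", tweet)     # collapse URLs into a single special token
    tweet = emoji.demojize(tweet)            # translate emoji into text shortcodes
    return " ".join(tweet.split())           # squeeze repeated whitespace

print(soft_normalize("So sad 😢 @jack check this out https://t.co/xyz"))
# e.g. "So sad :crying_face: @USER check this out HTTPURL"
# (the exact emoji shortcode depends on the installed emoji package version)
```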