26 Jul 2019 | Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov
The paper "RoBERTa: A Robustly Optimized BERT Pretraining Approach" by Yinhan Liu et al. presents a replication study of BERT pretraining, aiming to carefully measure the impact of various hyperparameters and training data size. The authors find that BERT was undertrained and propose an improved training recipe called RoBERTa, which can match or exceed the performance of post-BERT models. Key modifications include longer training, larger batches, removal of the next sentence prediction objective, training on longer sequences, and dynamic masking. They also introduce a new dataset, CC-NEWS, to better control for training set size effects. RoBERTa achieves state-of-the-art results on GLUE, RACE, and SQuAD benchmarks, highlighting the importance of overlooked design choices and questioning the source of recent improvements. The paper releases the models and code for pretraining and fine-tuning.The paper "RoBERTa: A Robustly Optimized BERT Pretraining Approach" by Yinhan Liu et al. presents a replication study of BERT pretraining, aiming to carefully measure the impact of various hyperparameters and training data size. The authors find that BERT was undertrained and propose an improved training recipe called RoBERTa, which can match or exceed the performance of post-BERT models. Key modifications include longer training, larger batches, removal of the next sentence prediction objective, training on longer sequences, and dynamic masking. They also introduce a new dataset, CC-NEWS, to better control for training set size effects. RoBERTa achieves state-of-the-art results on GLUE, RACE, and SQuAD benchmarks, highlighting the importance of overlooked design choices and questioning the source of recent improvements. The paper releases the models and code for pretraining and fine-tuning.