26 Jul 2019 | Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov
RoBERTa is a robustly optimized BERT pretraining approach that improves on the original BERT model by revisiting key design choices. The paper presents a replication study of BERT pretraining that carefully measures the impact of hyperparameters and training data size. It finds that BERT was significantly undertrained and proposes an improved training recipe, RoBERTa, which can match or exceed the performance of models published after BERT. The recipe consists of training longer with larger batches over more data, removing the next sentence prediction (NSP) objective, training on longer sequences, and dynamically changing the masking pattern applied to the training data. The study also introduces a new dataset, CC-NEWS, to better control for the effects of training set size.
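As a concrete illustration of the last point, here is a minimal sketch of dynamic masking versus the static masking used in the original BERT preprocessing. The token IDs, the MASK_ID constant, and the apply_mlm_mask helper are illustrative assumptions rather than code from the released implementation, and BERT's 80/10/10 mask/replace/keep split is omitted for brevity.

```python
# Minimal sketch: static vs. dynamic masking for masked language modeling.
import random

MASK_ID = 103          # hypothetical [MASK] token id
MASK_PROB = 0.15       # the 15% masking budget used by BERT and RoBERTa

def apply_mlm_mask(token_ids, seed=None):
    """Return a freshly masked copy of the token sequence."""
    rng = random.Random(seed)
    masked = list(token_ids)
    for i in range(len(masked)):
        if rng.random() < MASK_PROB:
            masked[i] = MASK_ID
    return masked

tokens = [1012, 2054, 2003, 1996, 3007, 1997, 2605, 1029]

# Static masking (original BERT): the mask is chosen once during preprocessing,
# so every epoch sees the same corrupted view of the sequence.
static_view = apply_mlm_mask(tokens, seed=0)

# Dynamic masking (RoBERTa): a new mask is sampled each time the sequence is
# fed to the model, so repeated passes over the data see different views.
for epoch in range(3):
    dynamic_view = apply_mlm_mask(tokens)
    print(epoch, dynamic_view)
```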
When controlling for training data, RoBERTa improves upon published BERT results on GLUE and SQuAD. It achieves a score of 88.5 on the public GLUE leaderboard, matching the 88.4 reported by Yang et al. (2019). RoBERTa also establishes a new state of the art on four of the nine GLUE tasks: MNLI, QNLI, RTE, and STS-B, and matches state-of-the-art results on SQuAD and RACE. These results highlight the importance of previously overlooked design choices and raise questions about the source of recently reported improvements.
The paper also explores various training procedures, including dynamic masking, alternative input formats, and large-batch training. It finds that training with larger batches improves perplexity on the masked language modeling objective as well as end-task accuracy, and that removing the next sentence prediction objective matches or slightly improves downstream task performance. A sketch of how large effective batches can be realized follows below.
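For readers who want to see what "larger batches" means in practice, the following is a hedged sketch of simulating a large effective batch through gradient accumulation. The stand-in model, optimizer settings, and micro-batch counts are illustrative assumptions, not the paper's exact configuration (RoBERTa itself was trained with batches of up to 8K sequences across many accelerators).

```python
# Sketch: simulating a large effective batch via gradient accumulation in PyTorch.
import torch

accumulation_steps = 32           # e.g. 32 micro-batches of 256 -> effective batch of 8,192
model = torch.nn.Linear(768, 2)   # stand-in for the actual masked-LM model
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4)
loss_fn = torch.nn.CrossEntropyLoss()

def train_step(loader):
    optimizer.zero_grad()
    for step, (features, labels) in enumerate(loader, start=1):
        loss = loss_fn(model(features), labels) / accumulation_steps
        loss.backward()                      # gradients accumulate across micro-batches
        if step % accumulation_steps == 0:
            optimizer.step()                 # one update per large effective batch
            optimizer.zero_grad()

# Synthetic micro-batches standing in for the real pretraining data.
loader = [(torch.randn(256, 768), torch.randint(0, 2, (256,))) for _ in range(64)]
train_step(loader)
```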
RoBERTa is trained with dynamic masking, full-sentence inputs without the NSP loss, large mini-batches, and a larger byte-level BPE vocabulary of about 50K subword units. It is evaluated on the GLUE, SQuAD, and RACE benchmarks. On GLUE, RoBERTa achieves state-of-the-art results on all nine development sets. On the SQuAD v2.0 development set, RoBERTa sets a new state of the art, improving over XLNet by 0.4 points (EM) and 0.6 points (F1). On RACE, RoBERTa achieves state-of-the-art results on both the middle-school and high-school settings.
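Because the pretrained checkpoints were released publicly, they can also be loaded from third-party libraries. Below is a small usage sketch with the Hugging Face transformers library (not the authors' original fairseq release); the example sentence and the decoding of the top prediction are illustrative only.

```python
# Sketch: loading a released RoBERTa checkpoint and filling in a masked token.
import torch
from transformers import RobertaTokenizer, RobertaForMaskedLM

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")  # byte-level BPE, ~50K vocab
model = RobertaForMaskedLM.from_pretrained("roberta-base")
model.eval()

inputs = tokenizer(
    "RoBERTa was pretrained with a <mask> language modeling objective.",
    return_tensors="pt",
)
with torch.no_grad():
    logits = model(**inputs).logits

# Locate the masked position and decode the model's top prediction for it.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_pos].argmax(dim=-1)
print(tokenizer.decode(predicted_id))  # prints the most likely token for the <mask> slot
```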
The paper also discusses related work, including other pretraining methods and their performance. It concludes that RoBERTa's design choices are important for achieving state-of-the-art results and that BERT's pretraining objective remains competitive with recent alternatives. The authors release their models and code for pretraining and fine-tuning.