SCIBERT: A Pretrained Language Model for Scientific Text


10 Sep 2019 | Iz Beltagy, Kyle Lo, Arman Cohan
Iz Beltagy, Kyle Lo, Arman Cohan
Allen Institute for Artificial Intelligence, Seattle, WA, USA
{beltagy,kylel,armanc}@allenai.org

**Abstract**

Obtaining large-scale annotated data for NLP tasks in the scientific domain is challenging and expensive. We release SciBERT, a pretrained language model based on BERT (Devlin et al., 2019), to address the lack of high-quality, large-scale labeled scientific data. SciBERT leverages unsupervised pretraining on a large multi-domain corpus of scientific publications to improve performance on downstream scientific NLP tasks. We evaluate SciBERT on a suite of tasks including sequence tagging, sentence classification, and dependency parsing, with datasets from a variety of scientific domains. We demonstrate statistically significant improvements over BERT and achieve new state-of-the-art results on several of these tasks. The code and pretrained models are available at <https://github.com/allenai/scibert/>.

**Introduction**

The exponential increase in the volume of scientific publications has made NLP an essential tool for large-scale knowledge extraction and machine reading of these documents. Recent progress in NLP has been driven by the adoption of deep neural models, but training such models often requires large amounts of labeled data. In general domains, large-scale training data can often be obtained through crowdsourcing, but in scientific domains annotated data is difficult and expensive to collect because of the expertise required for quality annotation.

Unsupervised pretraining of language models on large corpora significantly improves performance on many NLP tasks. These models return contextualized embeddings for each token, which can be passed into minimal task-specific neural architectures. Leveraging unsupervised pretraining has become especially important in scientific NLP, where task-specific annotations are difficult to obtain. While both BERT and ELMo have released pretrained models, these models are trained on general-domain corpora such as news articles and Wikipedia.

In this work, we make the following contributions: (i) We release SciBERT, a new resource demonstrated to improve performance on a range of NLP tasks in the scientific domain. SciBERT is a pretrained language model based on BERT but trained on a large corpus of scientific text. (ii) We perform extensive experimentation to investigate the performance of finetuning versus task-specific architectures atop frozen embeddings, and the effect of an in-domain vocabulary. (iii) We evaluate SciBERT on a suite of tasks in the scientific domain and achieve new state-of-the-art (SOTA) results on many of these tasks.

The BERT model architecture (Devlin et al., 2019) is based on a multilayer bidirectional Transformer (Vaswani et al., 2017). Instead of the traditional left-to-right language modeling objective, BERT is trained on two tasks: predicting randomly masked tokens and predicting whether two sentences follow each other.
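To make the masked-token objective concrete, the sketch below corrupts a batch of token ids the way BERT-style pretraining does: roughly 15% of positions are selected for prediction, and of those 80% become the `[MASK]` token, 10% are replaced with a random token, and 10% are left unchanged. This is an illustrative PyTorch sketch of the recipe from Devlin et al. (2019), not the authors' pretraining code; the function name and signature are hypothetical.

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """Corrupt `input_ids` for masked language modeling; return (inputs, labels)."""
    input_ids = input_ids.clone()
    labels = input_ids.clone()

    # Select ~15% of positions to predict. (For simplicity this sketch does not
    # exclude special tokens such as [CLS], [SEP], or padding.)
    selected = torch.bernoulli(torch.full(labels.shape, mlm_prob)).bool()
    labels[~selected] = -100  # ignored by PyTorch's cross-entropy loss

    # 80% of the selected positions become [MASK].
    masked = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & selected
    input_ids[masked] = mask_token_id

    # 10% of the selected positions become a random vocabulary token.
    random = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & selected & ~masked
    input_ids[random] = torch.randint(vocab_size, labels.shape)[random]

    # The remaining 10% keep their original token; the model must still predict it.
    return input_ids, labels
```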
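The contextualized token embeddings mentioned in the introduction can be obtained directly from the released model. The following is a minimal sketch, assuming the Hugging Face `transformers` library and the publicly released `allenai/scibert_scivocab_uncased` checkpoint; the example sentence is made up.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Load the released SciBERT checkpoint (scientific vocabulary, uncased).
tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")
model.eval()

sentence = "The glucocorticoid receptor regulates transcription of target genes."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One contextualized vector per WordPiece token: shape (batch, seq_len, hidden).
token_embeddings = outputs.last_hidden_state
print(token_embeddings.shape)
```

Each row of `last_hidden_state` is a contextual vector for one WordPiece token, which is what a lightweight task-specific layer such as a tagger or sentence classifier would consume.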
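Contribution (ii) contrasts finetuning the full model with training a task-specific head on frozen embeddings. The sketch below shows what that distinction amounts to in PyTorch terms; the classification head, label count, and learning rates are hypothetical and not taken from the paper.

```python
import torch
from torch import nn
from transformers import AutoModel

encoder = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")
num_labels = 3                      # hypothetical sentence-classification task
head = nn.Linear(encoder.config.hidden_size, num_labels)

FINETUNE = False                    # toggle between the two regimes

if not FINETUNE:
    # Frozen setting: SciBERT is a fixed feature extractor and only the
    # task-specific head receives gradient updates.
    for param in encoder.parameters():
        param.requires_grad_(False)
    optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
else:
    # Finetuning setting: all SciBERT weights are updated together with the head.
    optimizer = torch.optim.Adam(
        list(encoder.parameters()) + list(head.parameters()), lr=2e-5
    )

def classify(input_ids, attention_mask):
    # Use the final hidden state of the [CLS] token as the sentence representation.
    hidden = encoder(input_ids=input_ids,
                     attention_mask=attention_mask).last_hidden_state
    return head(hidden[:, 0])       # logits over the hypothetical labels
```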