SCIBERT: A Pretrained Language Model for Scientific Text

10 Sep 2019 | Iz Beltagy, Kyle Lo, Arman Cohan
SCIBERT is a pretrained language model for scientific text, based on BERT. It addresses the challenge that high-quality, large-scale labeled scientific data is expensive and difficult to obtain for NLP tasks. SCIBERT is pretrained on a large corpus of scientific publications to improve performance on downstream scientific NLP tasks, achieving statistically significant improvements over BERT and new state-of-the-art results on several tasks.

The model is trained on a corpus of 1.14 million scientific papers from Semantic Scholar, drawn from the biomedical and computer science domains. SCIBERT uses a custom vocabulary (SCIVOCAB) built from scientific text, which differs substantially from the general-domain vocabulary (BASEVOCAB) used by BERT.

SCIBERT is evaluated on a range of scientific NLP tasks, including named entity recognition, PICO extraction, text classification, relation classification, and dependency parsing. It outperforms BERT-Base on most tasks, especially in the biomedical and computer science domains, and also performs well when its frozen embeddings are fed into task-specific models rather than finetuned end to end. The model is implemented in PyTorch using AllenNLP and is publicly available. SCIBERT's performance highlights the importance of domain-specific pretraining and vocabulary for NLP on scientific text, and demonstrates the potential of pretrained language models for scientific NLP tasks.
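As a concrete illustration, the sketch below loads a released SCIBERT checkpoint and extracts embeddings for a scientific sentence. It assumes the publicly available "allenai/scibert_scivocab_uncased" checkpoint and uses the Hugging Face Transformers API for brevity; the paper's own experiments were run in PyTorch with AllenNLP, so this is an illustrative alternative rather than the authors' exact pipeline.

```python
# Minimal sketch: embed a scientific sentence with SCIBERT.
# Assumes the "allenai/scibert_scivocab_uncased" checkpoint and the
# Hugging Face Transformers library (not the paper's AllenNLP setup).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")

sentence = "The patients were administered 50 mg of atorvastatin daily."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Token-level vectors (useful for tagging tasks such as NER or PICO
# extraction) and the [CLS] vector (useful for sentence classification).
token_embeddings = outputs.last_hidden_state   # shape: (1, seq_len, 768)
cls_embedding = token_embeddings[:, 0, :]      # shape: (1, 768)

# SCIVOCAB keeps many scientific terms intact where BERT's general-domain
# BASEVOCAB would split them into several subword pieces; loading
# "bert-base-uncased" as a second tokenizer makes the difference visible.
print(tokenizer.tokenize("atorvastatin"))
```

Using the embeddings without updating the model's weights corresponds to the frozen-embedding setting evaluated in the paper, while passing gradients through the model corresponds to the finetuning setting.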