Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing


January 2021 | YU GU*, ROBERT TINN*, HAO CHENG*, MICHAEL LUCAS, NAOTO USUYAMA, XIAODONG LIU, TRISTAN NAUMANN, JIANFENG GAO, and HOIFUNG POON, Microsoft Research
This paper challenges the assumption that domain-specific pretraining benefits from starting with general-domain language models. Instead, it shows that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch yields substantial gains over continual pretraining of general-domain models. The authors compile a comprehensive biomedical NLP benchmark from publicly available datasets and conduct in-depth comparisons of modeling choices for pretraining and task-specific fine-tuning. Their experiments show that domain-specific pretraining from scratch provides a solid foundation for biomedical NLP, leading to new state-of-the-art results across a wide range of tasks. They also find that some common practices, such as complex tagging schemes in named entity recognition (NER), are unnecessary with BERT models.

To accelerate research in biomedical NLP, the authors release their state-of-the-art pretrained and task-specific models and host a leaderboard for their BLURB benchmark at https://aka.ms/BLURB. The paper gives a detailed overview of neural language model pretraining, covering vocabulary, model architecture, self-supervision, and advanced pretraining techniques, and compares mixed-domain pretraining with domain-specific pretraining from scratch, arguing that the latter is the better strategy because biomedical text offers abundant unlabeled data.

BLURB spans named entity recognition (NER), evidence-based medical information extraction (PICO), relation extraction, sentence similarity, document classification, and question answering. Evaluating a range of pretraining techniques and fine-tuning methods on BLURB, the authors show that domain-specific pretraining from scratch consistently outperforms mixed-domain pretraining, that adversarial pretraining leads to a slight degradation in performance, and that pretraining on general-domain text provides no benefit even with an in-domain vocabulary. They conclude that domain-specific pretraining from scratch is the better strategy for biomedical NLP and that BLURB provides a valuable resource for the community.
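One concrete way to see the in-domain vocabulary argument is to compare how a general-domain WordPiece vocabulary and a vocabulary learned on PubMed text segment biomedical terms. The sketch below is a hedged illustration rather than material from the paper: the Hugging Face model identifiers are assumptions (check https://aka.ms/BLURB for the official release names), and the exact token splits depend on the vocabularies actually shipped with each checkpoint.

```python
# Hedged illustration of the vocabulary argument: a general-domain WordPiece
# vocabulary tends to shatter biomedical terms into sub-word fragments, while
# a vocabulary learned on PubMed text keeps many of them intact.
# Both model identifiers below are assumptions, not taken from the paper.
from transformers import AutoTokenizer

general_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
in_domain_tok = AutoTokenizer.from_pretrained(
    "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
)

for term in ["acetyltransferase", "lymphoma", "naloxone"]:
    print(f"{term:>18}  general: {general_tok.tokenize(term)}"
          f"  in-domain: {in_domain_tok.tokenize(term)}")
```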
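Because the pretrained checkpoints are released publicly, a typical starting point is to load one for task-specific fine-tuning, for example token classification for biomedical NER. The snippet below is a minimal sketch under the assumption that the checkpoint is published on the Hugging Face hub under the identifier shown; the three-label tag set and the example sentence are illustrative only, and the classification head is randomly initialized until fine-tuned on a labeled NER dataset.

```python
# Minimal sketch of setting up the released from-scratch checkpoint for
# biomedical NER. The model identifier and the 3-label tag set are
# assumptions for illustration, not prescribed by the paper.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_ID = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForTokenClassification.from_pretrained(MODEL_ID, num_labels=3)

sentence = "Gefitinib inhibits the epidermal growth factor receptor."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits      # shape: (1, num_subword_tokens, 3)

# Per-token label ids; these are meaningless until the classification head
# has been fine-tuned on an annotated NER corpus.
predicted_tags = logits.argmax(dim=-1)
print(predicted_tags)
```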