January 2021 | YU GU*, ROBERT TINN*, HAO CHENG*, MICHAEL LUCAS, NAOTO USUYAMA, XIAODONG LIU, TRISTAN NAUMANN, JIANFENG GAO, and HOIFUNG POON, Microsoft Research
This paper challenges the prevailing assumption that domain-specific pretraining benefits from starting with general-domain language models, particularly in domains with abundant unlabeled text such as biomedicine. The authors conduct a comprehensive study of the impact of domain-specific pretraining on downstream biomedical NLP tasks. They compile a benchmark, BLURB (Biomedical Language Understanding and Reasoning Benchmark), which covers a wide range of biomedical NLP tasks drawn from publicly available datasets. Their experiments show that domain-specific pretraining from scratch on in-domain text significantly outperforms continual pretraining of general-domain language models. The authors also find that common practices, such as complex tagging schemes in named entity recognition, are unnecessary with BERT models. To accelerate research in biomedical NLP, they release their state-of-the-art pretrained and task-specific models and host a leaderboard featuring the BLURB benchmark.
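To make the NER finding concrete, below is a minimal sketch (not the authors' exact pipeline) of fine-tuning setup with a plain BIO tagging scheme, assuming the Hugging Face transformers library and the publicly released PubMedBERT checkpoint id; the checkpoint name and the "Disease" entity type are illustrative assumptions.

```python
# Minimal sketch: token classification with a simple BIO scheme on top of a
# domain-specific pretrained model. Checkpoint id and label set are assumptions.
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_NAME = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"  # assumed checkpoint id

# Plain BIO labels for a single illustrative entity type; the paper reports that
# more elaborate tagging schemes add little when using BERT-style models.
labels = ["O", "B-Disease", "I-Disease"]

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_NAME,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

# Tokenize a pre-split biomedical sentence; word_ids() on the encoding can be
# used to align word-level BIO tags with the subword pieces before training.
encoding = tokenizer(
    ["Mutations", "in", "BRCA1", "cause", "breast", "cancer"],
    is_split_into_words=True,
    return_tensors="pt",
)
outputs = model(**encoding)
print(outputs.logits.shape)  # (1, sequence_length, num_labels)
```

From here, standard fine-tuning (e.g., cross-entropy over the per-token logits) would be applied to a labeled NER dataset; the point of the sketch is only that a simple linear classification head with BIO labels suffices in this setting.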