2019 | Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, Jaewoo Kang
BioBERT is a pre-trained language model designed for biomedical text mining. It addresses the difficulty of applying general-purpose NLP models such as BERT to biomedical text, whose word distributions differ significantly from those of general-domain corpora. To capture this domain-specific language, BioBERT is pre-trained on large-scale biomedical corpora, including PubMed abstracts and PMC full-text articles.
The model outperforms BERT and previous state-of-the-art models on three key biomedical text mining tasks: biomedical named entity recognition (NER), biomedical relation extraction (RE), and biomedical question answering (QA). Over the previous state of the art, BioBERT achieves a 0.62% F1 score improvement in NER, a 2.80% F1 score improvement in RE, and a 12.24% MRR improvement in QA. These results demonstrate that pre-training BERT on biomedical corpora significantly enhances its performance in biomedical text mining.
BioBERT shares BERT's architecture but is adapted for the biomedical domain: it is initialized from BERT's pre-trained weights and then further pre-trained on biomedical texts to capture domain-specific knowledge. It uses WordPiece tokenization, which splits out-of-vocabulary words into known subword units. The model is then fine-tuned on three major biomedical text mining tasks: NER, RE, and QA. The pre-trained weights and source code for fine-tuning are publicly available.
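To make the tokenization step concrete, here is a minimal sketch of the greedy longest-match-first segmentation that WordPiece uses to split an out-of-vocabulary word into known subwords. The `vocab` below is a tiny hypothetical vocabulary for illustration only, not BERT's actual 30k-entry WordPiece vocabulary.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first subword segmentation (WordPiece-style).

    Non-initial subwords carry the "##" continuation prefix, as in BERT.
    Returns [unk] if the word cannot be segmented with the given vocab.
    """
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        # Try the longest possible substring first, shrinking from the right.
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # continuation-piece marker
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:
            return [unk]  # no segmentation found
        tokens.append(piece)
        start = end
    return tokens

# Toy vocabulary (hypothetical): the whole word "immunoglobulin" is
# out-of-vocabulary, but it splits into two known subword pieces.
vocab = {"immuno", "##globulin", "gene"}
print(wordpiece_tokenize("immunoglobulin", vocab))  # → ['immuno', '##globulin']
```

This subword splitting is why BioBERT can represent rare biomedical terms without a domain-specific vocabulary: unseen words decompose into pieces the model has already learned.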
The study shows that pre-training BERT on biomedical corpora is crucial for effective biomedical text mining. BioBERT requires minimal architectural modifications to perform well on various biomedical NLP tasks. The results indicate that BioBERT significantly improves performance on biomedical NER, RE, and QA compared to previous models. The pre-trained weights and source code are available for further research and application.