Publicly Available Clinical BERT Embeddings

20 Jun 2019 | Emily Alsentzer, John R. Murphy, Willie Boag, Wei-Hung Weng, Di Jin, Tristan Naumann, Matthew B. A. McDermott
This paper introduces publicly available clinical BERT embeddings trained on two types of clinical text: general clinical notes and discharge summaries. The authors show that these domain-specific BERT models outperform general-domain embeddings on three common clinical NLP tasks, but perform worse on two de-identification tasks, a gap they attribute to the mismatch between the de-identified source text and task text containing synthetic identifiers.

Two BERT models were trained on clinical text: Clinical BERT, which uses text from all note types, and Discharge Summary BERT, which uses only discharge summaries. A Clinical BioBERT variant was also trained, initialized from BioBERT rather than from general BERT. These models were fine-tuned on several clinical NLP tasks, including named entity recognition (NER) and natural language inference (NLI). Clinical BERT achieves state-of-the-art performance on the MedNLI task and outperforms both BioBERT and general BERT on NER, with Discharge Summary BERT offering further improvements over Clinical BERT on some tasks. Comparing the two embedding spaces, the authors find that Clinical BERT shows greater cohesion among medical terms than BioBERT.

On the de-identification tasks, however, the clinical models offer no improvement; the authors again attribute this to the differing data distributions and suggest that introducing synthetic de-identification into the source clinical text could improve performance. The authors release their models for public use, emphasizing the importance of domain-specific embeddings for clinical NLP tasks.
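The "cohesion" comparison above can be made concrete as an average pairwise cosine similarity over the embeddings of a set of medical terms: a model whose medical-term vectors cluster more tightly scores higher. The sketch below uses tiny hypothetical 3-d vectors in place of real model embeddings, purely to illustrate the measurement; the values and the helper names are assumptions, not the paper's actual analysis code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all unordered pairs (a cohesion proxy)."""
    pairs = [(i, j) for i in range(len(vectors)) for j in range(i + 1, len(vectors))]
    return sum(cosine(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)

# Hypothetical embeddings for three medical terms under each model
# (stand-ins for real term vectors; not values from the paper).
clinical_bert_terms = [[0.90, 0.10, 0.00], [0.80, 0.20, 0.10], [0.85, 0.15, 0.05]]
biobert_terms = [[0.90, 0.10, 0.00], [0.10, 0.90, 0.20], [0.30, 0.20, 0.90]]

# A tighter cluster of medical-term vectors yields a higher cohesion score.
print(mean_pairwise_cosine(clinical_bert_terms) > mean_pairwise_cosine(biobert_terms))
```

With real models, the same function would be applied to term embeddings extracted from each model's encoder output for a fixed vocabulary of medical terms.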
They note that their work is the first to release clinically trained BERT models and hope that these embeddings will be useful to the clinical NLP community.