20 Jun 2019 | Emily Alsentzer, John R. Murphy, Willie Boag, Wei-Hung Weng, Di Jin, Tristan Naumann, Matthew B. A. McDermott
This paper presents publicly available clinical BERT embeddings. The authors address the lack of pre-trained BERT models for clinical text by developing two models: one trained on general clinical text and another on discharge summaries. They demonstrate that these domain-specific models outperform non-specific embeddings on three common clinical NLP tasks. However, the models perform worse on two de-identification tasks, which the authors attribute to a distribution mismatch: protected health information (PHI) is masked out of the de-identified source text, while the de-identification task text contains realistic synthetic PHI.
The authors train and release models initialized from both BERT-Base and BioBERT, further pre-trained on clinical notes and on discharge summaries. They show that the clinical-specific embeddings improve performance on tasks such as named entity recognition (NER) and natural language inference (NLI), but not on de-identification. They argue that the de-identification tasks are harder for these models because the task text contains synthetically inserted PHI, yielding a different text distribution from the masked, de-identified text the models were pre-trained on.
The paper also includes qualitative comparisons of the embeddings, showing that clinical BERT retains greater cohesion around medical terms than BioBERT does. The authors note limitations, including the absence of more advanced model architectures and the reliance on data from a single healthcare institution. They suggest that inserting synthetic PHI into the source text during pre-training could improve performance on de-identification tasks.
The authors conclude that their clinical BERT models are useful for non-de-identification clinical NLP tasks and that note-type-specific corpora can yield further performance gains. They release both models for public use, hoping to benefit the clinical NLP community without requiring the significant computational resources needed to pre-train on the MIMIC corpus.
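As a concrete illustration of using the released models, the sketch below loads the authors' public Clinical BERT release through the Hugging Face hub and extracts contextual embeddings for a clinical sentence. This is a minimal usage sketch, not code from the paper; it assumes the `transformers` and `torch` libraries and the model ID `emilyalsentzer/Bio_ClinicalBERT`, under which the model is distributed.

```python
# Minimal sketch: extracting contextual embeddings from the publicly
# released Clinical BERT model. Assumes the Hugging Face `transformers`
# library and network access to download the pre-trained weights.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "emilyalsentzer/Bio_ClinicalBERT"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)
model.eval()

# An illustrative clinical-style sentence (not from the paper).
note = "Patient presents with acute dyspnea and elevated troponin."
inputs = tokenizer(note, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One 768-dimensional vector per subword token (BERT-Base hidden size).
embeddings = outputs.last_hidden_state
```

These per-token vectors are what downstream task models (e.g. an NER tagger or an NLI classifier) would consume, typically by fine-tuning the whole network rather than using the frozen embeddings directly.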