BioLORD-2023: semantic textual representations fusing large language models and clinical knowledge graph insights

2024 | François Remy, Kris Demuynck, Thomas Demeester
BioLORD-2023 is a state-of-the-art model for semantic textual similarity (STS) and biomedical concept representation (BCR) in the clinical domain. It integrates large language models (LLMs) with clinical knowledge graphs to improve biomedical semantic representation learning.

The model is trained in three phases: contrastive learning, self-distillation, and weight averaging. The contrastive-learning phase minimizes the distance between the embeddings of concept names and their definitions while maximizing the distance between unrelated concepts. The self-distillation phase accelerates convergence and deepens biomedical knowledge acquisition without sacrificing general language understanding. The weight-averaging phase consolidates these gains and further refines the model's performance.

The release also includes a multilingual variant, BioLORD-2023-M, which uses cross-lingual distillation to extend coverage to up to 50 languages.

Both models are evaluated on downstream STS, BCR, and named entity linking (NEL) tasks, where they show significant improvements over previous models and outperform SapBERT on STS and BCR. The multilingual variant achieves comparable or superior performance on BCR benchmarks and is particularly strong on NEL for non-English languages. These results, validated through extensive experiments on multiple benchmark datasets, make BioLORD-2023 a valuable tool for biomedical applications and highlight the potential of fusing LLMs with clinical knowledge graphs to improve the accuracy and robustness of clinical NLP.
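To make the contrastive phase concrete, here is a minimal sketch of an InfoNCE-style objective in PyTorch. The function name, batch layout, and temperature value are illustrative assumptions, not the authors' released code; the core idea (pull each concept name toward its own definition, push it away from the other definitions in the batch) follows the description above.

```python
# Hedged sketch of the contrastive objective described above, assuming
# precomputed embeddings for a batch of concept names and their paired
# definitions. All names here are illustrative, not the paper's code.
import torch
import torch.nn.functional as F

def contrastive_loss(name_emb: torch.Tensor,
                     def_emb: torch.Tensor,
                     temperature: float = 0.05) -> torch.Tensor:
    """Pull each concept name toward its own definition (diagonal)
    and away from the other definitions in the batch (off-diagonal)."""
    name_emb = F.normalize(name_emb, dim=-1)
    def_emb = F.normalize(def_emb, dim=-1)
    # Cosine similarity between every name and every definition in the batch;
    # the matching pair sits on the diagonal of the logits matrix.
    logits = name_emb @ def_emb.T / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)
```

The weight-averaging phase can likewise be realized by uniformly averaging the parameters of several fine-tuned checkpoints. The helper below is one plausible rendering of that step under this assumption, not the paper's exact procedure.

```python
from collections import OrderedDict

def average_checkpoints(state_dicts: list) -> OrderedDict:
    """Uniform average of matching parameters across several checkpoints
    (one common realization of a weight-averaging phase)."""
    avg = OrderedDict()
    for key in state_dicts[0]:
        avg[key] = sum(sd[key].float() for sd in state_dicts) / len(state_dicts)
    return avg
```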