MedExpQA: Multilingual Benchmarking of Large Language Models for Medical Question Answering


July 30, 2024 | Iñigo Alonso, Maite Oronoz, Rodrigo Agerri
MedExpQA is a multilingual benchmark for evaluating Large Language Models (LLMs) on Medical Question Answering (QA). It is the first such benchmark to include gold reference explanations written by medical doctors justifying both correct and incorrect answers. The benchmark builds on the Antidote CasiMedicos dataset, which contains clinical cases, multiple-choice questions, and gold reference explanations, and is available in Spanish, English, French, and Italian.

The benchmark includes three types of gold knowledge for grounding the models: full gold explanations (E), explanations of the incorrect options only (EI), and full explanations with hidden references (H). It also evaluates automatic knowledge retrieval through Retrieval Augmented Generation (RAG) methods, including MEDRAG, in both zero-shot and fine-tuned settings.

The results show that LLMs, even when equipped with RAG, remain far from matching the performance of models grounded with gold knowledge, and that performance drops for languages other than English, pointing to the need for further research on multilingual medical LLMs. Fine-tuning significantly improves performance, but RAG methods still leave considerable room for improvement, which underlines the importance of high-quality gold explanations for evaluating LLMs in Medical QA. Data, code, and fine-tuned models are publicly available for reproducibility and benchmarking.
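To make the grounding conditions concrete, the following is a minimal, hypothetical sketch of how a multiple-choice item from a benchmark like this could be turned into prompts under different knowledge settings (no grounding, gold explanation, retrieved passages) and scored for accuracy. The class, field, and function names here are illustrative assumptions and do not reflect the actual CasiMedicos data schema or the authors' evaluation code.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class MedExpQAItem:
    """One multiple-choice item: a clinical case, candidate answers,
    the correct option, and an optional doctor-written explanation.
    (Hypothetical schema for illustration only.)"""
    clinical_case: str
    options: Dict[str, str]            # e.g. {"A": "...", "B": "..."}
    correct_option: str                # e.g. "C"
    gold_explanation: Optional[str] = None


def build_prompt(item: MedExpQAItem, grounding: str = "none",
                 retrieved_passages: Optional[List[str]] = None) -> str:
    """Assemble a prompt under one grounding condition:
    'none' (plain zero-shot), 'gold' (gold explanation prepended),
    or 'rag' (retrieved passages prepended)."""
    context = ""
    if grounding == "gold" and item.gold_explanation:
        context = f"Reference explanation:\n{item.gold_explanation}\n\n"
    elif grounding == "rag" and retrieved_passages:
        context = "Retrieved evidence:\n" + "\n".join(retrieved_passages) + "\n\n"
    options_text = "\n".join(f"{k}) {v}" for k, v in sorted(item.options.items()))
    return (
        f"{context}Clinical case:\n{item.clinical_case}\n\n"
        f"Question: which option is correct?\n{options_text}\n"
        "Answer with a single option letter."
    )


def accuracy(items: List[MedExpQAItem], predictions: List[str]) -> float:
    """Fraction of items where the predicted option letter matches gold."""
    correct = sum(pred.strip().upper() == item.correct_option
                  for item, pred in zip(items, predictions))
    return correct / len(items) if items else 0.0


if __name__ == "__main__":
    item = MedExpQAItem(
        clinical_case="A 54-year-old patient presents with ...",
        options={"A": "Option one", "B": "Option two", "C": "Option three"},
        correct_option="B",
        gold_explanation="Option B is correct because ...",
    )
    print(build_prompt(item, grounding="gold"))
    print("Accuracy:", accuracy([item], ["B"]))
```

In a real evaluation, the predictions would come from an LLM queried with each prompt, and accuracy would be compared across the grounding conditions (none, gold explanation, RAG) and across the four languages.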