July 30, 2024 | Iñigo Alonso, Maite Oronoz, Rodrigo Agerri
The paper introduces MedExpQA, a multilingual benchmark for evaluating Large Language Models (LLMs) on Medical Question Answering (QA). It is the first such benchmark to include reference gold explanations written by medical doctors, addressing the lack of expert-authored explanations in previous benchmarks. These explanations make it possible to evaluate LLMs under different grounding conditions: the full gold reference explanation, explanations of only the incorrect options, and explanations with explicit references hidden.

The benchmark builds on the Antidote CasiMedicos dataset, which provides Spanish, English, French, and Italian versions of Resident Medical Exams. Experiments cover both zero-shot and fine-tuned settings with state-of-the-art LLMs, including PMC-LLaMA, BioMistral, LLaMA-2, and Mistral.

Results show that while LLMs perform well when grounded with gold knowledge, they still leave significant room for improvement, especially in languages other than English. The study also highlights the effectiveness of fine-tuning and the limitations of automatic knowledge retrieval methods such as Retrieval Augmented Generation (RAG). The paper concludes by emphasizing the need for more multilingual LLMs and further research to improve medical QA performance across languages.
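To make the grounding idea concrete, the sketch below shows how a multiple-choice medical QA prompt might be assembled with or without a gold explanation and sent to an instruction-tuned model. It is a minimal illustration only: the model name, prompt wording, and answer-extraction heuristic are assumptions, not the paper's actual evaluation setup.

```python
# Hypothetical sketch of grounded multiple-choice evaluation.
# Model, prompt format, and answer parsing are illustrative assumptions.
from transformers import pipeline

generator = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")

def build_prompt(question, options, gold_explanation=None):
    """Compose a multiple-choice prompt, optionally grounded with a gold explanation."""
    parts = [f"Question: {question}"]
    if gold_explanation:  # the "gold knowledge" grounding condition
        parts.append(f"Clinical explanation: {gold_explanation}")
    parts.append("Options:")
    parts += [f"{i + 1}. {opt}" for i, opt in enumerate(options)]
    parts.append("Answer with the number of the correct option.")
    return "\n".join(parts)

prompt = build_prompt(
    question="A 55-year-old patient presents with chest pain on exertion ...",
    options=["Option A", "Option B", "Option C", "Option D", "Option E"],
    gold_explanation="The correct choice follows from the exertional pattern ...",
)
output = generator(prompt, max_new_tokens=8, do_sample=False)[0]["generated_text"]
predicted = output[len(prompt):].strip()  # crude extraction of the model's answer
print(predicted)
```

Dropping the `gold_explanation` argument yields the ungrounded zero-shot condition, so the same harness can compare performance with and without gold knowledge.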
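For contrast, an automatic retrieval pipeline of the kind the paper finds limiting could look roughly like the following sketch, where a retrieved passage stands in for the gold explanation. The toy corpus, embedding model, and top-k scoring are assumptions for illustration, not the paper's retrieval setup.

```python
# Minimal RAG-style retrieval sketch; corpus and encoder are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

# Toy stand-in for a medical knowledge base (e.g., textbook passages).
corpus = [
    "Beta-blockers are contraindicated in severe asthma.",
    "First-line treatment for type 2 diabetes is metformin.",
]
corpus_emb = encoder.encode(corpus, convert_to_tensor=True)

def retrieve(question, k=1):
    """Return the k corpus passages most similar to the question."""
    q_emb = encoder.encode(question, convert_to_tensor=True)
    hits = util.semantic_search(q_emb, corpus_emb, top_k=k)[0]
    return [corpus[h["corpus_id"]] for h in hits]

retrieved = retrieve("Which drug class should be avoided in asthmatic patients?")
# The retrieved passage would replace the gold explanation in the prompt above.
print(retrieved)
```

The gap the paper reports between this kind of automatically retrieved context and doctor-written gold explanations is precisely what motivates MedExpQA's grounded evaluation conditions.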