25 Jun 2024 | Hanjie Chen*, Zhouxiang Fang*, Yash Singla, Mark Dredze
This paper introduces two new medical question-answering datasets, JAMA Clinical Challenge and Medbullets, designed to evaluate the performance of large language models (LLMs) in answering and explaining complex medical questions. The datasets contain expert-written explanations, making them suitable for assessing not only the accuracy of model answers but also the quality of their explanations. JAMA Clinical Challenge consists of clinical cases from the JAMA Network, while Medbullets includes simulated clinical questions from open-access tweets. Both datasets are structured as multiple-choice question-answering tasks with detailed explanations.
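To make the task format concrete, here is a minimal sketch of what one record in such a dataset might look like and how it could be rendered as a prompt. The field names and clinical content are illustrative assumptions, not the datasets' actual schema.

```python
# Hypothetical record illustrating the multiple-choice QA + explanation format.
# Field names and clinical content are illustrative assumptions, not the real schema.
example_record = {
    "question": "A 54-year-old man presents with progressive dyspnea and a dry cough ...",
    "options": {
        "A": "Start empiric antibiotics",
        "B": "Order high-resolution chest CT",
        "C": "Begin corticosteroid therapy",
        "D": "Refer for surgical lung biopsy",
    },
    "answer": "B",  # gold answer key (placeholder)
    "explanation": "High-resolution CT is the next step because ...",  # expert-written rationale
}

def format_question(record: dict) -> str:
    """Render a record as a plain multiple-choice prompt for an LLM."""
    options = "\n".join(f"{key}. {text}" for key, text in record["options"].items())
    return f"{record['question']}\n\n{options}\n\nAnswer:"

print(format_question(example_record))
```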
The authors evaluate seven LLMs, including both closed-source and open-source models, on these datasets. The results show that the new datasets are more challenging than previous benchmarks, highlighting the difficulty of answering complex medical questions. While in-context learning and prompting strategies yield modest gains, they do not substantially improve overall performance. Human and automatic evaluations of model-generated explanations reveal limitations in LLMs' ability to explain complex medical decisions, and the weak correlation between human and automatic scores underscores the need for better evaluation metrics for explainable medical QA.
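One way to quantify that agreement gap is a rank correlation between human ratings and an automatic metric computed over the same model explanations. The sketch below uses made-up placeholder scores, and the choice of Spearman correlation is an assumption for illustration, not necessarily the paper's exact procedure.

```python
# Minimal sketch: rank correlation between human ratings and an automatic metric
# (e.g., ROUGE or BERTScore) over the same set of model-generated explanations.
# All numbers are made-up placeholders, not results from the paper.
from scipy.stats import spearmanr

human_scores = [4, 2, 5, 3, 1, 4, 2, 3]                         # e.g., 1-5 expert ratings
automatic_scores = [0.41, 0.38, 0.44, 0.40, 0.35, 0.39, 0.42, 0.37]

rho, p_value = spearmanr(human_scores, automatic_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
# A rho near zero would reflect the weak human-automatic agreement described above.
```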
The study also explores different prompting strategies, including chain-of-thought (CoT) prompting, which improves model reasoning on some tasks but not all. Automatic evaluations of model explanations show that some models, such as GPT-4 and Llama 3, perform well, while others, such as MedAlpaca, struggle. Human evaluations further highlight the shortcomings of LLMs in generating accurate and relevant explanations.
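As a minimal sketch of the chain-of-thought prompting mentioned above, the function below contrasts a direct-answer prompt with a CoT prompt. The instruction wording and function are illustrative assumptions, not the paper's actual prompt templates.

```python
# Sketch contrasting a direct-answer prompt with a chain-of-thought (CoT) prompt.
# The instruction wording is an illustrative assumption, not the paper's template.
def build_prompt(question: str, options: dict, use_cot: bool = False) -> str:
    option_text = "\n".join(f"{key}. {text}" for key, text in options.items())
    if use_cot:
        # CoT: ask the model to reason through the case before committing to a choice.
        instruction = ("Let's think step by step about the case, then state "
                       "the single best answer choice and explain why.")
    else:
        # Direct: ask only for the answer letter.
        instruction = "State the single best answer choice."
    return f"{question}\n\n{option_text}\n\n{instruction}"
```

Few-shot in-context learning would follow the same pattern, prepending several worked examples in this format before the test question.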
The findings suggest that future research should focus on improving LLMs' ability to explain medical decisions in addition to making them. The datasets presented in this paper represent a new challenge for medical LLM research, as they reflect real-world clinical scenarios more closely than previous benchmarks.