25 Jun 2024 | Hanjie Chen, Zhouxiang Fang, Yash Singla, Mark Dredze
This paper addresses the limitations of current benchmarks for evaluating large language models (LLMs) in medical question answering (QA). While LLMs have demonstrated impressive performance in answering medical questions, such as achieving passing scores on medical licensing examinations, these tasks do not fully capture the complexity of real-world clinical cases. Additionally, the lack of reference explanations makes it difficult to evaluate the reasoning behind model decisions, which is crucial for supporting doctors in making complex medical decisions.
To address these challenges, the authors construct two new datasets: JAMA Clinical Challenge and Medbullets. JAMA Clinical Challenge consists of questions based on challenging clinical cases, while Medbullets comprises simulated clinical questions. Both datasets are structured as multiple-choice question-answering tasks, accompanied by expert-written explanations. The authors evaluate seven LLMs on these datasets using various prompts and find that the datasets are harder than previous benchmarks. Human and automatic evaluations of model-generated explanations highlight the promise and deficiencies of LLMs in explainable medical QA.
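To make the task format concrete, here is a minimal sketch of what a multiple-choice record with an expert explanation might look like and how it could be turned into a prompt. The field names and helper function are illustrative assumptions, not the datasets' actual schema or the authors' prompting code.

```python
# Hypothetical sketch of a single record and a basic multiple-choice prompt.
# Field names ("question", "options", "answer", "explanation") are assumptions,
# not the published datasets' actual schema.

record = {
    "question": "A 54-year-old man presents with ... What is the most likely diagnosis?",
    "options": {"A": "Option A ...", "B": "Option B ...", "C": "Option C ...", "D": "Option D ..."},
    "answer": "C",
    "explanation": "Expert-written rationale explaining why option C is correct ...",
}

def build_prompt(rec: dict) -> str:
    """Format a record as a prompt asking for both an answer and an explanation."""
    options = "\n".join(f"{label}. {text}" for label, text in rec["options"].items())
    return (
        f"{rec['question']}\n{options}\n"
        "Choose the single best answer and explain your reasoning."
    )

print(build_prompt(record))
```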
The paper also discusses the limitations of existing benchmarks and the need for more realistic and challenging clinical benchmarks. It explores different prompting strategies and evaluation metrics to assess model performance and explainability. The results show that while LLMs produce promising explanations, they also exhibit deficiencies such as irrelevance and errors. The weak correlation between human and automatic scores underscores the necessity of developing metrics that better align with human judgments.
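As a rough illustration of the agreement check described above, the sketch below correlates human ratings of model explanations with an automatic metric using Spearman's rank correlation. The choice of Spearman and all score values are assumptions for demonstration only, not results or methodology taken from the paper.

```python
# Minimal sketch: correlate human ratings of explanations with an automatic
# metric. All numbers are made-up placeholders, not results from the paper.
from scipy.stats import spearmanr

human_scores = [4, 2, 5, 3, 1, 4, 2, 5]  # e.g., 1-5 Likert ratings from annotators
automatic_scores = [0.71, 0.65, 0.80, 0.66, 0.64, 0.58, 0.62, 0.77]  # e.g., reference-overlap scores

rho, p_value = spearmanr(human_scores, automatic_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
# A low rho would indicate the automatic metric tracks human judgment poorly.
```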
Overall, the introduction of these new datasets represents a significant step forward in evaluating LLMs' capabilities in complex clinical scenarios and highlights the ongoing need for research to improve explainable medical QA.