20 Feb 2024 | Samuel Schmidgall, Carl Harris, Ime Essien, Daniel Olshvang, Tawsifur Rahman, Ji Woong Kim, Rojin Ziaei, Jason Eshraghian, Peter Abadir, and Rama Chellappa
This paper addresses cognitive bias in medical large language models (LLMs) and evaluates their performance under biased conditions. The authors develop a benchmark called BiasMedQA to assess LLMs' resilience to cognitive biases in clinical decision-making tasks. They evaluate six LLMs (GPT-4, Mixtral-8x7B, GPT-3.5, PaLM-2, Llama 2 70B-chat, and PMC Llama 13B) on 1,273 questions from the US Medical Licensing Exam (USMLE) Steps 1, 2, and 3, modified to include common clinically relevant cognitive biases. The results show that while GPT-4 is relatively robust, models such as Llama 2 70B-chat and PMC Llama 13B are significantly affected by these biases. The study highlights the need for bias mitigation in medical LLMs to ensure safer and more reliable applications in healthcare. The authors also propose three mitigation strategies: bias education, one-shot bias demonstration, and few-shot bias demonstration, which show varying degrees of effectiveness. The paper concludes by emphasizing the importance of further research into the robustness of medical LLMs and the potential of these models to shape the future of accessible healthcare.
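To make the setup concrete, below is a minimal Python sketch of how a BiasMedQA-style prompt could be assembled: a cognitive-bias cue is appended to a USMLE question, and the three mitigation strategies are applied as prompt prefixes. The function names, prompt wording, and example bias cue are illustrative assumptions, not the authors' actual templates.

```python
from typing import List, Optional

# Illustrative sketch (not the authors' code) of bias injection and the three
# mitigation strategies described in the paper: bias education, one-shot bias
# demonstration, and few-shot bias demonstration.

def inject_bias(question: str, bias_cue: str) -> str:
    """Append a clinically flavored bias cue (e.g., a recency-bias hint)
    to an otherwise unmodified USMLE question."""
    return f"{question}\n{bias_cue}"

def build_prompt(biased_question: str, strategy: str,
                 demos: Optional[List[str]] = None) -> str:
    """Wrap the biased question with one of the mitigation strategies:
    'education' warns the model about cognitive bias; 'one_shot' and
    'few_shot' prepend worked examples where the bias is resisted."""
    if strategy == "education":
        prefix = ("Be aware that the question may contain a cognitive bias "
                  "(e.g., recency, confirmation, anchoring); answer based on "
                  "the clinical evidence alone.\n\n")
        return prefix + biased_question
    if strategy in ("one_shot", "few_shot"):
        k = 1 if strategy == "one_shot" else len(demos or [])
        examples = "\n\n".join((demos or [])[:k])
        return f"{examples}\n\n{biased_question}"
    return biased_question  # baseline: no mitigation

if __name__ == "__main__":
    question = ("A 54-year-old man presents with chest pain... "
                "Which of the following is the most likely diagnosis?\n"
                "(A) ... (B) ... (C) ... (D) ...")
    # Hypothetical recency-bias cue: a memorable recent case nudges the model
    # toward a specific (possibly incorrect) answer choice.
    cue = "Your last patient with similar symptoms was recently diagnosed with (B)."
    biased = inject_bias(question, cue)
    print(build_prompt(biased, "education"))
```

In the benchmark itself, model accuracy on these bias-injected questions is compared against accuracy on the unmodified questions; the sketch only shows how such prompts might be constructed.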