Benchmarking Open-Source Large Language Models, GPT-4 and Claude 2 on Multiple-Choice Questions in Nephrology
January 17, 2024 | Sean Wu, Michael Koo, Lesley Blum, Andy Black, Liyo Kao, Zhe Fei, Ph.D., Fabien Scalzo, Ph.D., Ira Kurtz, M.D.
This study evaluates the performance of several open-source and proprietary large language models (LLMs) in answering multiple-choice questions from the Nephrology Self-Assessment Program (nephSAP). The models compared include Llama2-70B, Koala 7B, Falcon 7B, Stable-Vicuna 13B, Orca-Mini 13B, GPT-4, and Claude 2. The nephSAP dataset consists of 858 questions covering various nephrology topics. The study found that open-source LLMs performed poorly, with overall success rates of 17.1% to 30.6%. In contrast, GPT-4 achieved a score of 73.3%, and Claude 2 scored 54.4%. The results highlight significant knowledge gaps in open-source LLMs for nephrology, which may limit their usefulness in medical training and patient care. The study also discusses the limitations of open-source LLMs, including the quality and quantity of training data, and suggests that domain-specific fine-tuning and enhanced reasoning capabilities are necessary for better performance. The findings underscore the potential for proprietary models like GPT-4 and Claude 2 to improve substantially in medical applications, particularly in subspecialty fields.
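To illustrate the kind of evaluation described above, the sketch below shows a minimal multiple-choice benchmarking loop in Python: for each question, the model is prompted, an answer letter is extracted from its response, and accuracy is tallied overall and by topic. This is an assumption-laden illustration, not the authors' actual pipeline; the MCQ fields, the `extract_choice` heuristic, and the `ask` callable are hypothetical stand-ins for the real nephSAP data and model APIs.

```python
import re
import random
from dataclasses import dataclass

@dataclass
class MCQ:
    prompt: str   # question stem plus lettered answer choices
    correct: str  # correct choice letter, e.g. "B"
    topic: str    # nephSAP topic area (hypothetical field name)

def extract_choice(response: str) -> str | None:
    """Pull the first standalone answer letter (A-E) out of a model response."""
    match = re.search(r"\b([A-E])\b", response.upper())
    return match.group(1) if match else None

def score(questions: list[MCQ], ask) -> dict[str, float]:
    """Compute accuracy overall and per topic; `ask` maps a prompt to a model response."""
    correct_by_topic: dict[str, int] = {}
    total_by_topic: dict[str, int] = {}
    for q in questions:
        choice = extract_choice(ask(q.prompt))
        total_by_topic[q.topic] = total_by_topic.get(q.topic, 0) + 1
        if choice == q.correct:
            correct_by_topic[q.topic] = correct_by_topic.get(q.topic, 0) + 1
    scores = {t: correct_by_topic.get(t, 0) / n for t, n in total_by_topic.items()}
    scores["overall"] = sum(correct_by_topic.values()) / len(questions)
    return scores

if __name__ == "__main__":
    # Dummy "model" that guesses at random -- a placeholder for GPT-4, Claude 2,
    # or an open-source model served locally.
    demo = [MCQ("Which treatment ...? A) ... B) ...", "B", "acid-base")] * 10
    print(score(demo, lambda prompt: random.choice("AB")))
```

In practice, the `ask` callable would wrap whichever API or local inference stack serves each model, which keeps the scoring logic identical across the proprietary and open-source systems being compared.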