[slides] Performance of Large Language Models on Medical Oncology Examination Questions

This study evaluates the performance of large language models (LLMs) on medical oncology examination questions. The researchers tested eight LLMs, two proprietary (ChatGPT-3.5 and ChatGPT-4) and six open-source, on 147 questions drawn from the American Society of Clinical Oncology (ASCO), the European Society for Medical Oncology (ESMO), and original questions written by the authors. Proprietary LLM 2 correctly answered 85.0% of the questions, outperforming proprietary LLM 1 and the best open-source model. However, 81.8% of incorrect answers were rated as having a medium or high likelihood of moderate to severe harm if acted upon in clinical practice. The explanations provided by proprietary LLM 2 contained no or minor errors for 93.9% of the questions. Incorrect responses were most commonly associated with errors in information retrieval, particularly for recent publications, followed by erroneous reasoning and reading comprehension. The study highlights the potential of LLMs to improve healthcare by providing accurate information, but it also raises safety concerns because incorrect answers could cause harm. The results suggest that although LLMs can accurately answer complex medical oncology questions, their errors may have serious consequences. The study recommends further research to improve the safety and reliability of LLMs in medical oncology.

2024 | Jack B. Longwell, HBSc; Ian Hirsch, MD, MSc; Fernando Binder, MD, MPH; Galileo Arturo Gonzalez Conchas, MD; Daniel Mau, HBSc; Raymond Jang, MD, MSc; Rahul G. Krishnan, PhD; Robert C. Grant, MD, PhD