12 July 2023 | Karan Singhal, Shekoofeh Azizi, Tao Tu, S. Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, Perry Payne, Martin Seneviratne, Paul Gamble, Chris Kelly, Abubakr Babiker, Nathanael Schärli, Aakanksha Chowdhery, Philip Mansfield, Dina Demner-Fushman, Blaise Agüera y Arcas, Dale Webster, Greg S. Corrado, Yossi Matias, Katherine Chou, Juraj Gottweis, Nenad Tomasev, Yun Liu, Alvin Rajkomar, Joelle Barral, Christopher Semturs, Alan Karthikesalingam, Vivek Natarajan
Large language models (LLMs) have demonstrated impressive capabilities, but applications in clinical medicine demand a high bar for safety and accuracy. This study introduces MultiMedQA, a benchmark combining six existing medical question-answering datasets, spanning professional medical exams, research, and consumer health queries, with HealthSearchQA, a new dataset of medical questions searched online. The authors propose a human evaluation framework that assesses model answers along multiple axes, including factuality, comprehension, reasoning, potential harm, and bias. They evaluate PaLM, a 540-billion-parameter LLM, and its instruction-tuned variant, Flan-PaLM, on MultiMedQA. Flan-PaLM achieves state-of-the-art accuracy on the multiple-choice datasets, including 67.6% on MedQA (US Medical Licensing Exam-style questions), surpassing the prior state of the art by more than 17%. However, human evaluation reveals key gaps in its long-form answers, motivating instruction prompt tuning, a parameter-efficient approach for aligning LLMs with new domains. The resulting model, Med-PaLM, performs encouragingly but still falls short of clinicians. The study highlights the need for further improvements in safety, equity, and bias before LLMs become viable for clinical applications. Key contributions include the MultiMedQA benchmark, the evaluation of Flan-PaLM and Med-PaLM, and the introduction of instruction prompt tuning. The results suggest that while LLMs show promise in medicine, significant challenges remain in ensuring their safety and effectiveness in real-world clinical settings.
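To make the instruction prompt tuning idea concrete, below is a minimal sketch of soft-prompt tuning in the same spirit: a small set of learnable prompt embeddings is prepended to the input while the LLM itself stays frozen. The checkpoint name, prompt length, and training example here are illustrative assumptions, not details from the paper; Med-PaLM tuned soft prompts for the proprietary 540B Flan-PaLM using clinician-written exemplars.

```python
# Sketch of instruction prompt tuning (soft-prompt tuning), assuming a
# generic Hugging Face causal LM. All names below are illustrative.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"          # stand-in; the paper uses Flan-PaLM 540B
NUM_PROMPT_TOKENS = 100      # assumed soft-prompt length

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Freeze every model parameter: only the soft prompt is trained,
# which is what makes the method parameter-efficient.
for p in model.parameters():
    p.requires_grad = False

embed = model.get_input_embeddings()
# Learnable prompt vectors, initialized from existing token embeddings.
soft_prompt = nn.Parameter(embed.weight[:NUM_PROMPT_TOKENS].clone().detach())

def forward_with_prompt(input_ids, labels):
    """Prepend the soft prompt to the embedded input and run the frozen LM."""
    tok_embeds = embed(input_ids)                             # (B, T, D)
    batch = input_ids.size(0)
    prompt = soft_prompt.unsqueeze(0).expand(batch, -1, -1)   # (B, P, D)
    inputs_embeds = torch.cat([prompt, tok_embeds], dim=1)
    # -100 tells the loss to ignore the soft-prompt positions.
    prompt_labels = torch.full(
        (batch, NUM_PROMPT_TOKENS), -100, dtype=labels.dtype
    )
    full_labels = torch.cat([prompt_labels, labels], dim=1)
    return model(inputs_embeds=inputs_embeds, labels=full_labels)

optimizer = torch.optim.AdamW([soft_prompt], lr=1e-3)

# One illustrative update step on a toy medical QA exemplar.
batch = tokenizer(
    "Q: What is a common cause of anemia? A: Iron deficiency.",
    return_tensors="pt",
)
out = forward_with_prompt(batch["input_ids"], batch["input_ids"].clone())
out.loss.backward()
optimizer.step()
```

Because only the prompt embeddings receive gradients, the tuned artifact is a few hundred vectors rather than billions of weights, which is why this style of alignment is cheap enough to apply to very large frozen models.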