3 August 2023 | Karan Singhal, Shekoofeh Azizi, Tao Tu, S. Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, Perry Payne, Martin Seneviratne, Paul Gamble, Chris Kelly, Abubakar Babiker, Nathanael Schärli, Aakanksha Chowdhery, Philip Mansfield, Dina Demner-Fushman, Blaise Agüera y Arcas, Dale Webster, Greg S. Corrado, Yossi Matias, Katherine Chou, Juraj Gottweis, Nenad Tomasev, Yun Liu, Alvin Rajkomar, Joelle Barral, Christopher Semturs, Alan Karthikesalingam, Vivek Natarajan
Large language models (LLMs) have demonstrated impressive capabilities, but applications in clinical medicine demand high standards of safety and accuracy. This study introduces MultiMedQA, a benchmark combining six existing medical question-answering datasets (spanning professional medical exams, research, and consumer health questions) with HealthSearchQA, a new dataset of medical questions searched for online. The authors propose a human evaluation framework that assesses model answers along multiple axes, including factuality, comprehension, reasoning, potential harm, and bias. They evaluate PaLM, a 540-billion-parameter LLM, and its instruction-tuned variant, Flan-PaLM, on MultiMedQA. Flan-PaLM achieves state-of-the-art accuracy on the multiple-choice datasets, including 67.6% on MedQA (USMLE-style questions), surpassing the prior state of the art by more than 17%. However, human evaluation reveals key gaps in Flan-PaLM's answers, motivating instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a handful of exemplars. The resulting model, Med-PaLM, performs encouragingly but still falls short of clinicians. Key contributions include the development of MultiMedQA, the evaluation of Flan-PaLM and Med-PaLM, and the introduction of instruction prompt tuning. While these results suggest that LLMs hold promise for medicine, significant work remains on safety, equity, and bias before they are viable for real-world clinical use.
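The instruction prompt tuning mentioned above is a parameter-efficient alignment technique: the LLM's weights stay frozen, and only a small set of continuous "soft" prompt vectors is learned (from clinician-curated exemplars) and prepended to the model's input alongside ordinary hard text prompts. Below is a minimal sketch of the soft-prompt mechanics, assuming a Hugging Face-style causal LM whose forward pass accepts `inputs_embeds`; the class name `SoftPromptWrapper` and the `num_prompt_tokens` value are illustrative choices, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SoftPromptWrapper(nn.Module):
    """Freeze a pretrained causal LM and learn only a short sequence of
    soft prompt embeddings prepended to every input (a hypothetical sketch
    of the mechanism underlying instruction prompt tuning)."""

    def __init__(self, base_model, embed_layer, num_prompt_tokens=100):
        super().__init__()
        self.base_model = base_model
        self.embed = embed_layer  # the LM's token-embedding layer (an nn.Embedding)
        for p in self.base_model.parameters():
            p.requires_grad = False  # base model weights stay frozen
        d_model = embed_layer.embedding_dim
        # The only trainable parameters: num_prompt_tokens x d_model floats,
        # tiny next to the hundreds of billions of weights in the base model.
        self.soft_prompt = nn.Parameter(0.02 * torch.randn(num_prompt_tokens, d_model))

    def forward(self, input_ids):
        tok_embeds = self.embed(input_ids)                       # (batch, seq, d_model)
        prompt = self.soft_prompt.unsqueeze(0).expand(tok_embeds.size(0), -1, -1)
        inputs_embeds = torch.cat([prompt, tok_embeds], dim=1)   # soft prompt comes first
        return self.base_model(inputs_embeds=inputs_embeds)
```

During training, only `self.soft_prompt` receives gradients (e.g., `torch.optim.Adam([wrapper.soft_prompt], lr=3e-3)`), with the language-modeling loss computed on the target answer tokens. This is what makes the approach parameter-efficient: on the order of millions of parameters are updated rather than the full 540 billion.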