Large language models encode clinical knowledge

3 August 2023 | Karan Singhal, Shekoofeh Azizi, Tao Tu, S. Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, Perry Payne, Martin Seneviratne, Paul Gamble, Chris Kelly, Abubakar Babiker, Nathanael Schärli, Aakanksha Chowdhery, Philip Mansfield, Dina Demner-Fushman, Blaise Agüera y Arcas, Dale Webster, Greg S. Corrado, Yossi Matias, Katherine Chou, Juraj Gottweis, Nenad Tomasev, Yun Liu, Alvin Rajkomar, Joelle Barral, Christopher Semturs, Alan Karthikesalingam, Vivek Natarajan
Large language models (LLMs) have shown impressive capabilities, but clinical applications remain challenging. This study introduces MultiMedQA, a benchmark combining six existing medical question-answering datasets with HealthSearchQA, a new dataset of commonly searched online health questions. The researchers evaluated PaLM and its instruction-tuned variant Flan-PaLM on MultiMedQA, finding that Flan-PaLM achieved state-of-the-art accuracy on the multiple-choice datasets, including 67.6% on MedQA (US Medical Licensing Exam-style questions). However, human evaluations of long-form answers revealed key gaps in factuality, potential for harm, and bias. To address these gaps, the authors introduced instruction prompt tuning, a parameter-efficient alignment technique, producing Med-PaLM. Med-PaLM's answers aligned better with scientific consensus and were less likely to cause harm than Flan-PaLM's, and were comparable to clinician-written answers on several evaluation axes, though they still fell short of clinical expertise overall. The study highlights the promise of LLMs for medical question answering while emphasizing the importance of human evaluation frameworks and further method development to ensure such models are safe, fair, and equitable before clinical use.
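The multiple-choice portion of MultiMedQA (e.g. the 67.6% MedQA figure above) is scored by simple exact-match accuracy over answer options. The sketch below illustrates that scoring loop; `model_answer` is a hypothetical placeholder for querying an LLM, and the sample questions are illustrative, not drawn from the actual datasets.

```python
# Minimal sketch of multiple-choice accuracy scoring, as used for
# benchmarks like MedQA. All names and data here are illustrative.

def accuracy(predictions, references):
    """Fraction of questions where the predicted option matches the answer key."""
    assert len(predictions) == len(references)
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

# Illustrative items: (question, options, answer key).
sample_questions = [
    ("Q1", ["A", "B", "C", "D"], "B"),
    ("Q2", ["A", "B", "C", "D"], "D"),
    ("Q3", ["A", "B", "C", "D"], "B"),
]

def model_answer(question, options):
    # Hypothetical stand-in for an LLM call: always picks option "B".
    return "B"

preds = [model_answer(q, opts) for q, opts, _ in sample_questions]
keys = [key for _, _, key in sample_questions]
print(f"accuracy = {accuracy(preds, keys):.3f}")
```

In practice the reported benchmarks extract the model's chosen option letter from its generated text before scoring, but the final metric reduces to this exact-match accuracy.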