12 July 2023 | Karan Singhal, Shekoofeh Azizi, Tao Tu, S. Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, Perry Payne, Martin Seneviratne, Paul Gamble, Chris Kelly, Abubakr Babiker, Nathanael Schärli, Aakanksha Chowdhery, Philip Mansfield, Dina Demner-Fushman, Blaise Agüera y Arcas, Dale Webster, Greg S. Corrado, Yossi Matias, Katherine Chou, Juraj Gottweis, Nenad Tomasev, Yun Liu, Alvin Rajkomar, Joelle Barral, Christopher Semturs, Alan Karthikesalingam, Vivek Natarajan
Large language models (LLMs) have demonstrated impressive capabilities, but applications in clinical medicine demand a high bar for safety and accuracy. This study introduces MultiMedQA, a benchmark combining six existing medical question-answering datasets, spanning professional medical exams, research, and consumer health queries, with HealthSearchQA, a new dataset of medical questions searched online. The authors propose a human evaluation framework that assesses model answers along multiple axes, including factuality, comprehension, reasoning, potential harm, and bias. They evaluate PaLM, a 540-billion-parameter LLM, and its instruction-tuned variant, Flan-PaLM, on MultiMedQA. Flan-PaLM achieves state-of-the-art accuracy on the multiple-choice datasets, including 67.6% on MedQA (US Medical Licensing Exam-style questions), surpassing the prior state of the art by more than 17%. However, human evaluation reveals key gaps in its long-form answers, motivating instruction prompt tuning, a parameter-efficient approach for aligning LLMs with new domains. The resulting model, Med-PaLM, performs encouragingly but still falls short of clinicians. The study highlights the need for further improvements in safety, equity, and bias before LLMs become viable for clinical applications. Key contributions include the MultiMedQA benchmark, the evaluation of Flan-PaLM and Med-PaLM, and the introduction of instruction prompt tuning. The results suggest that while LLMs show promise in medicine, significant challenges remain in ensuring their safety and effectiveness in real-world clinical settings.
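To make the instruction prompt tuning idea concrete, below is a minimal sketch of soft-prompt tuning in the same spirit: a small set of learnable prompt embeddings is prepended to the input while the LLM itself stays frozen. The checkpoint name, prompt length, and training example here are illustrative assumptions, not details from the paper; Med-PaLM tuned soft prompts for the proprietary 540B Flan-PaLM using clinician-written exemplars.

```python
# Sketch of instruction prompt tuning (soft-prompt tuning), assuming a
# generic Hugging Face causal LM. All names below are illustrative.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"          # stand-in; the paper uses Flan-PaLM 540B
NUM_PROMPT_TOKENS = 100      # assumed soft-prompt length

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Freeze every model parameter: only the soft prompt is trained,
# which is what makes the method parameter-efficient.
for p in model.parameters():
    p.requires_grad = False

embed = model.get_input_embeddings()
# Learnable prompt vectors, initialized from existing token embeddings.
soft_prompt = nn.Parameter(embed.weight[:NUM_PROMPT_TOKENS].clone().detach())

def forward_with_prompt(input_ids, labels):
    """Prepend the soft prompt to the embedded input and run the frozen LM."""
    tok_embeds = embed(input_ids)                             # (B, T, D)
    batch = input_ids.size(0)
    prompt = soft_prompt.unsqueeze(0).expand(batch, -1, -1)   # (B, P, D)
    inputs_embeds = torch.cat([prompt, tok_embeds], dim=1)
    # -100 tells the loss to ignore the soft-prompt positions.
    prompt_labels = torch.full(
        (batch, NUM_PROMPT_TOKENS), -100, dtype=labels.dtype
    )
    full_labels = torch.cat([prompt_labels, labels], dim=1)
    return model(inputs_embeds=inputs_embeds, labels=full_labels)

optimizer = torch.optim.AdamW([soft_prompt], lr=1e-3)

# One illustrative update step on a toy medical QA exemplar.
batch = tokenizer(
    "Q: What is a common cause of anemia? A: Iron deficiency.",
    return_tensors="pt",
)
out = forward_with_prompt(batch["input_ids"], batch["input_ids"].clone())
out.loss.backward()
optimizer.step()
```

Because only the prompt embeddings receive gradients, the tuned artifact is a few hundred vectors rather than billions of weights, which is why this style of alignment is cheap enough to apply to very large frozen models.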