Large Language Model Capabilities in Perioperative Risk Prediction and Prognostication

This study investigates whether general-domain large language models (LLMs), such as GPT-4 Turbo, can perform perioperative risk stratification and prognostication by analyzing clinical notes and procedure descriptions from electronic health records (EHRs). It evaluates LLM performance on eight tasks: ASA Physical Status Classification, hospital admission, ICU admission, unplanned admission, hospital mortality, and duration prediction for PACU Phase 1, hospital, and ICU stays. On classification tasks, LLMs reached F1 scores of 0.50 for ASA Physical Status Classification, 0.81 for ICU admission, and 0.86 for hospital mortality. Performance on the duration prediction tasks, by contrast, was poor across all prompt strategies. Few-shot and chain-of-thought prompting improved performance for several tasks, but LLMs struggled with numerical predictions, often producing values that were not clinically meaningful. LLM-generated summaries of clinical notes performed similarly to the original notes in some tasks but offered advantages in scaling to large numbers of in-context examples. The results suggest that LLMs can assist clinicians in perioperative risk stratification, particularly for classification tasks, and can generate high-quality natural language summaries and explanations. They still have difficulty accurately predicting continuous-valued outcomes such as hospital and ICU stay durations. The study highlights the potential of LLMs in clinical settings and underscores the need for further research to improve their performance on numerical prediction tasks and to validate their effectiveness in real-world clinical scenarios.

3 Jan 2024 | Philip Chung, Christine T. Fong, Andrew M. Walters, Nima Aghaeepour, Meliha Yetisgen, Vikas N. O'Reilly-Shah