3 Jan 2024 | Philip Chung, Christine T. Fong, Andrew M. Walters, Nima Aghaeepour, Meliha Yetisgen, Vikas N. O'Reilly-Shah
This study investigates the capabilities of general-domain large language models (LLMs) such as GPT-4 Turbo in performing perioperative risk prediction and prognostication using clinical notes and procedure descriptions. The researchers examined eight different tasks: ASA Physical Status Classification, hospital admission, ICU admission, unplanned admission, hospital mortality, PACU Phase 1 duration, hospital duration, and ICU duration. They found that few-shot and chain-of-thought (CoT) prompting improved predictive performance for several tasks. The best F1 scores were achieved for ASA Physical Status Classification (0.50), ICU admission (0.81), and hospital mortality (0.86). However, performance on duration prediction tasks was generally poor across all prompt strategies. The study also explored the impact of note length and the use of LLM-generated summaries on predictive performance. Overall, the results indicate that current LLMs can assist clinicians in perioperative risk stratification, particularly in classification tasks, and produce high-quality natural language summaries and explanations. However, they still struggle with regression tasks involving continuous outcomes. Future research should focus on developing more advanced prompting strategies and exploring the potential of domain-specific LLMs to improve performance in these areas.
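To make the prompting strategies concrete, the following is a minimal sketch of how a zero-shot, few-shot, or few-shot chain-of-thought prompt might be assembled from a procedure description and a clinical note. It is not the prompt format used in the study: the `Example` class, the `build_prompt` function, the field labels, and all case text are hypothetical placeholders invented for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    """One worked demonstration case (hypothetical structure, not from the study)."""
    procedure: str
    note: str
    reasoning: str
    answer: str


def build_prompt(question: str, examples: List[Example], procedure: str,
                 note: str, chain_of_thought: bool = True) -> str:
    """Assemble a plain-text prompt.

    An empty `examples` list gives a zero-shot prompt; a non-empty list gives
    a few-shot prompt. With chain_of_thought=True each demonstration shows its
    reasoning before the answer, and the prompt ends by asking for reasoning.
    """
    parts = [question.strip()]
    for i, ex in enumerate(examples, start=1):
        parts.append(f"Example {i}\nProcedure: {ex.procedure}\nNote: {ex.note}")
        if chain_of_thought:
            parts.append(f"Reasoning: {ex.reasoning}")
        parts.append(f"Answer: {ex.answer}")
    parts.append(f"New case\nProcedure: {procedure}\nNote: {note}")
    # The final cue tells the model what to produce first.
    parts.append("Reasoning:" if chain_of_thought else "Answer:")
    return "\n\n".join(parts)


# Placeholder demonstration; the clinical text is invented, not study data.
demo = Example(
    procedure="<procedure description>",
    note="<clinical note text>",
    reasoning="<step-by-step rationale>",
    answer="Yes",
)

few_shot_cot = build_prompt(
    "Will this patient require ICU admission after surgery? Answer Yes or No.",
    [demo],
    procedure="<new procedure description>",
    note="<new clinical note text>",
)
zero_shot = build_prompt(
    "Will this patient require ICU admission after surgery? Answer Yes or No.",
    [],
    procedure="<new procedure description>",
    note="<new clinical note text>",
    chain_of_thought=False,
)
```

The resulting string would be sent to whichever LLM is under evaluation. Keeping prompt construction separate from the model call makes it straightforward to compare strategies on the same cases.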