Reliability of ChatGPT for performing triage task in the emergency department using the Korean Triage and Acuity Scale

2024 | Jae Hyuk Kim, Sun Kyung Kim, Jongmyung Choi and Youngho Lee
This study evaluates the reliability of ChatGPT in determining emergency department (ED) triage accuracy using the Korean Triage and Acuity Scale (KTAS). Two hundred and two virtual patient cases were created, and the gold standard triage classification was established by an experienced ED physician. Three human raters (ED paramedics) and two versions of ChatGPT (3.5 and 4.0) rated the cases. Inter-rater reliability was assessed using Fleiss' kappa and the intra-class correlation coefficient (ICC). The results showed substantial inter-rater reliability among human raters (kappa = .646). However, agreement between human raters and ChatGPT was lower, with kappa values of .320 for ChatGPT 3.5 and .523 for ChatGPT 4.0. The ICC for ChatGPT 3.5 was moderate (.520), while it was good (.802) for ChatGPT 4.0. The study concluded that while ChatGPT has potential in ED settings, further improvements are needed to enhance its accuracy and reliability, particularly in handling complex scenarios and specific triage levels.
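As a minimal sketch of the agreement statistic used in the study, the following pure-Python function computes Fleiss' kappa for a fixed number of raters per case. The rating data below are hypothetical KTAS-style examples for illustration only, not the study's actual 202 cases:

```python
from collections import Counter

def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for multiple raters assigning categorical labels.

    ratings: list of per-subject rating lists (one label per rater)
    categories: iterable of all possible category labels
    """
    n_subjects = len(ratings)
    n_raters = len(ratings[0])
    cats = list(categories)

    # n_ij: number of raters assigning subject i to category j
    counts = [[Counter(row)[c] for c in cats] for row in ratings]

    # Per-subject observed agreement P_i, then mean observed agreement
    p_i = [(sum(n * n for n in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    p_bar = sum(p_i) / n_subjects

    # Chance agreement P_e from marginal category proportions
    p_j = [sum(row[j] for row in counts) / (n_subjects * n_raters)
           for j in range(len(cats))]
    p_e = sum(p * p for p in p_j)

    return (p_bar - p_e) / (1 - p_e)

# Hypothetical example: 5 virtual cases, 3 raters, KTAS levels 1-5
ratings = [
    [1, 1, 1],
    [2, 2, 3],
    [3, 3, 3],
    [4, 5, 4],
    [2, 2, 2],
]
print(round(fleiss_kappa(ratings, categories=range(1, 6)), 3))  # → 0.647
```

In practice a library routine such as `statsmodels.stats.inter_rater.fleiss_kappa` would typically be used; the hand-rolled version above just makes the observed-versus-chance-agreement calculation explicit.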