2024 | Friederike Holderried, MD, MME; Christian Stegemann-Philipps, Dr rer nat; Anne Herrmann-Werner, Prof Dr Med, MME; Teresa Festl-Wietek, Dr rer nat; Martin Holderried, Prof Dr, Dr med; Carsten Eickhoff, Prof Dr; Moritz Mahling, MD, MHBA
This study evaluates the effectiveness of a Generative Pretrained Transformer (GPT-4) model in providing structured feedback on medical students' performance in history taking with a simulated patient. The research aimed to assess the realism and educational value of GPT-4's feedback compared to human raters. The study involved 106 medical students who interacted with a GPT-powered chatbot designed to simulate patient responses and provide immediate feedback on the comprehensiveness of their history taking. The chatbot's role-play and responses were found to be medically plausible in over 99% of cases, with high interrater reliability (Cohen κ=0.832) between GPT-4 and human raters. However, some feedback categories showed lower agreement (κ≤0.6), indicating specific areas where the model's assessments diverged from human judgment.
The study concludes that GPT-4 can effectively provide structured feedback on history-taking dialogs, suggesting its potential as a valuable tool in medical education. Despite some limitations, the findings advocate for the careful integration of AI-driven feedback mechanisms in medical training.