2024 | Anne Herrmann-Werner, Teresa Festl-Wietek, Friederike Holderried, Lea Herschbach, Jan Griewatz, Ken Masters, Stephan Zipfel, Moritz Mahling
This study evaluates GPT-4's performance on psychosomatic medicine multiple-choice questions using a mixed-methods approach, comparing its answers to those of medical students. The results show that GPT-4 achieved high accuracy, with 93% and 91% success rates for detailed and short prompts, respectively. Incorrect answers were primarily categorized under the "remember" and "understand" levels of Bloom's taxonomy, indicating issues with recalling specific facts and understanding conceptual relationships. GPT-4's lowest performance was 78.9%, still exceeding the pass threshold. Qualitative analysis revealed that GPT-4 often failed to apply concepts to new situations, leading to errors in reasoning. These errors were attributed to model biases and the tendency to generate outputs that maximize likelihood. The study highlights that while GPT-4 performs well in answering questions, it occasionally overlooks specific facts, provides illogical reasoning, or fails to apply concepts appropriately. The findings suggest that GPT-4's performance aligns with previous studies, but its limitations in handling complex, context-dependent tasks remain. The study underscores the importance of verifying AI-generated responses, especially in medical contexts, and suggests that future research should focus on improving model training to address these shortcomings.
[slides] Assessing ChatGPT’s Mastery of Bloom’s Taxonomy Using Psychosomatic Medicine Exam Questions: Mixed-Methods Study | StudySpace