Assessing ChatGPT’s Mastery of Bloom’s Taxonomy Using Psychosomatic Medicine Exam Questions: Mixed-Methods Study

2024 | Anne Herrmann-Werner1,2, MME, Prof Dr Med; Teresa Festl-Wietek1, MSc, Dr Rer Nat; Friederike Holderried1,3, MME, Dr Med; Lea Herschbach1, MSc; Jan Griewatz1, MA; Ken Masters4, Prof Dr; Stephan Zipfel2, Prof Dr Med; Moritz Mahling1,5, Dr Med, MHBA
This study evaluates the performance of GPT-4 in answering psychosomatic medicine exam questions, analyzed through the lens of Bloom’s taxonomy. The researchers used a dataset of 307 multiple-choice questions from medical school exams and compared GPT-4’s answers with those of medical students. GPT-4 achieved a high success rate, answering 93% of questions correctly with detailed prompts and 91% with short prompts. Questions that GPT-4 answered correctly were, on average, statistically more difficult than those it answered incorrectly. Qualitative analysis revealed that most errors occurred at the "remember" and "understand" cognitive levels, with specific issues in recalling details, understanding conceptual relationships, and adhering to standardized guidelines. The study highlights the potential of GPT-4 in medical education but also underscores the need for caution due to the model’s occasional hallucinations and biases. The findings provide critical insights into the practical applications and limitations of large language models in medical education.