ChatGPT's Performance on the Hand Surgery Self-Assessment Exam: A Critical Analysis

January 2, 2024 | Yuri Han, BA,* Hassaam S. Choudhry, BA,† Michael E. Simon, MD,* Brian M. Katt, MD*
This study evaluates the performance of ChatGPT, a large language model developed by OpenAI, on the American Society for Surgery of the Hand (ASSH) self-assessment exams from 2004 to 2013. The primary outcomes were ChatGPT's total score, its score on text-only questions, and its score on image-based questions. The secondary outcomes included the proportion of questions for which ChatGPT provided additional explanations, the length of those elaborations, and the number of questions ChatGPT answered with certainty.

**Methods:**
- **Data Source:** 10 self-assessment exams from 2004 to 2013 provided by the ASSH.
- **Question Types:** Text-only questions (1,127) and image-based questions (456).
- **Performance Metrics:** Total score, score on text-only questions, score on image-based questions, proportion of questions with elaborations, average length of elaborations, and percentage of confident and unconfident answers.

**Results:**
- Of 1,583 questions, ChatGPT answered 573 (36.2%) correctly.
- ChatGPT performed better on text-only questions (39.2% correct) than on image-based questions (28.7% correct).
- There was no significant difference in the proportion of elaborations between text-only and image-based questions.
- The average length of elaborations was longer for image-based questions.
- Of 1,441 confident answers, 548 (38.0%) were correct, while of 142 unconfident answers, only 25 (17.6%) were correct.
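The reported percentages can be reproduced directly from the stated counts. The following minimal Python sketch is a sanity check only: the total, confident, and unconfident counts come from the Results above, while the per-category correct counts for text-only (442) and image-based (131) questions are back-calculated from the reported percentages and are therefore assumptions, not figures stated in the study.

```python
# Sanity check of the proportions reported in the Results section.
# NOTE: the text_only and image_based correct counts are back-calculated
# from the reported 39.2% and 28.7%; they are not stated in the study.
counts = {
    "total":       (573, 1583),   # correct answers, questions attempted
    "text_only":   (442, 1127),   # assumed: ~39.2% of 1,127
    "image_based": (131, 456),    # assumed: ~28.7% of 456
    "confident":   (548, 1441),
    "unconfident": (25, 142),
}

for label, (correct, total) in counts.items():
    print(f"{label:>12}: {correct}/{total} = {correct / total:.1%}")
```

Running this prints 36.2%, 39.2%, 28.7%, 38.0%, and 17.6%, matching the Results; the back-calculated category counts also sum to the stated 573 total correct (442 + 131), which is consistent with the reported breakdown.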
**Conclusions:**
- ChatGPT performed poorly on the ASSH self-assessment exams, with a total score of 36.2%.
- It performed better on text-only questions but still would not have earned continuing medical education credit from the ASSH or the American Board of Surgery.
- Even its highest single-year score, 44% on the 2012 exam, would not have passed the examination.
- Medical professionals, trainees, and patients should use ChatGPT with caution given its limited proficiency in hand subspecialty knowledge.

**Clinical Relevance:**
- The study highlights the need for caution in using AI tools such as ChatGPT for medical education and certification preparation, especially in specialized fields like hand surgery.