ChatGPT Vision for Radiological Interpretation: An Investigation Using Medical School Radiology Examinations

This study evaluates how well ChatGPT, a large language model with vision capabilities, interprets radiological images from medical school examinations. The research was conducted at Seoul National University College of Medicine using GPT-4-1106-vision-preview and covered examinations from three academic years (2018-2020). The examinations consisted of multiple-choice questions, both text-only and image-based, given to third-year medical students. A board-certified radiologist assessed ChatGPT's responses on a 5-point scale, and its performance was compared to that of students. Key findings include:

- ChatGPT scored significantly lower than students in all three years, with its scores ranking in the bottom percentiles.
- ChatGPT performed worse on image-based questions than on text-only questions, with 42% of its interpretations rated as poor or very poor.
- The consistency of ChatGPT's responses across three sessions was moderate and did not differ significantly from that of students.
- The study suggests that while ChatGPT has potential, it currently underperforms in radiological interpretation and needs improvement before it can be used reliably in clinical practice.

The study acknowledges two limitations: it used a single institution's examinations, and language barriers may have affected ChatGPT's performance. Despite these limitations, the findings suggest that further development and customization of ChatGPT could enhance its utility in radiology.
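The two headline measures in the summary are where a score ranks within the student cohort and what share of interpretations were rated poor or very poor. They can be illustrated with a minimal sketch. The function names, the coding of the rating scale (1 = very poor through 5 = very good), and every number below are assumptions made for illustration; none of them come from the study, and the authors' actual statistical method is not described here.

```python
from bisect import bisect_left


def percentile_rank(score, cohort_scores):
    """Return the percentage of cohort scores strictly below `score`."""
    ordered = sorted(cohort_scores)
    below = bisect_left(ordered, score)  # count of scores < score
    return 100.0 * below / len(ordered)


def share_poor(ratings, threshold=2):
    """Return the fraction of ratings at or below `threshold`.

    Assumes a 5-point scale coded 1 = very poor, 2 = poor, ..., 5 = very good.
    """
    return sum(r <= threshold for r in ratings) / len(ratings)


# Hypothetical values for illustration only; NOT data from the study.
student_scores = [62, 70, 74, 78, 81, 85, 88, 90, 93, 96]
model_score = 65
ratings = [1, 2, 2, 3, 4, 5, 3, 2, 1, 4]

print(percentile_rank(model_score, student_scores))
print(share_poor(ratings))
```

A strict "below" count is only one of several percentile-rank conventions. Handling ties differently (for example, counting half of the tied scores) would shift the result slightly.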

2024 | Hyunjin Kim, Paul Kim, Ijin Joo, Jung Hoon Kim, Chang Min Park, Soon Ho Yoon