Capability of GPT-4V(ision) in the Japanese National Medical Licensing Examination: Evaluation Study


2024 | Takahiro Nakao, Soichiro Miki, Yuta Nakamura, Tomohiro Kikuchi, Yukihiro Nomura, Shouhei Hanaoka, Takeharu Yoshikawa, Osamu Abe
This study evaluates the image recognition capability of GPT-4V, a multimodal large language model, on questions from the 117th Japanese National Medical Licensing Examination (NMLE). The researchers tested GPT-4V under two conditions: one with both the question text and the associated images, and one with the question text alone. Across the 108 questions that included images, GPT-4V achieved an accuracy of 68% with images and 72% without them (P=.36). For clinical questions, accuracy was 71% with images and 78% without (P=.21); for general questions, it was 30% with images and 20% without (P≥.99). These results suggest that the additional information in the images did not significantly improve GPT-4V's performance on the NMLE.

The study highlights that GPT-4V struggles to interpret medical images, even though it is a multimodal model capable of processing both text and images. For clinical questions, where sufficient textual information was available, GPT-4V could often answer correctly without the images. For general questions, where textual information was limited, accuracy was low even when images were provided. The researchers observed that GPT-4V often failed to interpret the images or supplied information that was not evident from the text. They also note that although GPT-4V performs well on general image recognition tasks, its performance on medical image interpretation is limited, possibly due to insufficient training on medical-related images.
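Because the same 108 questions were answered under both conditions, the with-image versus text-only comparison is a paired one, for which McNemar's exact test on the discordant pairs is a standard choice. The sketch below shows how such a comparison could be computed in Python; the 2×2 counts are hypothetical placeholders chosen only to roughly match the reported 68% and 72% accuracies (the summary gives only percentages and P values), and the use of this particular test is an assumption, not something the summary confirms.

```python
# Minimal sketch of a paired accuracy comparison via McNemar's exact test.
# The counts below are HYPOTHETICAL placeholders, not the study's data;
# they were picked only to roughly reproduce the reported 68% vs 72%.
from statsmodels.stats.contingency_tables import mcnemar

# 2x2 table of paired outcomes over the same 108 questions:
#   rows = with-image condition  (correct, incorrect)
#   cols = text-only condition   (correct, incorrect)
table = [
    [65, 8],   # correct with images:   correct / incorrect without images
    [13, 22],  # incorrect with images: correct / incorrect without images
]

# exact=True runs a binomial test on the 8 + 13 = 21 discordant pairs.
result = mcnemar(table, exact=True)
print(f"McNemar exact P = {result.pvalue:.2f}")
```

With these placeholder counts the test returns a nonsignificant P of about .38, in line with the paper's reported P=.36, but the true split of discordant pairs is not recoverable from the summary alone.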
The researchers conclude that GPT-4V is not yet capable of effectively interpreting medical images and should not be relied upon as a primary source of information for medical education or practice. They emphasize the need for further research and the development of domain-specific models for medical image interpretation.