February 7, 2024 | Dimitrios P. Panagoulias, Maria Virvou, George A. Tsihrintzis
This paper evaluates the performance of large language models (LLMs) in generating multimodal diagnoses from medical images and symptom descriptions. The study uses GPT-4-Vision-Preview to test whether the model can correctly diagnose conditions posed as structured multiple-choice questions (MCQs) in the domain of Pathology. The evaluation proceeds in two steps: (1) multimodal LLM evaluation via structured interactions, and (2) follow-up, domain-specific analysis of the data extracted from those interactions. GPT-4-Vision-Preview answered approximately 84% of the Pathology MCQs correctly. The study complements this headline figure with a detailed analysis of the model's performance using Image Metadata Analysis (IMA), Named Entity Recognition (NER), and Knowledge Graphs (KGs), which exposed weaknesses along specific knowledge paths and pinpointed the areas where the model falls short. The methodology is not tied to GPT-4-Vision-Preview and can be applied to assess the usefulness and accuracy of other LLMs. The study underscores the importance of transparency, rigorous evaluation, and continuous improvement in AI systems, particularly in medical applications, where errors can have serious consequences. The results demonstrate the potential of LLMs in medical diagnosis and the need for further research to improve their performance and reliability.
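The summary does not reproduce the authors' evaluation harness, but the first step of the pipeline is straightforward to sketch. Below is a minimal Python illustration of structured MCQ interactions with a vision model, assuming the openai>=1.x client; the prompt wording, the `ask_mcq`/`accuracy` helpers, and the answer-letter parsing are hypothetical stand-ins, not the paper's code.

```python
import base64
from openai import OpenAI  # assumes the openai>=1.x Python client

client = OpenAI()

def ask_mcq(image_path: str, question: str, choices: list[str]) -> str:
    """Send one pathology MCQ plus its image to the vision model
    and return the model's raw answer text."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    prompt = question + "\n" + "\n".join(
        f"{label}. {text}" for label, text in zip("ABCD", choices)
    ) + "\nAnswer with the letter of the single best diagnosis."
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        max_tokens=100,  # this model truncates aggressively without it
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

def accuracy(predictions: list[str], gold: list[str]) -> float:
    """Fraction of MCQs where the answer starts with the gold letter,
    i.e. the metric behind the ~84% figure reported above."""
    hits = sum(p.strip().upper().startswith(g) for p, g in zip(predictions, gold))
    return hits / len(gold)
```

The second, follow-up step can be sketched in the same spirit. The fragment below uses spaCy and networkx purely as stand-ins (the paper's actual NER model and graph tooling are not specified here; a biomedical pipeline such as scispaCy would likely be closer to the authors' setup), and `build_kg`/`weak_entities` are hypothetical names. The idea it illustrates is the one the summary describes: link each question to the entities named in the model's answer, then surface entities whose linked questions were mostly answered wrongly, i.e. weak knowledge paths.

```python
import spacy
import networkx as nx

# Generic English model as a placeholder; requires
# `python -m spacy download en_core_web_sm` once.
nlp = spacy.load("en_core_web_sm")

def build_kg(records: list[dict]) -> nx.Graph:
    """Build a bipartite graph of questions and answer entities.
    Each record is assumed to carry {"id", "answer_text", "correct"}."""
    g = nx.Graph()
    for rec in records:
        g.add_node(rec["id"], kind="question", correct=rec["correct"])
        for ent in nlp(rec["answer_text"]).ents:
            g.add_node(ent.text.lower(), kind="entity")
            g.add_edge(rec["id"], ent.text.lower())
    return g

def weak_entities(g: nx.Graph, min_questions: int = 3):
    """Yield entities whose linked questions were mostly wrong:
    candidate weak spots in the model's domain knowledge."""
    for node, data in g.nodes(data=True):
        if data.get("kind") != "entity":
            continue
        qs = list(g.neighbors(node))
        if len(qs) >= min_questions:
            wrong = sum(not g.nodes[q]["correct"] for q in qs)
            if wrong / len(qs) > 0.5:
                yield node, wrong, len(qs)
```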