28 Jan 2024 | Dimitrios P. Panagoulias, Maria Virvou, George A. Tsihrintzis
This paper evaluates the effectiveness of large language models (LLMs) in generating multimodal diagnoses from medical images and symptom analysis. The authors propose a novel evaluation paradigm consisting of two main steps: (1) multimodal LLM evaluation via structured interactions and (2) domain-specific analysis of the data extracted from those interactions. Using this methodology, they assess the correctness and accuracy of LLM-generated medical diagnoses on publicly available multiple-choice questions (MCQs) in the domain of Pathology. The LLM under study, GPT-4-Vision-Preview, performed well, achieving approximately 84% correct diagnoses. The authors then conduct a comprehensive analysis of the findings using Image Metadata Analysis (IMA), Named Entity Recognition (NER), and Knowledge Graphs (KG). This analysis reveals weaknesses along specific knowledge paths, pinpointing the areas in which the LLM falls short. The methodology and findings are not limited to GPT-4-Vision-Preview and can be applied to evaluate other LLMs and improve their performance in medical diagnosis. The paper also discusses the potential of this approach for future research and optimization in the medical domain.
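As a rough illustration of step (1), the sketch below sends an image-based pathology MCQ to a vision-capable LLM and scores the returned option against the answer key. It assumes the OpenAI Python client (v1.x), base64-encoded images, and single-letter options; the prompt wording, item format, and scoring helper are illustrative and not the authors' exact evaluation protocol.

```python
# Minimal sketch of structured multimodal MCQ evaluation (assumptions noted above).
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask_mcq(image_path: str, stem: str, options: dict[str, str]) -> str:
    """Send one image-based MCQ and return the model's chosen option letter."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    option_text = "\n".join(f"{k}. {v}" for k, v in options.items())
    prompt = f"{stem}\n{option_text}\nAnswer with the single letter of the best option."
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
        max_tokens=5,
    )
    return response.choices[0].message.content.strip()[0].upper()


def accuracy(items: list[dict]) -> float:
    """Fraction of MCQs answered correctly; item format is hypothetical."""
    correct = sum(
        ask_mcq(it["image"], it["question"], it["options"]) == it["answer"]
        for it in items
    )
    return correct / len(items)
```

The per-item model responses collected this way would then feed the second step (IMA, NER, and knowledge-graph analysis) described in the abstract.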