21 Feb 2024 | Minh-Hao Van, Prateek Verma, Xintao Wu
This paper explores the application of large visual language models (VLMs) in medical imaging analysis. VLMs, such as CLIP, Flamingo, LLaVA, and ChatGPT-4, have shown impressive performance in various vision-linguistic tasks. The study evaluates the zero-shot and few-shot robustness of VLMs on medical imaging tasks, including brain MRIs, microscopic blood cell images, and chest X-rays. The results show that VLMs can effectively analyze medical images without requiring retraining or fine-tuning.
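To make the zero-shot setting concrete, the sketch below shows how a CLIP-style model classifies a medical image by comparing its embedding against natural-language label prompts. This is a minimal illustration, assuming the open_clip library and the publicly released BiomedCLIP checkpoint; the prompt wording and image path are hypothetical, and this is not the paper's actual evaluation code.

```python
# Minimal zero-shot classification sketch (not the paper's code). Assumes
# open_clip and the public BiomedCLIP checkpoint; prompts and file path
# are hypothetical.
import torch
from PIL import Image
from open_clip import create_model_from_pretrained, get_tokenizer

ckpt = "hf-hub:microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224"
model, preprocess = create_model_from_pretrained(ckpt)
tokenizer = get_tokenizer(ckpt)
model.eval()

# Candidate classes expressed as natural-language prompts (assumed wording).
prompts = ["an MRI scan of a brain with a tumor",
           "an MRI scan of a healthy brain"]

image = preprocess(Image.open("brain_mri.png")).unsqueeze(0)  # hypothetical file
texts = tokenizer(prompts, context_length=256)

with torch.no_grad():
    img_emb = model.encode_image(image)
    txt_emb = model.encode_text(texts)
    # Normalize, then score each prompt by cosine similarity to the image;
    # the most similar prompt is the predicted class.
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    probs = (100.0 * img_emb @ txt_emb.T).softmax(dim=-1)

for prompt, p in zip(prompts, probs.squeeze(0).tolist()):
    print(f"{p:.3f}  {prompt}")
```

No gradient step touches the model: the "classifier" is just the set of prompts, which is what allows evaluation without retraining or fine-tuning.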
The study compares five VLMs (BiomedCLIP, OpenCLIP, OpenFlamingo, LLaVA, ChatGPT-4) against two CNN-based baselines (a standard CNN and ResNet-18) on three medical imaging datasets (BTD, ALL-IDB2, CX-Ray). The results indicate that the supervised CNN-based methods achieve the best performance on all datasets, but the VLMs remain competitive using zero-shot and few-shot prompting alone. Among the VLMs, BiomedCLIP, which is trained on a domain-specific dataset for biomedical imaging, achieves the best overall performance.
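For contrast with the prompting-only VLMs, a supervised baseline like ResNet-18 must be trained on each target dataset. The sketch below is an assumed, minimal transfer-learning setup in PyTorch, not the paper's exact configuration; the binary tumor/no-tumor head and the hyperparameters are illustrative.

```python
# Assumed supervised baseline sketch: fine-tuning ResNet-18 for binary
# classification (e.g., tumor vs. no tumor). Hyperparameters are illustrative.
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, 2)  # replace the 1000-way ImageNet head

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # assumed learning rate

def train_one_epoch(loader):
    """One pass over a DataLoader yielding (image batch, label batch)."""
    model.train()
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```

Unlike the VLMs, this baseline sees labeled training examples from the target dataset, which helps explain the reported edge of the CNN-based methods.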
The study also discusses the limitations of VLMs in medical imaging, including data quality, response safety, and potential privacy issues. While VLMs can provide useful insights and pre-diagnosis, they are not yet ready to replace human experts in medical imaging analysis. The study highlights the potential of VLMs as chat assistants that offer a pre-diagnosis before clinicians make final decisions. Future work will explore more tasks, such as segmentation, using state-of-the-art VLMs specifically trained for biomedical imaging analysis.