OmniMedVQA is a new large-scale, comprehensive evaluation benchmark for medical Large Vision-Language Models (LVLMs). The benchmark is collected from 73 different medical datasets, spanning 12 imaging modalities and more than 20 distinct anatomical regions. All images are sourced from authentic medical scenarios, ensuring alignment with the requirements of the medical field and suitability for evaluating LVLMs. Extensive experiments show that existing LVLMs struggle to address these medical VQA problems effectively; moreover, medical-specialized LVLMs can even exhibit inferior performance to general-domain models, highlighting the need for more versatile and robust LVLMs in the biomedical field. The evaluation results reveal the current limitations of LVLMs in understanding real medical images and underscore the significance of the dataset. The code and dataset are available at https://github.com/OpenGVLab/Multi-Modality-Arena.
OmniMedVQA is designed to evaluate the performance of LVLMs in the medical domain. It includes images from 12 different modalities, such as MRI, CT, X-Ray, histopathology, and fundus photography, resulting in a highly diverse dataset. It also covers over 20 distinct human anatomical regions, enabling a more comprehensive evaluation of different LVLMs. The dataset contains 118,010 images with 127,995 test items, making it a large-scale evaluation benchmark.
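To make the benchmark's structure concrete, here is a minimal sketch of how a single multiple-choice test item could be represented. The field names and the sample item are illustrative assumptions for this summary, not the dataset's actual JSON schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VQAItem:
    """One multiple-choice medical VQA test item.

    Field names are illustrative assumptions, not the dataset's actual schema.
    """
    image_path: str     # path to the medical image
    modality: str       # e.g. "MRI", "CT", "X-Ray"
    anatomy: str        # e.g. "brain", "chest"
    question: str       # natural-language question about the image
    options: List[str]  # candidate answers shown to the model
    answer: str         # ground-truth option


# A hypothetical item, for illustration only.
item = VQAItem(
    image_path="images/chest_xray_0001.png",
    modality="X-Ray",
    anatomy="chest",
    question="Which finding is visible in this radiograph?",
    options=["Pneumothorax", "Cardiomegaly", "No finding", "Pleural effusion"],
    answer="Cardiomegaly",
)
print(item.modality, item.answer)
```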
In the evaluation, twelve representative models were assessed: eight general-domain LVLMs and four medical-specialized LVLMs. The results show that medical-specialized LVLMs outperform general-domain LVLMs on some specific modalities but fail to do so consistently across all modalities. The evaluation also highlights the need for a robust model that can effectively align image-text pairs in the medical field.
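Because the comparison above is made per modality, a simple way to reproduce that view is to group multiple-choice accuracy by imaging modality. The sketch below is an assumed evaluation helper with made-up toy predictions, not the benchmark's official scoring code.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_modality(results: Iterable[Tuple[str, str, str]]) -> Dict[str, float]:
    """Compute multiple-choice accuracy grouped by imaging modality.

    `results` yields (modality, predicted_option, ground_truth_option) triples,
    mirroring the per-modality comparison described above.
    """
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for modality, predicted, ground_truth in results:
        total[modality] += 1
        if predicted.strip().lower() == ground_truth.strip().lower():
            correct[modality] += 1
    return {m: correct[m] / total[m] for m in total}


# Toy example with made-up predictions, for illustration only.
toy_results = [
    ("MRI", "Glioma", "Glioma"),
    ("MRI", "Meningioma", "Glioma"),
    ("Fundus Photography", "Glaucoma", "Glaucoma"),
]
print(accuracy_by_modality(toy_results))  # {'MRI': 0.5, 'Fundus Photography': 1.0}
```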
The paper also discusses related work, including the development of LVLMs for the medical field and existing medical VQA datasets. It highlights the challenges of evaluating LVLMs in the medical domain and the importance of a comprehensive evaluation benchmark like OmniMedVQA. The paper concludes that medical-specialized LVLMs do not deliver outstanding performance and that they need more knowledge of specific modalities to become more effective. The dataset provides a comprehensive evaluation benchmark for medical LVLMs and offers useful insights for future research.