11 Jun 2024 | Yao Jiang, Xinyu Yan, Ge-Peng Ji, Keren Fu, Meijun Sun, Huan Xiong, Deng-Ping Fan, Fahad Shahbaz Khan
This paper evaluates the effectiveness of recent large vision-language models (LVLMs) in both specialized and general tasks. The study assesses three open-source LVLMs—MiniGPT-v2, LLaVA-1.5, and Shikra—alongside GPT-4V, across six specialized tasks spanning natural, healthcare, and industrial scenarios: salient/transparent/camouflaged object detection, polyp detection, skin lesion detection, and industrial anomaly detection. Additionally, the models are tested on general tasks such as object counting, absurd question answering, affordance reasoning, attribute recognition, and spatial relation reasoning using the COCO dataset.
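In practice, such an evaluation amounts to issuing task-specific prompts to each model over benchmark images and scoring the free-form responses. Below is a minimal, hypothetical sketch of one such query against LLaVA-1.5 through the Hugging Face transformers API; the checkpoint name, prompt wording, and image path are illustrative assumptions rather than the authors' actual harness.

```python
# Minimal sketch of a single evaluation query (not the authors' harness):
# prompt LLaVA-1.5 with an image and a specialized-task question, e.g.
# camouflaged object detection, and read back its free-form answer.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed Hugging Face checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("example.jpg")  # placeholder benchmark image
# Illustrative prompt in LLaVA-1.5's conversation format.
prompt = ("USER: <image>\nIs there a camouflaged animal in this image? "
          "If so, describe where it is. ASSISTANT:")

inputs = processor(images=image, text=prompt, return_tensors="pt").to(
    model.device, torch.float16
)
output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```

Scaling this to the paper's specialized tasks would then reduce to looping such queries over the relevant benchmark images and comparing the answers against ground-truth labels or masks.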
The results show that while these models demonstrate potential in specialized tasks, they exhibit limited proficiency and inadequate transferability. Key limitations include limited cognition in specialized tasks, object hallucination, text-to-image interference, and decreased robustness in complex problems. In general tasks, the models also show suboptimal performance, particularly in object counting, spatial reasoning, and absurd question answering.
The study highlights the need for further improvements in LVLMs, including better prompt engineering, domain-specific fine-tuning, and mitigation of hallucination and other issues. It also suggests integrating additional visual information, such as depth and focus cues, to enhance their perceptual capabilities. The findings indicate that while LVLMs have shown promise in certain tasks, they still fall short in real-world applications, especially in critical domains like healthcare and industry. The study provides insights for future research and development of LVLMs to improve their performance in both specialized and general tasks.