11 Jun 2024 | Yao Jiang†, Xinyu Yan†, Ge-Peng Ji, Keren Fu*, Meijun Sun, Huan Xiong*, Deng-Ping Fan†, Fahad Shahbaz Khan
This paper evaluates the effectiveness of recent large vision-language models (LVLMs) in both specialized and general tasks. The authors assess six challenging tasks in three application scenarios—natural, healthcare, and industrial—using three open-source LVLMs: MiniGPT-v2, LLaVA-1.5, and Shikra. They also conduct empirical investigations on a universal dataset (COCO) to evaluate the multi-modal understanding capabilities of these models in general tasks, including object counting, absurd question answering, affordance reasoning, attribute recognition, and spatial relation reasoning. The results reveal that while LVLMs show promise in specialized tasks, they exhibit limited proficiency and cognitive capabilities. Specific issues include object hallucination, text-to-image interference, and decreased robustness in complex problems. In general tasks, LVLMs also show significant room for improvement, particularly in object counting, spatial reasoning, and absurd question answering. The study highlights the need for further research to enhance the transferability and robustness of LVLMs, addressing limitations such as limited cognition, object hallucination, and text-to-image interference.