20 Jun 2024 | Yuxuan Qiao, Haodong Duan, Xinyu Fang, Junming Yang, Lin Chen, Songyang Zhang, Jiaqi Wang, Dahua Lin, Kai Chen
Prism is a framework for decoupling and assessing the perception and reasoning capabilities of Vision Language Models (VLMs). It consists of two stages: a perception stage in which a VLM extracts and describes the visual information in an image, and a reasoning stage in which a Large Language Model (LLM) generates the answer from that textual description. Separating the two capabilities lets Prism systematically compare VLMs' strengths in perception versus reasoning on tasks such as visual question answering.

Beyond evaluation, Prism doubles as an efficient vision-language task solver: pairing a streamlined VLM with a powerful LLM yields strong results on general vision-language tasks while reducing training and operational costs. Quantitative evaluations show that Prism with a 2B LLaVA captioner and GPT-3.5 performs on par with much larger VLMs on the MMStar benchmark. The decoupled analysis also reveals that proprietary models such as GPT-4o excel at perception, while open-source VLMs show roughly constant perception performance regardless of the size of their language backbone. This modular design makes Prism a flexible tool for both assessing and improving VLMs.
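As a rough sketch of the decoupled pipeline described above, the two stages can be wired together as follows. The `caption_model` and `reasoning_model` callables and the captioning instruction are illustrative placeholders, not Prism's actual API.

```python
from typing import Callable

def prism_pipeline(
    image_path: str,
    question: str,
    caption_model: Callable[[str, str], str],   # VLM: (image, instruction) -> description
    reasoning_model: Callable[[str], str],      # LLM: (prompt) -> answer
) -> str:
    """Two-stage answer generation: perceive with a VLM, reason with an LLM."""
    # Stage 1 (perception): ask the VLM to turn the image into a textual description.
    description = caption_model(image_path, "Describe the image in detail.")

    # Stage 2 (reasoning): hand the description plus the original question to the LLM.
    prompt = (
        f"Image description:\n{description}\n\n"
        f"Question: {question}\n"
        "Answer the question using only the description above."
    )
    return reasoning_model(prompt)
```

In the configuration highlighted above, `caption_model` would be a lightweight 2B LLaVA captioner and `reasoning_model` a stronger LLM such as GPT-3.5; because either component can be swapped independently, errors can be attributed to perception or to reasoning.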