20 Jun 2024 | Yuxuan Qiao, Haodong Duan, Xinyu Fang, Junming Yang, Lin Chen, Songyang Zhang, Jiaqi Wang, Dahua Lin, Kai Chen
Prism is a framework for decoupling and assessing the perception and reasoning capabilities of Vision Language Models (VLMs). It consists of two stages: a perception stage in which a VLM extracts and describes the visual information in an image, and a reasoning stage in which a Large Language Model (LLM) generates the answer from that textual description. Separating the two capabilities lets Prism systematically compare VLMs' strengths in perception versus reasoning on tasks such as visual question answering.

Beyond evaluation, Prism doubles as an efficient vision-language task solver: pairing a streamlined VLM with a powerful LLM yields strong results on general vision-language tasks while reducing training and operational costs. Quantitative evaluations show that Prism with a 2B LLaVA captioner and GPT-3.5 performs on par with much larger VLMs on the MMStar benchmark. The decoupled analysis also reveals that proprietary models such as GPT-4o excel at perception, while open-source VLMs show roughly constant perception performance regardless of the size of their language backbone. This modular design makes Prism a flexible tool for both assessing and improving VLMs.
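As a rough sketch of the decoupled pipeline described above, the two stages can be wired together as follows. The `caption_model` and `reasoning_model` callables and the captioning instruction are illustrative placeholders, not Prism's actual API.

```python
from typing import Callable

def prism_pipeline(
    image_path: str,
    question: str,
    caption_model: Callable[[str, str], str],   # VLM: (image, instruction) -> description
    reasoning_model: Callable[[str], str],      # LLM: (prompt) -> answer
) -> str:
    """Two-stage answer generation: perceive with a VLM, reason with an LLM."""
    # Stage 1 (perception): ask the VLM to turn the image into a textual description.
    description = caption_model(image_path, "Describe the image in detail.")

    # Stage 2 (reasoning): hand the description plus the original question to the LLM.
    prompt = (
        f"Image description:\n{description}\n\n"
        f"Question: {question}\n"
        "Answer the question using only the description above."
    )
    return reasoning_model(prompt)
```

In the configuration highlighted above, `caption_model` would be a lightweight 2B LLaVA captioner and `reasoning_model` a stronger LLM such as GPT-3.5; because either component can be swapped independently, errors can be attributed to perception or to reasoning.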