Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models

2024 | Siddharth Karamcheti, Suraj Nair, Ashwin Balakrishna, Percy Liang, Thomas Kollar, Dorsa Sadigh
This paper investigates the design space of visually-conditioned language models (VLMs), aiming to understand the key design decisions that drive their performance. The authors compile a standardized evaluation suite spanning visual question answering, object localization, and challenge tasks that probe properties such as hallucination, and they study VLMs along key design axes: pretrained visual representations, training from base versus instruct-tuned language models, and scaling properties. They provide three resource contributions: a unified evaluation framework, optimized and flexible training code, and checkpoints for all models, including a family of VLMs at the 7B-13B scale that outperform InstructBLIP and LLaVA v1.5.

Key findings include: single-stage training reduces compute cost without harming performance; fused visual representations that combine DINOv2 and SigLIP features improve performance; base language models such as Llama-2 match or exceed their instruct-tuned counterparts, with co-training on language-only data important for safety; and adding diverse data and extending training time significantly boosts performance.

Building on these findings, the authors present Prisms, a new family of VLMs that outperforms state-of-the-art open VLMs. They discuss limitations of the work, including the generality of their model architecture and the evaluation's focus on standardized metrics, and they emphasize the importance of open data, open training code, and open evaluation code for VLM development. They also highlight the risks and biases associated with VLMs, including the potential to generate toxic or unsafe content and the biases inherited from the training data and pretrained language models, noting the role of safety data in preventing harmful outputs. The authors conclude that their work provides a foundation for future research in training and evaluating VLMs.
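As a rough illustration of what a "fused" visual representation means here, the sketch below concatenates per-patch features from two vision backbones (e.g., DINOv2 and SigLIP) along the channel dimension and projects them into the language model's embedding space. The feature dimensions and the two-layer MLP projector are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


class FusedVisualProjector(nn.Module):
    """Minimal sketch: fuse patch features from two vision backbones and
    project them into the language model's embedding space.
    All dimensions below are assumed for illustration."""

    def __init__(self, dino_dim: int = 1024, siglip_dim: int = 1152, llm_dim: int = 4096):
        super().__init__()
        # Simple MLP projector mapping fused patch features to "soft tokens"
        # that the language model can attend to alongside text tokens.
        self.projector = nn.Sequential(
            nn.Linear(dino_dim + siglip_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, dino_patches: torch.Tensor, siglip_patches: torch.Tensor) -> torch.Tensor:
        # dino_patches:   (batch, num_patches, dino_dim)
        # siglip_patches: (batch, num_patches, siglip_dim)
        # Concatenate per-patch features channel-wise, then project.
        fused = torch.cat([dino_patches, siglip_patches], dim=-1)
        return self.projector(fused)  # (batch, num_patches, llm_dim)
```

A usage note: both backbones are assumed to produce the same number of patches per image (e.g., by using matching input resolutions), so the per-patch concatenation lines up; the resulting projected tokens are then prepended or interleaved with the text token embeddings fed to the language model.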