Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models

2024 | Siddharth Karamcheti, Suraj Nair, Ashwin Balakrishna, Percy Liang, Thomas Kollar, Dorsa Sadigh
This paper investigates the design space of visually-conditioned language models (VLMs), aiming to understand the key design decisions that drive their performance. The authors compile a standardized evaluation suite spanning visual question answering, object localization, and challenge tasks that probe properties such as hallucination, and they study VLMs along key design axes: pretrained visual representations, training from base versus instruct-tuned language models, and scaling properties. They provide three resource contributions: a unified evaluation framework, optimized and flexible training code, and checkpoints for all models, including a family of VLMs at the 7B-13B scale that outperform InstructBLIP and LLaVA v1.5.

Key findings include: single-stage training reduces compute cost without harming performance; fused visual representations that combine DINOv2 and SigLIP features improve performance; base language models such as Llama-2 match or exceed their instruct-tuned counterparts, with co-training on language-only data important for safety; and adding diverse data and extending training time significantly boosts performance.

Building on these findings, the authors present Prisms, a new family of VLMs that outperforms state-of-the-art open VLMs. They discuss limitations of the work, including the generality of their model architecture and the evaluation's focus on standardized metrics, and they emphasize the importance of open data, open training code, and open evaluation code for VLM development. They also highlight the risks and biases associated with VLMs, including the potential to generate toxic or unsafe content and the biases inherited from the training data and pretrained language models, noting the role of safety data in preventing harmful outputs. The authors conclude that their work provides a foundation for future research in training and evaluating VLMs.
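As a rough illustration of what a "fused" visual representation means here, the sketch below concatenates per-patch features from two vision backbones (e.g., DINOv2 and SigLIP) along the channel dimension and projects them into the language model's embedding space. The feature dimensions and the two-layer MLP projector are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


class FusedVisualProjector(nn.Module):
    """Minimal sketch: fuse patch features from two vision backbones and
    project them into the language model's embedding space.
    All dimensions below are assumed for illustration."""

    def __init__(self, dino_dim: int = 1024, siglip_dim: int = 1152, llm_dim: int = 4096):
        super().__init__()
        # Simple MLP projector mapping fused patch features to "soft tokens"
        # that the language model can attend to alongside text tokens.
        self.projector = nn.Sequential(
            nn.Linear(dino_dim + siglip_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, dino_patches: torch.Tensor, siglip_patches: torch.Tensor) -> torch.Tensor:
        # dino_patches:   (batch, num_patches, dino_dim)
        # siglip_patches: (batch, num_patches, siglip_dim)
        # Concatenate per-patch features channel-wise, then project.
        fused = torch.cat([dino_patches, siglip_patches], dim=-1)
        return self.projector(fused)  # (batch, num_patches, llm_dim)
```

A usage note: both backbones are assumed to produce the same number of patches per image (e.g., by using matching input resolutions), so the per-patch concatenation lines up; the resulting projected tokens are then prepended or interleaved with the text token embeddings fed to the language model.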