BRAVE: Broadening the visual encoding of vision-language models


10 Apr 2024 | Oğuzhan Fatih Kar¹², Alessio Tonioni¹, Petra Poklukar¹, Achin Kulshrestha¹, Amir Zamir², Federico Tombari¹
BRAVE is a method for broadening the visual encoding of vision-language models (VLMs) by combining features from multiple vision encoders into a more versatile and compact representation. Unlike existing approaches that rely on a single vision encoder, BRAVE integrates the diverse features of several encoders, achieving state-of-the-art performance on captioning and visual question answering (VQA) tasks and substantially improving results on benchmarks such as MMVP, where widely used encoders like CLIP fail.

At its core, BRAVE uses a lightweight multi-encoder querying transformer (MEQ-Former) to efficiently combine the features coming from the different encoders. This keeps the number of trainable parameters small while improving robustness against visual hallucinations and out-of-distribution inputs. The method achieves strong results on a wide range of tasks, including COCO captioning and VQA benchmarks, while using fewer trainable parameters and less training data than previous approaches.

The study highlights the importance of combining diverse visual biases to improve both the performance and the robustness of VLMs, and shows that scaling along the vision axis can be as effective as scaling along the language axis. The method is flexible and can be combined with other efforts to further improve performance. The authors also discuss limitations, including the need for adaptive mechanisms for selecting encoders, better sample efficiency, and the exploration of a more diverse set of vision encoders.

Overall, BRAVE offers a promising approach to enhancing the visual capabilities of VLMs and improving their performance across a wide range of tasks.
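To make the core idea concrete, the PyTorch sketch below shows how a small set of learnable queries can cross-attend to the concatenated token features of several frozen vision encoders and produce a fixed-length sequence of visual tokens for a language model. This is a minimal, hypothetical sketch: the class name MultiEncoderQuerying, the single attention block, and all dimensions are illustrative assumptions and do not reproduce the paper's actual MEQ-Former architecture or training setup.

# Minimal, illustrative sketch of combining features from multiple frozen
# vision encoders with a small set of learnable queries (BRAVE-inspired,
# not the authors' implementation). Names and dimensions are assumptions.

import torch
import torch.nn as nn


class MultiEncoderQuerying(nn.Module):
    """Learnable queries cross-attend to concatenated features from several
    vision encoders and return a fixed-length visual prefix for a language model."""

    def __init__(self, encoder_dims, hidden_dim=768, num_queries=32,
                 num_heads=8, lm_dim=2048):
        super().__init__()
        # One linear projection per encoder to map its features to a shared width.
        self.projections = nn.ModuleList(
            [nn.Linear(d, hidden_dim) for d in encoder_dims]
        )
        # A fixed number of learnable query tokens acts as a resampling bottleneck.
        self.queries = nn.Parameter(torch.randn(num_queries, hidden_dim) * 0.02)
        # A single cross-attention block stands in for a full transformer stack.
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden_dim)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, 4 * hidden_dim),
            nn.GELU(),
            nn.Linear(4 * hidden_dim, hidden_dim),
        )
        # Project the resampled tokens to the language model's embedding width.
        self.to_lm = nn.Linear(hidden_dim, lm_dim)

    def forward(self, encoder_features):
        # encoder_features: list of tensors, each (batch, tokens_k, dim_k),
        # one per vision encoder; token counts and widths may differ.
        projected = [proj(f) for proj, f in zip(self.projections, encoder_features)]
        context = torch.cat(projected, dim=1)             # (batch, total tokens, hidden)
        queries = self.queries.unsqueeze(0).expand(context.shape[0], -1, -1)
        attended, _ = self.cross_attn(queries, context, context)
        x = self.norm(queries + attended)
        x = x + self.mlp(x)
        return self.to_lm(x)                              # (batch, num_queries, lm_dim)


if __name__ == "__main__":
    # Toy example: three hypothetical encoders with different widths and token counts.
    feats = [torch.randn(2, 196, 1024), torch.randn(2, 256, 768), torch.randn(2, 144, 1408)]
    module = MultiEncoderQuerying(encoder_dims=[1024, 768, 1408])
    print(module(feats).shape)  # torch.Size([2, 32, 2048])

In this toy setup, only the small querying module is trainable, which mirrors the motivation stated above: the heavy vision encoders stay frozen, and the fixed-length query output keeps the interface to the language model compact regardless of how many encoders are combined.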