10 Apr 2024 | Oğuzhan Fatih Kar, Alessio Tonioni, Petra Poklukar, Achin Kulshrestha, Amir Zamir, Federico Tombari
**BRAVE 🦁: Broadening the Visual Encoding of Vision-Language Models**
**Authors:** Oğuzhan Fatih Kar, Alessio Tonioni, Petra Poklukar, Achin Kulshrestha, Amir Zamir, Federico Tombari
**Institution:** Google, Swiss Federal Institute of Technology Lausanne (EPFL)
**Abstract:**
Vision-language models (VLMs) are typically built from a vision encoder and a language model, and their perception is constrained by the limited capabilities and biases of any single vision encoder. To address this, the authors propose BRAVE, a method that combines features from multiple vision encoders into a more versatile and compact representation. BRAVE achieves state-of-the-art performance on captioning and visual question answering (VQA) tasks and significantly improves robustness against out-of-distribution inputs and visual hallucinations. The method uses a lightweight multi-encoder querying transformer (MEQ-Former) to efficiently fuse features from the different vision encoders, reducing the number of trainable parameters while maintaining or improving performance.
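As a concrete illustration of the idea described above, here is a minimal, hypothetical PyTorch sketch of a multi-encoder querying module: a set of learnable queries cross-attends to the concatenated features of several frozen vision encoders and returns a short, fixed-length sequence of visual tokens. Class and parameter names, dimensions, and the single-block depth are assumptions for illustration, not the paper's exact MEQ-Former architecture.

```python
# Illustrative sketch only: names, dimensions, and depth are assumptions,
# not the paper's exact MEQ-Former implementation.
import torch
import torch.nn as nn

class MEQFormerSketch(nn.Module):
    """Learnable queries cross-attend to concatenated features from several
    frozen vision encoders and emit a fixed-length set of visual tokens."""

    def __init__(self, encoder_dims, d_model=768, num_queries=32, num_heads=8):
        super().__init__()
        # One linear projection per encoder to map its features to a shared width.
        self.projections = nn.ModuleList(
            [nn.Linear(d, d_model) for d in encoder_dims]
        )
        self.queries = nn.Parameter(torch.randn(num_queries, d_model) * 0.02)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, encoder_features):
        # encoder_features: list of (batch, tokens_i, dim_i) tensors,
        # one per frozen vision encoder.
        projected = [proj(f) for proj, f in zip(self.projections, encoder_features)]
        context = torch.cat(projected, dim=1)                # (batch, sum_i tokens_i, d_model)
        q = self.queries.unsqueeze(0).expand(context.size(0), -1, -1)
        attended, _ = self.cross_attn(q, context, context)   # queries pool the concatenated features
        x = self.norm1(q + attended)
        return self.norm2(x + self.ffn(x))                   # (batch, num_queries, d_model)
```

The design point this sketch captures is that the output length is fixed by the number of queries, so adding more encoders grows the cross-attention context but not the sequence handed to the language model.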
**Key Contributions:**
- Comprehensive evaluation of different vision encoders on VLM tasks.
- Introduction of BRAVE, a method that combines features from multiple vision encoders into a compressed and contextual representation.
- State-of-the-art performance on captioning and VQA benchmarks.
- Significantly improved robustness against visual hallucinations and out-of-distribution inputs.
- Efficiency in the number of trainable parameters.
**Methods:**
- **BRAVE:** Combines features from multiple vision encoders through the MEQ-Former and feeds the resulting representation to the language model.
- **MEQ-Former:** A lightweight multi-encoder querying transformer that resamples and fuses the encoders' features into a compact, fixed-length sequence, keeping the number of trainable parameters small (see the sketch after this list).
- **Evaluation:** Extensive experiments on captioning and VQA benchmarks, showing consistent improvements over state-of-the-art methods.
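To make the parameter-efficiency point concrete, the following hedged sketch assumes that the vision encoders and the language model stay frozen and only the resampler plus a small output projection are trained; `vision_encoders`, `build_visual_prefix`, `to_lm_dim`, and the feature widths are stand-in names and values, not the authors' code.

```python
# Hedged usage sketch under assumed names and dimensions; not the paper's code.
import torch
import torch.nn as nn

def build_visual_prefix(images, vision_encoders, meq_former, to_lm_dim):
    """Encode an image batch with several frozen encoders, then resample the
    concatenated features into a short visual prefix for the language model."""
    with torch.no_grad():                      # vision encoders stay frozen
        features = [enc(images) for enc in vision_encoders]
    tokens = meq_former(features)              # (batch, num_queries, d_model)
    return to_lm_dim(tokens)                   # project to the LM embedding width

# Only the resampler and its output projection contribute trainable parameters;
# MEQFormerSketch is the illustrative module from the earlier sketch, and the
# encoder feature widths and LM embedding size below are assumed values.
meq_former = MEQFormerSketch(encoder_dims=[1024, 1152, 768])
to_lm_dim = nn.Linear(768, 4096)
optimizer = torch.optim.AdamW(
    list(meq_former.parameters()) + list(to_lm_dim.parameters()), lr=1e-4
)
```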
**Results:**
- BRAVE achieves the best results on several captioning and VQA benchmarks.
- Significantly improves robustness against out-of-distribution inputs and visual hallucinations.
- Uses significantly fewer trainable parameters compared to other methods.
**Conclusion:**
BRAVE broadens the visual capabilities of VLMs by combining diverse features from multiple vision encoders, leading to improved performance and robustness.