10 Apr 2024 | Oğuzhan Fatih Kar, Alessio Tonioni, Petra Poklukar, Achin Kulshrestha, Amir Zamir, Federico Tombari
**BRAVE 🦁: Broadening the Visual Encoding of Vision-Language Models**
**Authors:** Oğuzhan Fatih Kar, Alessio Tonioni, Petra Poklukar, Achin Kulshrestha, Amir Zamir, Federico Tombari
**Institution:** Google, Swiss Federal Institute of Technology Lausanne (EPFL)
**Abstract:**
Vision-language models (VLMs) are typically built from a vision encoder and a language model, and their perception is constrained by the limited capabilities and biases of any single vision encoder. To address this, the authors propose BRAVE, a method that combines features from multiple vision encoders into a more versatile and compact representation. BRAVE achieves state-of-the-art performance on captioning and visual question answering (VQA) tasks and significantly improves robustness against out-of-distribution inputs and visual hallucinations. The method uses a lightweight multi-encoder querying transformer (MEQ-Former) to efficiently fuse features from the different vision encoders, reducing the number of trainable parameters while maintaining or improving performance.
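As a concrete illustration of the idea described above, here is a minimal, hypothetical PyTorch sketch of a multi-encoder querying module: a set of learnable queries cross-attends to the concatenated features of several frozen vision encoders and returns a short, fixed-length sequence of visual tokens. Class and parameter names, dimensions, and the single-block depth are assumptions for illustration, not the paper's exact MEQ-Former architecture.

```python
# Illustrative sketch only: names, dimensions, and depth are assumptions,
# not the paper's exact MEQ-Former implementation.
import torch
import torch.nn as nn

class MEQFormerSketch(nn.Module):
    """Learnable queries cross-attend to concatenated features from several
    frozen vision encoders and emit a fixed-length set of visual tokens."""

    def __init__(self, encoder_dims, d_model=768, num_queries=32, num_heads=8):
        super().__init__()
        # One linear projection per encoder to map its features to a shared width.
        self.projections = nn.ModuleList(
            [nn.Linear(d, d_model) for d in encoder_dims]
        )
        self.queries = nn.Parameter(torch.randn(num_queries, d_model) * 0.02)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, encoder_features):
        # encoder_features: list of (batch, tokens_i, dim_i) tensors,
        # one per frozen vision encoder.
        projected = [proj(f) for proj, f in zip(self.projections, encoder_features)]
        context = torch.cat(projected, dim=1)                # (batch, sum_i tokens_i, d_model)
        q = self.queries.unsqueeze(0).expand(context.size(0), -1, -1)
        attended, _ = self.cross_attn(q, context, context)   # queries pool the concatenated features
        x = self.norm1(q + attended)
        return self.norm2(x + self.ffn(x))                   # (batch, num_queries, d_model)
```

The design point this sketch captures is that the output length is fixed by the number of queries, so adding more encoders grows the cross-attention context but not the sequence handed to the language model.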
**Key Contributions:**
- Comprehensive evaluation of different vision encoders on VLM tasks.
- Introduction of BRAVE, a method that combines features from multiple vision encoders into a compressed and contextual representation.
- State-of-the-art performance on captioning and VQA benchmarks.
- Significantly improved robustness against visual hallucinations and out-of-distribution inputs.
- Efficiency in the number of trainable parameters.
**Methods:**
- **BRAVE:** Combines features from multiple vision encoders through the MEQ-Former and feeds the resulting representation to the language model.
- **MEQ-Former:** A lightweight multi-encoder querying transformer that resamples and fuses the encoders' features into a compact, fixed-length sequence, keeping the number of trainable parameters small (see the sketch after this list).
- **Evaluation:** Extensive experiments on captioning and VQA benchmarks, showing consistent improvements over state-of-the-art methods.
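To make the parameter-efficiency point concrete, the following hedged sketch assumes that the vision encoders and the language model stay frozen and only the resampler plus a small output projection are trained; `vision_encoders`, `build_visual_prefix`, `to_lm_dim`, and the feature widths are stand-in names and values, not the authors' code.

```python
# Hedged usage sketch under assumed names and dimensions; not the paper's code.
import torch
import torch.nn as nn

def build_visual_prefix(images, vision_encoders, meq_former, to_lm_dim):
    """Encode an image batch with several frozen encoders, then resample the
    concatenated features into a short visual prefix for the language model."""
    with torch.no_grad():                      # vision encoders stay frozen
        features = [enc(images) for enc in vision_encoders]
    tokens = meq_former(features)              # (batch, num_queries, d_model)
    return to_lm_dim(tokens)                   # project to the LM embedding width

# Only the resampler and its output projection contribute trainable parameters;
# MEQFormerSketch is the illustrative module from the earlier sketch, and the
# encoder feature widths and LM embedding size below are assumed values.
meq_former = MEQFormerSketch(encoder_dims=[1024, 1152, 768])
to_lm_dim = nn.Linear(768, 4096)
optimizer = torch.optim.AdamW(
    list(meq_former.parameters()) + list(to_lm_dim.parameters()), lr=1e-4
)
```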
**Results:**
- BRAVE achieves the best results on several captioning and VQA benchmarks.
- Significantly improves robustness against out-of-distribution inputs and visual hallucinations.
- Uses significantly fewer trainable parameters compared to other methods.
**Conclusion:**
BRAVE broadens the visual capabilities of VLMs by combining diverse features from multiple vision encoders, leading to improved performance and robustness.