3 May 2024 | Piotr Padlewski*, Max Bain*, Matthew Henderson, Zhongkai Zhu, Nishant Relan, Hai Pham, Donovan Ong, Kaloyan Alekseev, Aitor Ormazabal, Samuel Phua, Ethan Yeo, Eugenie Lamprecht, Qi Liu, Yuqi Wang, Eric Chen, Deyu Fu, Lei Li, Che Zheng, Cyprien de Masson d'Autume, Dani Yogatama, Mikel Artetxe, Yi Tay
**Vibe-Eval: A Hard Evaluation Suite for Measuring Progress of Multimodal Language Models**
**Abstract:**
Vibe-Eval is a new open benchmark and framework for evaluating multimodal chat models. It consists of 269 visual understanding prompts, including 100 of hard difficulty, each paired with a gold-standard response authored by experts. The benchmark has two goals: vibe-checking multimodal chat models on everyday tasks and rigorously probing the capabilities of current frontier models. Notably, more than 50% of the hard prompts are answered incorrectly by all frontier models tested. Vibe-Eval includes an automated evaluation protocol that uses Reka Core as a judge and roughly correlates with human judgment. Lightweight evaluation is freely available via an API, and formal human evaluations are planned for public models that score well under the automatic protocol. The evaluation code and data are available on GitHub.
**Introduction:**
As multimodal language models approach human-level performance on existing benchmarks, static benchmarks lose their ability to differentiate models. Vibe-Eval addresses this by providing a set of diverse, high-quality image-text prompts with gold-standard responses. The prompts fall into two categories, normal and hard, with 169 and 100 prompts respectively; the hard set contains prompts that Reka Core cannot solve, making it challenging for all current frontier models. The evaluation protocol uses Reka Core as an automated judge, and its scores roughly correlate with human judgment. The benchmark also yields an initial ranking of leading multimodal language models, including GPT-4V, Claude 3 Opus, and Gemini 1.5 Pro.
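As a rough illustration of how the released data might be inspected, the snippet below loads a local JSONL export and counts the normal/hard split. The file name and field names (`category`, `prompt`, `reference`) are assumptions for illustration, not the official schema, which is defined by the files in the GitHub release.

```python
# Minimal sketch for inspecting the released prompts. File and field names
# ("category", "prompt", "reference") are assumptions for illustration.
import json
from collections import Counter

def load_examples(path: str) -> list[dict]:
    """Read one JSON object per line (JSONL)."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

examples = load_examples("vibe-eval.jsonl")        # hypothetical local filename
print(f"total prompts: {len(examples)}")           # expected: 269
print(Counter(ex["category"] for ex in examples))  # expected: 169 normal / 100 hard
```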
**Vibe-Eval:**
- **Overview:** Vibe-Eval consists of 269 prompts, each with an image and a task requiring visual understanding. Prompts are categorized into normal and hard sets.
- **Data Collection:** Prompts and gold-standard reference responses are authored by experts and pass through multiple rounds of review to ensure quality.
- **Evaluation Protocol:** Reka Core judges each model response against the reference, scoring accuracy and awarding partial credit for partially correct answers (a sketch of this judge-scoring loop follows the list).
- **Results:** Vibe-Eval scores and human preference rankings both show that Gemini 1.5 Pro and GPT-4V perform well, with relative rankings shifting between the normal and hard prompts.
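To make the evaluation-protocol item concrete, here is a small judge-based scoring loop. It is a sketch of the general LLM-as-judge pattern with partial credit, not the released implementation: the prompt template, the 1-to-5 rubric, the normalization to [0, 1], and the record field names are assumptions, and `generate` / `judge` stand in for whatever model clients are used.

```python
import json
import re
from statistics import mean

# Illustrative judge prompt; the official template and rubric live in the
# released evaluation code. The 1-5 scale here is an assumption.
JUDGE_TEMPLATE = """\
You are grading a model's answer to a visual-understanding prompt.

[Question]: {prompt}
[Gold-standard answer]: {reference}
[Model answer]: {response}

Rate the model answer from 1 (completely wrong) to 5 (fully correct),
giving partial credit for partially correct answers.
Reply with the rating only, e.g. "3"."""


def judge_response(prompt: str, reference: str, response: str, judge) -> float:
    """Ask the judge model for a 1-5 rating and normalize it to [0, 1]."""
    reply = judge(JUDGE_TEMPLATE.format(prompt=prompt, reference=reference, response=response))
    match = re.search(r"[1-5]", reply)
    if match is None:
        return 0.0  # treat an unparseable judgement as incorrect
    return (int(match.group()) - 1) / 4  # 1 -> 0.0, 5 -> 1.0, partial credit in between


def evaluate(path: str, generate, judge) -> float:
    """Score a JSONL file whose records have (assumed) fields prompt, reference, image_url."""
    scores = []
    with open(path) as f:
        for line in f:
            ex = json.loads(line)
            answer = generate(ex["prompt"], ex["image_url"])  # candidate model under test
            scores.append(judge_response(ex["prompt"], ex["reference"], answer, judge))
    return mean(scores)  # mean normalized score across all prompts
```

In this pattern, `generate` is the model under test (taking a text prompt and an image reference) and `judge` is a text-only call to the judge model; ranking models then reduces to comparing their mean normalized scores on the normal and hard subsets.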
**Discussion & Insights:**
- **Hard Prompts:** Creating and evaluating hard prompts is challenging due to the need for multiple reasoning steps and error-free solutions.
- **Evaluation Methods:** Both human and automated evaluations have their limitations, but they complement each other well.
- **Inverse Scaling:** Some prompts show inverse scaling, where smaller models outperform larger ones, suggesting strong language bias in large models.
**Conclusion:**
Vibe-Eval is a valuable resource for evaluating multimodal chat models, providing a challenging set of prompts and a robust evaluation framework.