Vibe-Eval: A hard evaluation suite for measuring progress of multimodal language models

3 May 2024 | Piotr Padlewski, Max Bain, Matthew Henderson, Zhongkai Zhu, Nishant Relan, Hai Pham, Donovan Ong, Kaloyan Aleksiev, Aitor Ormazabal, Samuel Phua, Ethan Yeo, Eugene Lamprecht, Qi Liu, Yuqi Wang, Eric Chen, Deyu Fu, Lei Li, Che Zheng, Cyprien de Masson d'Autume, Dani Yogatama, Mikel Artetxe, Yi Tay
Vibe-Eval is a new open benchmark and framework for evaluating multimodal chat models. It consists of 269 visual understanding prompts, including 100 of hard difficulty, each paired with a gold-standard response authored by experts. The benchmark is open-ended and challenging, with dual objectives: (i) to vibe-check multimodal chat models on day-to-day tasks and (ii) to rigorously test and probe the capabilities of present frontier models. Notably, more than 50% of the questions in the hard set are answered incorrectly by all frontier models. The prompts are diverse and open-ended and can require multiple reasoning steps to solve; the hard set consists of prompts that Reka Core (Reka, 2024) was unable to solve at the time of collection. The benchmark is accompanied by an official evaluation protocol that uses Reka Core as an automated judge, and this automatic evaluation is shown to correlate with human judgment. Human preference data, collected through a third-party data annotation company, yields roughly the same relative model rankings as the automatic evaluation. The paper also discusses the challenges and considerations involved in curating a dataset of hard questions, the results of frontier models on it, and the use of automated evaluators.
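To make the automated-judge protocol described above concrete, the sketch below scores a model's answers against the expert-written references using a judge model. This is a minimal illustration, not the official Vibe-Eval implementation: the `call_judge` wrapper, the prompt wording, and the 1-to-5 rubric here are assumptions chosen for clarity.

```python
# Minimal sketch of an LLM-as-judge evaluation loop in the spirit of the
# Vibe-Eval protocol. `call_judge` is a hypothetical stand-in for the judge
# model's API (e.g. Reka Core); the template and rubric are illustrative.
import re
from dataclasses import dataclass

@dataclass
class Example:
    prompt: str     # the visual-understanding question (image handling omitted)
    reference: str  # expert-written gold-standard answer
    response: str   # candidate model's answer to be graded

JUDGE_TEMPLATE = """You are grading a model's answer to a question.
Question: {prompt}
Reference answer: {reference}
Model answer: {response}
Rate the model answer from 1 (completely wrong) to 5 (fully correct),
then output the rating on its own line as: RATING: <number>"""

def call_judge(judge_prompt: str) -> str:
    """Hypothetical wrapper around the judge model's chat API."""
    raise NotImplementedError("plug in your judge model client here")

def score_example(ex: Example) -> int:
    """Ask the judge model for a 1-5 rating and parse it from the reply."""
    reply = call_judge(JUDGE_TEMPLATE.format(
        prompt=ex.prompt, reference=ex.reference, response=ex.response))
    match = re.search(r"RATING:\s*([1-5])", reply)
    return int(match.group(1)) if match else 1  # lowest score on parse failure

def benchmark_score(examples: list[Example]) -> float:
    """Average the 1-5 ratings and rescale to a 0-100 range for readability."""
    ratings = [score_example(ex) for ex in examples]
    return 100.0 * (sum(ratings) / len(ratings) - 1) / 4
```

In practice, the judge's parsed ratings can be validated against a sample of human preference judgments, as the paper does when showing that automatic and human evaluations produce roughly the same model rankings.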