3 May 2024 | Piotr Padlewski*, Max Bain*, Matthew Henderson, Zhongkai Zhu, Nishant Relan, Hai Pham, Donovan Ong, Kaloyan Alekseev, Aitor Ormazabal, Samuel Phua, Ethan Yeo, Eugenie Lamprecht, Qi Liu, Yuqi Wang, Eric Chen, Deyu Fu, Lei Li, Che Zheng, Cyprien de Masson d'Autume, Dani Yogatama, Mikel Artetxe, Yi Tay
**Vibe-Eval: A Hard Evaluation Suite for Measuring Progress of Multimodal Language Models**
**Abstract:**
Vibe-Eval is a new open benchmark and framework for evaluating multimodal chat models. It consists of 269 visual understanding prompts, including 100 of hard difficulty, each paired with a gold-standard response authored by experts. The benchmark has two goals: vibe-checking multimodal chat models on everyday tasks and rigorously probing the capabilities of current frontier models. Notably, more than 50% of the hard prompts are answered incorrectly by all frontier models tested. Vibe-Eval includes an automated evaluation protocol that uses Reka Core as a judge and roughly correlates with human judgment. Lightweight evaluation is freely available via an API, and formal human evaluations are planned for public models that score well under the automatic protocol. The evaluation code and data are available on GitHub.
**Introduction:**
As multimodal language models approach human-level performance on existing benchmarks, static benchmarks lose their ability to differentiate models. Vibe-Eval addresses this by providing a set of diverse, high-quality image-text prompts with gold-standard responses. The prompts fall into two categories, normal and hard, with 169 and 100 prompts respectively; the hard set contains prompts that Reka Core cannot solve, making it challenging for all current frontier models. The evaluation protocol uses Reka Core as an automated judge, and its scores roughly correlate with human judgment. The benchmark also yields an initial ranking of leading multimodal language models, including GPT-4V, Claude 3 Opus, and Gemini 1.5 Pro.
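As a rough illustration of how the released data might be inspected, the snippet below loads a local JSONL export and counts the normal/hard split. The file name and field names (`category`, `prompt`, `reference`) are assumptions for illustration, not the official schema, which is defined by the files in the GitHub release.

```python
# Minimal sketch for inspecting the released prompts. File and field names
# ("category", "prompt", "reference") are assumptions for illustration.
import json
from collections import Counter

def load_examples(path: str) -> list[dict]:
    """Read one JSON object per line (JSONL)."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

examples = load_examples("vibe-eval.jsonl")        # hypothetical local filename
print(f"total prompts: {len(examples)}")           # expected: 269
print(Counter(ex["category"] for ex in examples))  # expected: 169 normal / 100 hard
```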
**Vibe-Eval:**
- **Overview:** Vibe-Eval consists of 269 prompts, each with an image and a task requiring visual understanding. Prompts are categorized into normal and hard sets.
- **Data Collection:** Prompts and gold-standard reference responses are authored by experts and pass through multiple rounds of review to ensure quality.
- **Evaluation Protocol:** Reka Core judges each model response against the reference, scoring accuracy and awarding partial credit for partially correct answers (a sketch of this judge-scoring loop follows the list).
- **Results:** Vibe-Eval scores and human preference rankings both show that Gemini 1.5 Pro and GPT-4V perform well, with relative rankings shifting between the normal and hard prompts.
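To make the evaluation-protocol item concrete, here is a small judge-based scoring loop. It is a sketch of the general LLM-as-judge pattern with partial credit, not the released implementation: the prompt template, the 1-to-5 rubric, the normalization to [0, 1], and the record field names are assumptions, and `generate` / `judge` stand in for whatever model clients are used.

```python
import json
import re
from statistics import mean

# Illustrative judge prompt; the official template and rubric live in the
# released evaluation code. The 1-5 scale here is an assumption.
JUDGE_TEMPLATE = """\
You are grading a model's answer to a visual-understanding prompt.

[Question]: {prompt}
[Gold-standard answer]: {reference}
[Model answer]: {response}

Rate the model answer from 1 (completely wrong) to 5 (fully correct),
giving partial credit for partially correct answers.
Reply with the rating only, e.g. "3"."""


def judge_response(prompt: str, reference: str, response: str, judge) -> float:
    """Ask the judge model for a 1-5 rating and normalize it to [0, 1]."""
    reply = judge(JUDGE_TEMPLATE.format(prompt=prompt, reference=reference, response=response))
    match = re.search(r"[1-5]", reply)
    if match is None:
        return 0.0  # treat an unparseable judgement as incorrect
    return (int(match.group()) - 1) / 4  # 1 -> 0.0, 5 -> 1.0, partial credit in between


def evaluate(path: str, generate, judge) -> float:
    """Score a JSONL file whose records have (assumed) fields prompt, reference, image_url."""
    scores = []
    with open(path) as f:
        for line in f:
            ex = json.loads(line)
            answer = generate(ex["prompt"], ex["image_url"])  # candidate model under test
            scores.append(judge_response(ex["prompt"], ex["reference"], answer, judge))
    return mean(scores)  # mean normalized score across all prompts
```

In this pattern, `generate` is the model under test (taking a text prompt and an image reference) and `judge` is a text-only call to the judge model; ranking models then reduces to comparing their mean normalized scores on the normal and hard subsets.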
**Discussion & Insights:**
- **Hard Prompts:** Creating and evaluating hard prompts is challenging due to the need for multiple reasoning steps and error-free solutions.
- **Evaluation Methods:** Both human and automated evaluations have their limitations, but they complement each other well.
- **Inverse Scaling:** Some prompts show inverse scaling, where smaller models outperform larger ones, suggesting strong language bias in large models.
**Conclusion:**
Vibe-Eval is a valuable resource for evaluating multimodal chat models, providing a challenging set of prompts and a robust evaluation framework.