MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark

11 Jun 2024 | Dongping Chen, Ruoxi Chen, Shilin Zhang, Yaochen Wang, Yinuo Liu, Huichi Zhou, Qihui Zhang, Yao Wan, Pan Zhou, Lichao Sun
This paper introduces a novel benchmark, MLLM-as-a-Judge, to assess the ability of Multimodal Large Language Models (MLLMs) to serve as judges across diverse modalities. The benchmark evaluates MLLMs in three tasks: Scoring Evaluation, Pair Comparison, and Batch Ranking. The study reveals that while MLLMs align well with human preferences in Pair Comparison, they diverge significantly from those preferences in Scoring Evaluation and Batch Ranking. The research also highlights persistent challenges in the judgment capacities of MLLMs, including biases, hallucinations, and inconsistencies, even in advanced models like GPT-4V. The findings emphasize the need for further enhancements and research before MLLMs can be considered fully reliable evaluators. The paper additionally releases two curated datasets to facilitate future studies and discusses the limitations and implications of MLLMs as judges.
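To make the three evaluation settings concrete, below is a minimal, illustrative sketch of how a multimodal judge might be queried in each mode. The prompt wording and the query_mllm_judge helper are hypothetical placeholders, not the paper's actual templates or API; any real MLLM client (e.g., for GPT-4V) would need to be plugged in.

```python
# Illustrative sketch of the three MLLM-as-a-Judge settings: Scoring Evaluation,
# Pair Comparison, and Batch Ranking. Prompts and the judge call are hypothetical.

from typing import List


def query_mllm_judge(image_path: str, prompt: str) -> str:
    """Hypothetical stand-in for a vision-language model call (image + text prompt)."""
    raise NotImplementedError("Plug in your own MLLM client here.")


def scoring_evaluation(image_path: str, instruction: str, response: str) -> str:
    # Ask the judge for an absolute quality score of a single response.
    prompt = (
        f"Instruction: {instruction}\n"
        f"Response: {response}\n"
        "Rate the response on a 1-5 scale for how well it answers the "
        "instruction given the image. Reply with only the number."
    )
    return query_mllm_judge(image_path, prompt)


def pair_comparison(image_path: str, instruction: str,
                    response_a: str, response_b: str) -> str:
    # Ask the judge to pick the better of two candidate responses (or a tie).
    prompt = (
        f"Instruction: {instruction}\n"
        f"Response A: {response_a}\n"
        f"Response B: {response_b}\n"
        "Which response better answers the instruction given the image? "
        "Reply with 'A', 'B', or 'Tie'."
    )
    return query_mllm_judge(image_path, prompt)


def batch_ranking(image_path: str, instruction: str, responses: List[str]) -> str:
    # Ask the judge to order several candidate responses from best to worst.
    listed = "\n".join(f"({i + 1}) {r}" for i, r in enumerate(responses))
    prompt = (
        f"Instruction: {instruction}\n"
        f"Candidate responses:\n{listed}\n"
        "Rank the candidates from best to worst, e.g. '2 > 1 > 3'."
    )
    return query_mllm_judge(image_path, prompt)
```

The three functions differ only in how the judgment is posed, which mirrors the paper's finding that agreement with human preferences depends strongly on the judging format: relative pairwise choices align better than absolute scores or full rankings.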