MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark

2024 | Dongping Chen, Ruoxi Chen, Shilin Zhang, Yaochen Wang, Yinnuo Liu, Huichi Zhou, Qihui Zhang, Yao Wan, Pan Zhou, Lichao Sun
This paper introduces MLLM-as-a-Judge, a benchmark for assessing how well Multimodal Large Language Models (MLLMs) can act as judges across diverse vision-language modalities. The benchmark covers three tasks, Scoring Evaluation, Pair Comparison, and Batch Ranking, and includes two curated datasets, MLLM-AS-A-JUDGE-HQ and MLLM-AS-A-JUDGE-HARD, which provide high-quality and deliberately challenging examples, respectively.

The study finds that while MLLMs perform well in Pair Comparison, they struggle with Scoring Evaluation and Batch Ranking, diverging significantly from human preferences. Even advanced models such as GPT-4V exhibit persistent biases, hallucinations, and inconsistencies in their judgments, and the authors argue that further work is needed before MLLMs can be considered fully reliable evaluators.

The paper also analyzes factors that affect judging performance: supplying detailed image descriptions substantially improves results, whereas multi-step chain-of-thought (CoT) reasoning does not necessarily help. The authors conclude that MLLMs show promise as judges but require further refinement to align with human preferences and to reduce biases and hallucinations, and they point to future directions such as human-in-the-loop evaluation and more sophisticated reasoning frameworks for MLLMs.
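As a rough illustration of how agreement between an MLLM judge and human annotators might be quantified for the three tasks, the sketch below computes a scoring correlation, a pairwise-comparison accuracy, and a ranking correlation. The metric choices (Pearson correlation, pairwise accuracy, Kendall's tau) and all data values are illustrative assumptions, not the paper's released evaluation code.

```python
# Illustrative sketch (assumed metrics and toy data), not the authors' code:
# measuring MLLM-judge vs. human agreement on the three benchmark tasks.
from scipy.stats import pearsonr, kendalltau

# --- Scoring Evaluation: judge assigns a numeric score to each response ---
judge_scores = [4, 3, 5, 2, 4]   # hypothetical MLLM judge scores
human_scores = [5, 3, 4, 2, 3]   # hypothetical human scores
score_corr, _ = pearsonr(judge_scores, human_scores)

# --- Pair Comparison: judge picks the better of two responses (A or B) ---
judge_choices = ["A", "B", "A", "A", "B"]
human_choices = ["A", "B", "B", "A", "B"]
pair_acc = sum(j == h for j, h in zip(judge_choices, human_choices)) / len(human_choices)

# --- Batch Ranking: judge orders a set of responses from best to worst ---
judge_ranking = [1, 2, 3, 4]     # rank positions assigned by the judge
human_ranking = [1, 3, 2, 4]     # rank positions assigned by humans
rank_tau, _ = kendalltau(judge_ranking, human_ranking)

print(f"Scoring Evaluation, Pearson r: {score_corr:.2f}")
print(f"Pair Comparison, accuracy:     {pair_acc:.2f}")
print(f"Batch Ranking, Kendall tau:    {rank_tau:.2f}")
```

Higher correlation and accuracy indicate closer alignment with human preferences; the paper's reported gap between strong Pair Comparison and weaker Scoring Evaluation and Batch Ranking corresponds to differences in exactly this kind of agreement measure.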