Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

24 Dec 2023 | Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, Ion Stoica
This paper introduces MT-bench, a multi-turn question set, and Chatbot Arena, a crowdsourced battle platform, to study how well strong large language models (LLMs) can approximate human preferences when judging chat assistants. The authors argue that traditional benchmarks such as MMLU and HELM are insufficient for assessing LLMs in open-ended, multi-turn dialogue, where human preference is what ultimately matters. They therefore propose using strong LLMs, such as GPT-4, as judges that grade or compare chatbot answers to open-ended questions.

The paper examines the usage and limitations of LLM-as-a-judge, identifying position bias, verbosity bias, and self-enhancement bias, as well as weaknesses in grading math and reasoning questions. It proposes mitigations including swapping answer positions, reference-guided grading, chain-of-thought prompting, and fine-tuning judge models.

Verifying LLM judgments against human votes on several LLaMA and Vicuna variants, the authors show that GPT-4 matches both controlled (expert) and crowdsourced human preferences with over 80% agreement, comparable to the level of agreement between humans, which makes LLM-as-a-judge a scalable and explainable way to approximate human preferences. They also advocate a hybrid evaluation framework that combines capability-based benchmarks with preference-based ones, so that both core capabilities and human alignment can be assessed. The 80 MT-bench questions, 3K expert votes, and 30K conversations with human preferences are publicly available.
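As a concrete illustration of the pairwise LLM-as-a-judge protocol, the sketch below asks a judge model to compare two answers and mitigates position bias by judging each pair twice with the answer order swapped, counting inconsistent verdicts as ties. This is a minimal sketch, not the paper's implementation: the prompt wording is paraphrased rather than the exact MT-bench judge prompt, and the helper names (judge_once, judge_pair) and the use of the openai Python client are assumptions of this example.

```python
"""Minimal sketch of pairwise LLM-as-a-judge with a position-swap check.

Assumes the `openai` Python package (v1 client) and an OPENAI_API_KEY in the
environment. Prompt text is paraphrased from the MT-bench style, not verbatim.
"""
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = (
    "You are an impartial judge. Compare the two AI assistant answers to the "
    "user question below. Consider helpfulness, relevance, accuracy, and level "
    "of detail. Do not let answer length or the order of presentation bias "
    "you. Output exactly one verdict: [[A]] if assistant A is better, [[B]] if "
    "assistant B is better, or [[C]] for a tie.\n\n"
    "[Question]\n{question}\n\n"
    "[Assistant A's answer]\n{answer_a}\n\n"
    "[Assistant B's answer]\n{answer_b}\n"
)


def judge_once(question: str, answer_a: str, answer_b: str) -> str:
    """Ask the judge model for a single verdict: 'A', 'B', or 'tie'."""
    prompt = JUDGE_PROMPT.format(
        question=question, answer_a=answer_a, answer_b=answer_b
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic judging
    )
    text = resp.choices[0].message.content
    if "[[A]]" in text:
        return "A"
    if "[[B]]" in text:
        return "B"
    return "tie"


def judge_pair(question: str, answer_1: str, answer_2: str) -> str:
    """Judge twice with the answers in both orders to mitigate position bias.

    Declare a winner only if both verdicts agree; otherwise call it a tie.
    """
    first = judge_once(question, answer_1, answer_2)   # model 1 shown as A
    second = judge_once(question, answer_2, answer_1)  # model 2 shown as A
    if first == "A" and second == "B":
        return "model_1"
    if first == "B" and second == "A":
        return "model_2"
    return "tie"  # inconsistent or tied verdicts -> conservative tie
```

Defaulting to a tie when the two orderings disagree trades some sensitivity for consistency, mirroring the conservative handling of position bias discussed in the paper.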