Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

24 Dec 2023 | Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, Ion Stoica
This paper explores the use of large language models (LLMs) as judges to evaluate chat assistants on open-ended questions, addressing the challenge of measuring human preferences. The authors introduce two benchmarks: MT-bench, a multi-turn question set, and Chatbot Arena, a crowdsourced battle platform. They examine the limitations of LLM-as-a-judge, including position bias, verbosity bias, self-enhancement bias, and limited reasoning ability, and propose solutions to mitigate these issues. The results show that strong LLMs like GPT-4 can achieve over 80% agreement with human preferences, matching the level of agreement among humans. This suggests that LLM-as-a-judge is a scalable and explainable method to approximate human preferences, complementing traditional benchmarks. The paper also highlights the need for a hybrid evaluation framework combining capability-based and preference-based benchmarks. The datasets and methods are publicly available, providing a valuable resource for future research.
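To illustrate how such a pairwise LLM judge can be run in practice, the sketch below shows one way to prompt a judge model and query it twice with swapped answer positions to counter position bias. This is an illustration only: the `query_judge_llm` wrapper, the prompt template, and the verdict parsing are assumptions for the sketch, not the authors' exact released implementation.

```python
# Minimal sketch of pairwise LLM-as-a-judge with position-swap debiasing.
# `query_judge_llm` is a hypothetical stand-in for a call to a strong judge
# model (e.g. GPT-4); it is assumed to return the raw text of the verdict.

JUDGE_PROMPT = """[System] Please act as an impartial judge and evaluate the
responses of two AI assistants to the user question below. Output your final
verdict as "[[A]]" if assistant A is better, "[[B]]" if assistant B is better,
or "[[C]]" for a tie.

[User Question]
{question}

[Assistant A's Answer]
{answer_a}

[Assistant B's Answer]
{answer_b}
"""


def query_judge_llm(prompt: str) -> str:
    """Hypothetical wrapper around the judge model's API; replace with a real client call."""
    raise NotImplementedError


def parse_verdict(text: str) -> str:
    """Extract the judge's verdict label ('A', 'B', or 'C') from its raw output."""
    for label in ("[[A]]", "[[B]]", "[[C]]"):
        if label in text:
            return label.strip("[]")
    return "C"  # treat unparseable output as a tie


def judge_pair(question: str, answer_1: str, answer_2: str) -> str:
    """Judge (answer_1, answer_2) twice with swapped positions to reduce position bias.

    Returns "1", "2", or "tie"; a win counts only if it survives the position swap.
    """
    v_forward = parse_verdict(query_judge_llm(
        JUDGE_PROMPT.format(question=question, answer_a=answer_1, answer_b=answer_2)))
    v_swapped = parse_verdict(query_judge_llm(
        JUDGE_PROMPT.format(question=question, answer_a=answer_2, answer_b=answer_1)))

    if v_forward == "A" and v_swapped == "B":
        return "1"
    if v_forward == "B" and v_swapped == "A":
        return "2"
    return "tie"  # inconsistent or tied verdicts count as a tie
```

Agreement with human preferences can then be estimated as the fraction of question-pair votes where the judge's verdict matches the human majority verdict, which is the kind of comparison behind the reported 80%+ figure.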