Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

24 Dec 2023 | Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, Ion Stoica
This paper explores the use of large language models (LLMs) as judges to evaluate chat assistants on open-ended questions, addressing the challenge of measuring human preferences. The authors introduce two benchmarks: MT-bench, a multi-turn question set, and Chatbot Arena, a crowdsourced battle platform. They examine the limitations of LLM-as-a-judge, including position bias, verbosity bias, self-enhancement bias, and limited reasoning ability, and propose solutions to mitigate these issues. The results show that strong LLMs like GPT-4 can achieve over 80% agreement with human preferences, matching the level of agreement among humans. This suggests that LLM-as-a-judge is a scalable and explainable method to approximate human preferences, complementing traditional benchmarks. The paper also highlights the need for a hybrid evaluation framework combining capability-based and preference-based benchmarks. The datasets and methods are publicly available, providing a valuable resource for future research.
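To illustrate how such a pairwise LLM judge can be run in practice, the sketch below shows one way to prompt a judge model and query it twice with swapped answer positions to counter position bias. This is an illustration only: the `query_judge_llm` wrapper, the prompt template, and the verdict parsing are assumptions for the sketch, not the authors' exact released implementation.

```python
# Minimal sketch of pairwise LLM-as-a-judge with position-swap debiasing.
# `query_judge_llm` is a hypothetical stand-in for a call to a strong judge
# model (e.g. GPT-4); it is assumed to return the raw text of the verdict.

JUDGE_PROMPT = """[System] Please act as an impartial judge and evaluate the
responses of two AI assistants to the user question below. Output your final
verdict as "[[A]]" if assistant A is better, "[[B]]" if assistant B is better,
or "[[C]]" for a tie.

[User Question]
{question}

[Assistant A's Answer]
{answer_a}

[Assistant B's Answer]
{answer_b}
"""


def query_judge_llm(prompt: str) -> str:
    """Hypothetical wrapper around the judge model's API; replace with a real client call."""
    raise NotImplementedError


def parse_verdict(text: str) -> str:
    """Extract the judge's verdict label ('A', 'B', or 'C') from its raw output."""
    for label in ("[[A]]", "[[B]]", "[[C]]"):
        if label in text:
            return label.strip("[]")
    return "C"  # treat unparseable output as a tie


def judge_pair(question: str, answer_1: str, answer_2: str) -> str:
    """Judge (answer_1, answer_2) twice with swapped positions to reduce position bias.

    Returns "1", "2", or "tie"; a win counts only if it survives the position swap.
    """
    v_forward = parse_verdict(query_judge_llm(
        JUDGE_PROMPT.format(question=question, answer_a=answer_1, answer_b=answer_2)))
    v_swapped = parse_verdict(query_judge_llm(
        JUDGE_PROMPT.format(question=question, answer_a=answer_2, answer_b=answer_1)))

    if v_forward == "A" and v_swapped == "B":
        return "1"
    if v_forward == "B" and v_swapped == "A":
        return "2"
    return "tie"  # inconsistent or tied verdicts count as a tie
```

Agreement with human preferences can then be estimated as the fraction of question-pair votes where the judge's verdict matches the human majority verdict, which is the kind of comparison behind the reported 80%+ figure.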