Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference


7 Mar 2024 | Wei-Lin Chiang*, Lianmin Zheng*, Ying Sheng, Anastasios N. Angelopoulos, Tianle Li, Dacheng Li, Banghua Zhu, Hao Zhang, Michael I. Jordan, Joseph E. Gonzalez, Ion Stoica
**Abstract:** Large Language Models (LLMs) have expanded their capabilities, but evaluating their alignment with human preferences remains challenging. To address this, we introduce Chatbot Arena, an open platform for evaluating LLMs based on human preferences. Our methodology uses pairwise comparisons and leverages input from a diverse user base through crowdsourcing. The platform has been operational for several months, accumulating over 240K votes. This paper describes the platform, analyzes the collected data, and explains the statistical methods used for efficient and accurate model evaluation and ranking. We confirm that the crowdsourced questions are sufficiently diverse and discriminating, and that the votes are consistent with expert ratings. These analyses establish a robust foundation for Chatbot Arena's credibility. Owing to its unique value and openness, Chatbot Arena has become one of the most referenced LLM leaderboards, widely cited by leading developers and companies.

**Introduction:** Recent advancements in LLMs have expanded their capabilities beyond traditional NLP tasks. However, evaluating LLM performance remains challenging, especially when assessing alignment with human preferences. Current benchmarks often fail to capture the nuanced and diverse aspects of these models, particularly in real-world, open-ended tasks. Static, ground-truth-based benchmarks have further limitations, including an inability to capture flexible, interactive use cases and a risk of contamination over time. To address these issues, we introduce Chatbot Arena, a benchmarking platform built around anonymous, randomized battles in a crowdsourced setting. Users ask questions, receive answers from two anonymous LLMs, and vote for the preferred response. We employ statistical techniques to estimate model rankings efficiently and accurately. Our data analysis shows that user-generated questions are diverse and challenging, and that votes are highly consistent with expert evaluations.

**Methods:** We collect user feedback through pairwise comparisons, in which users compare two model responses and vote for the better one. We design the interface to reduce friction for users and to preserve model anonymity. We estimate the win matrix and model scores using statistical methods, including the Bradley-Terry model and E-values, and we develop efficient sampling algorithms that accelerate ranking convergence while maintaining statistical validity (illustrative sketches of both components follow this summary).

**Results:** We have received over 240K votes from about 90K users, covering more than 50 models and over 100 languages. Our data analysis shows that user prompts are diverse and effective at distinguishing model strengths. We validate the quality of crowdsourced votes by having experts relabel a sample of the data. Our experiments demonstrate the effectiveness of the ranking system and the active sampling rule, and we also evaluate the detection of anomalous users.

**Discussion:** The platform addresses the limitations of static benchmarks and provides a robust, open-source benchmark for LLM evaluation. Future work includes developing comprehensive topic-specific leaderboards and improving the detection of harmful users.

**Conclusion:** Chatbot Arena is an open platform for evaluating LLMs through crowdsourced, pairwise human preferences. Our analysis validates the diversity and quality of the crowdsourced prompts and votes, providing a credible foundation for the resulting leaderboard.
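The Methods summary also mentions efficient sampling algorithms that accelerate ranking convergence. The sketch below shows one simple way to bias pair selection toward uncertain matchups by weighting each pair with the width of a binomial confidence interval on its estimated win rate; this is a simplified stand-in for the paper's active sampling rule, not its exact algorithm, and all names are illustrative.

```python
# Minimal sketch of uncertainty-weighted pair sampling (illustrative only).
import itertools
import math
import random
from collections import defaultdict

wins = defaultdict(int)    # wins[(a, b)]: times model a beat model b
counts = defaultdict(int)  # counts[(a, b)]: battles played between a and b

def ci_width(a, b, z=1.96):
    """Approximate 95% CI width for P(a beats b); widest when data is scarce."""
    n = counts[(a, b)]
    if n == 0:
        return 1.0  # maximal uncertainty before any votes are observed
    p = wins[(a, b)] / n
    return 2.0 * z * math.sqrt(p * (1.0 - p) / n + 1e-9)

def sample_pair(models):
    """Pick the next pair to show, weighted by current uncertainty."""
    pairs = list(itertools.combinations(models, 2))
    weights = [ci_width(a, b) for a, b in pairs]
    return random.choices(pairs, weights=weights, k=1)[0]

def record_vote(a, b, a_won):
    """Record one battle outcome for the (a, b) ordering used by sample_pair."""
    counts[(a, b)] += 1
    if a_won:
        wins[(a, b)] += 1

# Toy usage:
models = ["model-x", "model-y", "model-z"]
pair = sample_pair(models)
record_vote(*pair, a_won=True)
print(pair, sample_pair(models))
```

Sampling more from uncertain pairs concentrates votes where they shrink confidence intervals fastest, which is the intuition behind accelerating ranking convergence while keeping the estimates statistically valid.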