11 Jun 2024 | Aidar Myrzakhan*, Sondos Mahmoud Bsharat*, and Zhiqiang Shen*
The paper "Open-LLM-Leaderboard: From Multi-choice to Open-style Questions for LLMs Evaluation, Benchmark, and Arena" addresses the limitations of using multiple-choice questions (MCQs) to evaluate large language models (LLMs). It proposes shifting from MCQs to open-style questions to eliminate selection bias and random guessing. The authors introduce an automated framework, the Open-LLM-Leaderboard, which uses open-style questions to track and compare the performance of various LLMs, including GPT-4o/4/3.5, Claude 3, Gemini, and others. The framework aims to provide a more accurate and fair evaluation of LLMs by addressing two challenges: identifying MCQs suitable for conversion to open-style questions and validating LLM responses against human-annotated ground truths. The paper details a multi-stage filtering process for converting MCQs into open-style questions and an evaluation method for assessing the correctness of free-form LLM responses. In the reported results, GPT-4o performs best on open-style question answering, followed by GPT-4-1106-preview and Claude-3 Opus. The study also highlights the diversity and quality of the questions in the benchmark dataset, demonstrating its effectiveness in evaluating LLMs' capabilities.
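To make the pipeline concrete, here is a minimal Python sketch of the general idea: strip the answer options from an MCQ so the model must generate rather than select, and then check the free-form answer against the human-annotated ground truth with a judge model. This is not the paper's code; the helper names (mcq_to_open, grade_open_answer) and the prompt wording are hypothetical, and the judge is stubbed out where a real LLM API call would go.

```python
# Illustrative sketch only, not the Open-LLM-Leaderboard implementation.
from typing import Callable

def mcq_to_open(question: str) -> str:
    """Hypothetical helper: drop the answer options so the question becomes open-style."""
    return question.split("\nA)")[0].strip()

def grade_open_answer(question: str, model_answer: str, ground_truth: str,
                      judge: Callable[[str], str]) -> bool:
    """Ask a judge model whether the free-form answer matches the ground truth."""
    prompt = (
        f"Question: {question}\n"
        f"Ground-truth answer: {ground_truth}\n"
        f"Model answer: {model_answer}\n"
        "Does the model answer convey the same meaning as the ground truth? Reply Yes or No."
    )
    return judge(prompt).strip().lower().startswith("yes")

if __name__ == "__main__":
    # Stub judge for demonstration; in practice this would call an LLM API.
    stub_judge = lambda p: "Yes"
    q = "What is the capital of France?\nA) Berlin\nB) Paris\nC) Rome\nD) Madrid"
    open_q = mcq_to_open(q)
    print(open_q)                                                  # open-style question
    print(grade_open_answer(open_q, "Paris", "Paris", stub_judge)) # True
```

In the paper itself, the conversion step additionally involves multi-stage filtering to discard MCQs that only make sense with options attached, which the sketch above does not attempt to model.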