Open-LLM-Leaderboard: From Multi-choice to Open-style Questions for LLMs Evaluation, Benchmark, and Arena

11 Jun 2024 | Aidar Myrzakhan, Sondos Mahmoud Bsharat, and Zhiqiang Shen
The Open-LLM-Leaderboard is a new benchmark for evaluating large language models (LLMs) with open-style questions, addressing two well-known limitations of multiple-choice questions (MCQs): selection bias and random guessing. MCQs can yield biased results because models show inherent preferences for certain answer choices, and random guessing can inflate scores, especially for smaller LLMs. Open-style questions, which require a model to generate an answer without predefined choices, remove these issues but introduce two new challenges: deciding which questions are suitable to convert, and validating free-form answers against human-annotated ground truth.

To address these challenges, the Open-LLM-Leaderboard introduces a framework that automatically filters and converts MCQs into open-style questions. The framework uses a two-stage filtering method: a coarse filter identifies questions that are potentially answerable without their answer choices, and a fine-grained filter assigns a confidence score so that only high-quality open-style questions are retained. A custom prompt is then used to judge whether a model's open-style answer matches the ground truth, enabling accurate automatic assessment. A minimal sketch of this pipeline is given below.
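As a hedged illustration of the pipeline described above, the sketch below implements a coarse YES/NO convertibility check, a fine-grained confidence score, and an LLM judge for open-style answers. The `ask_llm` helper, the prompt wording, and the 0-10 threshold are assumptions for illustration, not the paper's exact prompts or code.

```python
# Minimal sketch of the MCQ -> open-style pipeline described above.
# `ask_llm` is a hypothetical stand-in for whatever chat-completion client you use;
# the prompt wording and the 0-10 confidence scale are illustrative assumptions.

from dataclasses import dataclass


@dataclass
class MCQ:
    question: str
    choices: list[str]
    answer: str  # ground-truth choice text


def ask_llm(prompt: str) -> str:
    """Hypothetical helper: send `prompt` to an LLM and return its text reply."""
    raise NotImplementedError("plug in your own API client here")


def coarse_filter(item: MCQ) -> bool:
    """Stage 1: ask whether the question is answerable without seeing the choices."""
    prompt = (
        "Can the following question be answered correctly without seeing any "
        "answer choices? Reply with YES or NO only.\n\n"
        f"Question: {item.question}"
    )
    return ask_llm(prompt).strip().upper().startswith("YES")


def fine_filter(item: MCQ, threshold: int = 8) -> bool:
    """Stage 2: ask for a confidence score that the open-style version is well-posed."""
    prompt = (
        "On a scale of 0-10, how confident are you that this question has a single, "
        "unambiguous answer when asked in open style (no choices shown)? "
        "Reply with the number only.\n\n"
        f"Question: {item.question}"
    )
    try:
        score = int(ask_llm(prompt).strip())
    except ValueError:
        return False
    return score >= threshold


def grade_open_answer(item: MCQ, model_answer: str) -> bool:
    """Validate a free-form answer against the MCQ ground truth with an LLM judge."""
    prompt = (
        "Decide whether the candidate answer is semantically equivalent to the "
        "reference answer. Reply with CORRECT or INCORRECT only.\n\n"
        f"Question: {item.question}\n"
        f"Reference answer: {item.answer}\n"
        f"Candidate answer: {model_answer}"
    )
    return ask_llm(prompt).strip().upper().startswith("CORRECT")


def build_open_benchmark(pool: list[MCQ]) -> list[MCQ]:
    """Keep only MCQs that survive both filtering stages."""
    return [q for q in pool if coarse_filter(q) and fine_filter(q)]
```

In practice, each of these calls would be backed by a strong LLM and the prompts tuned per dataset; the structure above only mirrors the coarse-to-fine filtering and prompt-based grading the paper describes.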
The benchmark draws questions from several existing datasets, covering a wide range of domains and question types. On the resulting leaderboard, GPT-4o performs best, followed by GPT-4 and Claude-3 Opus; smaller models are also included, showing how they fare on open-style questions. The evaluation combines automated grading with human assessment, and the high agreement between LLM evaluations and human judgments indicates that the benchmark is reliable (a small agreement check, with placeholder data, is sketched after this summary).

Overall, the Open-LLM-Leaderboard provides a more accurate and less biased picture of LLM capabilities than MCQ-based benchmarks, and it offers a faster, cheaper, and more comprehensive alternative to existing evaluations, enabling fair comparison of models across tasks and domains. It underscores the value of open-style questions for assessing LLMs and provides a foundation for future research and development in this area.
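To make the reliability claim concrete, the following sketch shows one common way to quantify agreement between an LLM judge and human annotators: raw percent agreement and Cohen's kappa over matched correct/incorrect labels. The label lists are made-up placeholders, not data from the paper.

```python
# Hedged illustration of a judge-reliability check: compare the LLM judge's
# correct/incorrect labels against human labels on a shared subset of answers.
# The example labels below are placeholders, not results from the paper.

def percent_agreement(llm_labels: list[bool], human_labels: list[bool]) -> float:
    """Fraction of items where the LLM judge and the human annotator agree."""
    assert len(llm_labels) == len(human_labels) and llm_labels
    matches = sum(l == h for l, h in zip(llm_labels, human_labels))
    return matches / len(llm_labels)


def cohens_kappa(llm_labels: list[bool], human_labels: list[bool]) -> float:
    """Chance-corrected agreement (Cohen's kappa) for two binary raters."""
    n = len(llm_labels)
    po = percent_agreement(llm_labels, human_labels)
    p_llm_true = sum(llm_labels) / n
    p_hum_true = sum(human_labels) / n
    pe = p_llm_true * p_hum_true + (1 - p_llm_true) * (1 - p_hum_true)
    return (po - pe) / (1 - pe) if pe < 1 else 1.0


# Example with made-up labels:
llm = [True, True, False, True, False, True]
human = [True, True, False, False, False, True]
print(f"agreement = {percent_agreement(llm, human):.2f}, kappa = {cohens_kappa(llm, human):.2f}")
```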