3 Jun 2024 | Jinjie Ni, Fuzhao Xue, Xiang Yue, Yuntian Deng, Mahir Shah, Kabir Jain, Graham Neubig, Yang You
**MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures**
This paper addresses the challenges in evaluating large language models (LLMs) by proposing MixEval, a novel paradigm that combines off-the-shelf benchmarks with real-world user queries to create efficient, gold-standard evaluations. Traditional ground-truth-based benchmarks often fail to capture the nuances of real-world queries, while LLM-as-judge benchmarks suffer from grading biases and limited query diversity. User-facing evaluations, such as Chatbot Arena, provide reliable signals but are costly and slow.
MixEval bridges the gap between comprehensive, well-distributed real-world queries and efficient, fairly-graded ground-truth benchmarks by matching web-mined queries with similar queries from existing benchmarks. The authors further introduce MixEval-Hard, a more challenging subset of MixEval, to enhance model separability. Key contributions include:
1. **High Correlation with Chatbot Arena**: MixEval and MixEval-Hard achieve a 0.96 correlation with Chatbot Arena, demonstrating their effectiveness in ranking models (a toy sketch of this kind of rank-correlation check appears at the end of this summary).
2. **Efficiency and Cost-Effectiveness**: These benchmarks require only 6% of the time and cost of MMLU, making them significantly more affordable and reproducible.
3. **Dynamic Evaluation**: A rapid and stable data update pipeline ensures dynamic evaluation, mitigating contamination issues.
4. **Comprehensive Query Distribution**: MixEval and MixEval-Hard exhibit a more comprehensive and less biased query distribution compared to other benchmarks.
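To make the benchmark-mixture idea concrete, below is a minimal sketch of the query-matching step described above: each web-mined user query is paired with its most similar question from a pool of existing ground-truth benchmarks, so real-world query topics get evaluated against gold answers that can be graded fairly. The encoder choice (`all-MiniLM-L6-v2` via `sentence-transformers`), the cosine-similarity retrieval, and the toy queries are assumptions for illustration only; the paper's actual matching pipeline may differ.

```python
# Illustrative sketch of benchmark-mixture matching (assumed implementation,
# not the paper's code): pair each web-mined query with the most similar
# question from a pool of existing ground-truth benchmarks.
from sentence_transformers import SentenceTransformer  # assumed encoder library

model = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical embedding model

web_queries = [
    "what's the average time complexity of quicksort",
    "which cells make antibodies after a vaccine",
]
benchmark_pool = [  # (benchmark, question, gold answer) -- toy examples
    ("MMLU", "What is the average-case time complexity of quicksort?", "O(n log n)"),
    ("TriviaQA", "Which immune cells produce antibodies?", "B cells"),
    ("BoolQ", "Is the Great Wall of China visible from space?", "No"),
]

# Embed both sides; normalized embeddings make the dot product a cosine similarity.
q_emb = model.encode(web_queries, normalize_embeddings=True)
b_emb = model.encode([q for _, q, _ in benchmark_pool], normalize_embeddings=True)

# For each web query, keep the nearest benchmark question; that matched item
# (with its ground-truth answer) is what ends up in the mixed benchmark.
sims = q_emb @ b_emb.T  # shape: (num_web_queries, num_benchmark_questions)
for query, idx in zip(web_queries, sims.argmax(axis=1)):
    name, question, answer = benchmark_pool[idx]
    print(f"{query!r} -> [{name}] {question!r} (gold: {answer!r})")
```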
The paper also includes extensive meta-evaluation and analysis, providing insights into the strengths and weaknesses of various LLM benchmarks and guiding future research directions.
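As a rough illustration of the kind of meta-evaluation behind the 0.96 figure, the snippet below rank-correlates a benchmark's per-model scores against Chatbot Arena Elo ratings. All model names and numbers are invented, and Spearman's rho is just one reasonable choice; the paper may use a different correlation measure over a much larger model set.

```python
# Toy meta-evaluation sketch (illustrative only): measure how well a benchmark's
# model ranking agrees with Chatbot Arena's Elo ranking. All values are invented.
from scipy.stats import spearmanr  # assumed dependency for rank correlation

arena_elo = {          # hypothetical Chatbot Arena Elo ratings
    "model_a": 1250, "model_b": 1180, "model_c": 1105, "model_d": 1020,
}
benchmark_score = {    # hypothetical MixEval-style accuracies for the same models
    "model_a": 0.81, "model_b": 0.69, "model_c": 0.74, "model_d": 0.55,
}

models = sorted(arena_elo)
rho, p_value = spearmanr(
    [arena_elo[m] for m in models],
    [benchmark_score[m] for m in models],
)
print(f"Spearman rank correlation with Arena: {rho:.2f} (p = {p_value:.3f})")
```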