MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures

3 Jun 2024 | Jinjie Ni, Fuzhao Xue, Xiang Yue, Yuntian Deng, Mahir Shah, Kabir Jain, Graham Neubig, Yang You
MixEval is a new benchmarking approach that combines off-the-shelf benchmarks with web-mined queries to produce an efficient, gold-standard evaluation of large language models (LLMs). It addresses the limitations of traditional ground-truth benchmarks, which often lack comprehensive and unbiased query distributions, and of user-facing or LLM-judged evaluations, which suffer from grading bias and are costly and slow. MixEval bridges real-world user queries and ground-truth-based benchmarks by matching queries mined from the web with semantically similar queries from existing benchmarks, as sketched below.

The resulting evaluation correlates highly with human preferences, reaching a 0.96 model-ranking correlation with Chatbot Arena, while remaining fast, cheap, and reproducible: it requires only about 6% of the time and cost of running MMLU. MixEval-Hard, a more challenging subset, further improves the ability to differentiate strong models. Both benchmarks are updated dynamically from fresh web queries, which keeps the query distribution current and mitigates contamination; successive versions differ substantially in content yet still yield consistent model rankings, indicating the pipeline is robust. Overall, MixEval and MixEval-Hard outperform other benchmarks in correlation with human preferences and in ranking accuracy, at a fraction of the cost, and the study underscores the importance of comprehensive, unbiased query distributions in LLM evaluation.
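A minimal sketch of the query-matching step described above, assuming an off-the-shelf sentence-embedding model; the embedding model name, the example web queries, and the benchmark pool are illustrative placeholders, not details from the paper:

```python
# Sketch: match web-mined user queries to benchmark queries by semantic similarity.
# The exact retrieval model used by MixEval is not specified in this summary;
# sentence-transformers and "all-MiniLM-L6-v2" are assumptions for illustration.
from sentence_transformers import SentenceTransformer, util

# Hypothetical inputs: queries mined from the web and a pool of benchmark queries.
web_queries = [
    "How do I reverse a linked list in Python?",
    "What caused the 2008 financial crisis?",
]
benchmark_pool = [
    "Explain how to reverse a singly linked list.",
    "Describe the main causes of the 2008 global financial crisis.",
    "What is the boiling point of water at sea level?",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
web_emb = model.encode(web_queries, convert_to_tensor=True)
bench_emb = model.encode(benchmark_pool, convert_to_tensor=True)

# For each web query, keep the most similar benchmark query; the union of the
# retrieved items forms the ground-truth-based evaluation mixture.
scores = util.cos_sim(web_emb, bench_emb)   # shape: (num_web, num_bench)
best = scores.argmax(dim=1)
mixture = [benchmark_pool[i] for i in best.tolist()]
print(mixture)
```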
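The 0.96 figure is a correlation between model results on MixEval and their Chatbot Arena standings. The sketch below shows one way such a ranking correlation could be computed; the model names and scores are placeholders, and the choice of Spearman rank correlation is an assumption, since this summary does not state which coefficient the paper reports.

```python
# Sketch: rank-correlate benchmark scores with Chatbot Arena Elo ratings.
# All numbers below are illustrative, not values from the paper.
from scipy.stats import spearmanr

arena_elo = {"model_a": 1250, "model_b": 1180, "model_c": 1100, "model_d": 1020}
mixeval_score = {"model_a": 0.82, "model_b": 0.76, "model_c": 0.71, "model_d": 0.60}

models = sorted(arena_elo)
rho, _ = spearmanr(
    [arena_elo[m] for m in models],
    [mixeval_score[m] for m in models],
)
print(f"Rank correlation with Chatbot Arena: {rho:.2f}")
```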