3 Jun 2024 | Jinjie Ni, Fuzhao Xue, Xiang Yue, Yuntian Deng, Mahir Shah, Kabir Jain, Graham Neubig, Yang You
MixEval is a new benchmarking approach that combines off-the-shelf benchmarks with web-mined queries to build an efficient, gold-standard evaluation for large language models (LLMs). It addresses the limitations of traditional benchmarks, which often lack query comprehensiveness or suffer from grading bias, and of user-facing evaluations such as Chatbot Arena, which are costly and slow. MixEval bridges real-world user queries and ground-truth-based benchmarks by matching web-mined queries to similar benchmark queries, yielding an impartial evaluation that correlates at 0.96 with Chatbot Arena while requiring only about 6% of the time and cost of running MMLU. MixEval-Hard, a more challenging subset, further improves the ability to differentiate strong models, and both benchmarks are updated dynamically to stay current and reduce data contamination.

In the reported experiments, MixEval and MixEval-Hard achieve higher correlations with human preferences and more reliable model rankings than other common benchmarks, while remaining cheap and fast to run. Comparisons across benchmark versions further indicate that the dynamic updating pipeline is robust, keeping the evaluation aligned with real-world queries over time. The study underscores the importance of a comprehensive, unbiased query distribution in LLM evaluation and offers broader insights into what makes a benchmarking approach effective.
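The central step of the pipeline is pairing each web-mined user query with its most similar ground-truth benchmark query, so the resulting mixture mirrors real-world query distributions while keeping verifiable answers. Below is a minimal sketch of how such embedding-based matching could look; the embedding model ("all-MiniLM-L6-v2"), the greedy nearest-neighbor selection, and the function names are illustrative assumptions, not the paper's actual retrieval setup.

```python
# Illustrative sketch of embedding-based query matching (not the paper's pipeline).
# Assumption: a sentence-embedding model plus greedy nearest-neighbor selection
# stands in for whatever retrieval setup MixEval actually uses.
import numpy as np
from sentence_transformers import SentenceTransformer


def match_queries(web_queries, benchmark_queries, benchmark_ids):
    """For each web-mined query, return the most similar benchmark query."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    web_emb = model.encode(web_queries, normalize_embeddings=True)
    bench_emb = model.encode(benchmark_queries, normalize_embeddings=True)

    # With normalized embeddings, cosine similarity is just a dot product.
    sims = web_emb @ bench_emb.T            # shape: (n_web, n_bench)
    best = sims.argmax(axis=1)              # closest benchmark query per web query

    return [
        (web_queries[i], benchmark_ids[j], float(sims[i, j]))
        for i, j in enumerate(best)
    ]


# Example usage: the selected benchmark items would form the evaluation mixture.
matches = match_queries(
    web_queries=["how do vaccines train the immune system?"],
    benchmark_queries=[
        "How does a vaccine provide immunity?",
        "What is the boiling point of water at sea level?",
    ],
    benchmark_ids=["bench-001", "bench-002"],
)
print(matches)
```

Because the benchmark side of each match carries a ground-truth answer, scoring stays objective and cheap, which is how a mixture like this can track human-preference rankings without the cost of live human judging.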