MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark


23 Jun 2024 | Yubo Wang*, Xueguang Ma*, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyuan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, Wenhu Chen*
MMLU-Pro is a more challenging and robust multi-task language understanding benchmark designed to address the limitations of the original MMLU. It expands the dataset by incorporating more complex reasoning tasks, increasing the number of answer choices from four to ten, and eliminating trivial and noisy questions. MMLU-Pro also improves the stability of model performance across different prompts, reducing the sensitivity of model scores to prompt variations.

Experimental results show that MMLU-Pro significantly raises the difficulty of the benchmark, causing a 16-33% drop in accuracy compared to MMLU. Models using Chain of Thought (CoT) reasoning perform better on MMLU-Pro than with direct answering, reflecting the benchmark's emphasis on complex reasoning. MMLU-Pro is also more discriminative in distinguishing model capabilities, with a larger gap between top-performing models.

The benchmark includes over 12,000 questions spanning 14 diverse domains and undergoes rigorous expert review to ensure quality. Evaluations of more than 50 LLMs show that even leading models like GPT-4o face significant challenges, highlighting the benchmark's effectiveness in testing deeper cognitive processes. By reducing dependency on prompt styles, MMLU-Pro offers improved robustness and serves as a valuable tool for assessing AI language capabilities, addressing issues such as limited question diversity, prompt sensitivity, and the need for more challenging tasks to track progress in AI development.
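As a rough illustration of the evaluation format described above (up to ten answer choices labelled A-J, with the final answer extracted from a CoT response and scored by exact match), here is a minimal Python sketch. The prompt convention "The answer is (X)", the helper names, and the fallback heuristic are assumptions for illustration, not the authors' released evaluation code.

```python
import re
from typing import List, Optional

# MMLU-Pro questions offer up to ten options, labelled A through J.
OPTION_LABELS = "ABCDEFGHIJ"


def extract_choice(cot_response: str) -> Optional[str]:
    """Pull the final answer letter out of a chain-of-thought response.

    Assumes the model was prompted to finish with a phrase such as
    'The answer is (X)'; falls back to the last standalone A-J letter.
    """
    match = re.search(r"answer is \(?([A-J])\)?", cot_response, re.IGNORECASE)
    if match:
        return match.group(1).upper()
    standalone = re.findall(r"\b([A-J])\b", cot_response)
    return standalone[-1] if standalone else None


def accuracy(predictions: List[Optional[str]], gold: List[str]) -> float:
    """Exact-match accuracy over the ten-way multiple-choice format."""
    correct = sum(pred == answer for pred, answer in zip(predictions, gold))
    return correct / len(gold)


if __name__ == "__main__":
    responses = [
        "Step 1: eliminate the distractors... The answer is (C).",
        "Comparing the remaining options... The answer is (J).",
    ]
    preds = [extract_choice(r) for r in responses]
    print(accuracy(preds, ["C", "H"]))  # -> 0.5
```

In practice the model responses would come from the evaluated LLM and the gold labels from the MMLU-Pro dataset; only the extraction and scoring logic is shown here.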