MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

23 Jun 2024 | Yubo Wang*, Xueguang Ma*, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyuan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, Wenhu Chen*
The paper introduces MMLU-Pro, an enhanced benchmark for evaluating multi-task language understanding and reasoning in large language models (LLMs). MMLU-Pro addresses the limitations of the original MMLU benchmark by increasing the complexity and robustness of its questions. Key features include:

1. **Question Expansion**: Each question now has 10 answer options instead of 4, lowering the random-guess baseline from 25% to 10% and reducing the likelihood of getting correct answers by chance.
2. **Content Diversification**: The dataset covers 14 diverse domains, including mathematics, physics, chemistry, law, engineering, psychology, and health, ensuring a broad range of topics.
3. **Expert Review**: Two rounds of expert review are conducted to ensure the accuracy and reliability of the questions.
4. **Chain-of-Thought (CoT) Reasoning**: The benchmark emphasizes deep reasoning, which is crucial for handling complex tasks.

Experimental results show that MMLU-Pro significantly raises the challenge level compared to MMLU, causing a 16% to 33% drop in accuracy. MMLU-Pro is also more stable under varying prompts: the sensitivity of model scores to prompt variations decreases from 4-5% on MMLU to about 2% on MMLU-Pro. Models that use CoT reasoning perform better on MMLU-Pro, indicating that the benchmark contains more complex, reasoning-intensive questions. The paper also includes a detailed error analysis of GPT-4o, the top-performing model on MMLU-Pro, highlighting where it falls short: reasoning errors, lack of specific knowledge, and calculation errors. Overall, MMLU-Pro is designed to be a more discriminative benchmark that effectively tracks the progress of LLMs toward expert-level intelligence.
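To make the evaluation setup concrete, here is a minimal sketch of a CoT-style evaluation loop over MMLU-Pro. It is not the authors' official harness: the Hugging Face dataset id `TIGER-Lab/MMLU-Pro`, the field names (`question`, `options`, `answer`), and the `ask_model` callable are assumptions made for illustration.

```python
# Minimal sketch of a chain-of-thought evaluation loop over MMLU-Pro.
# Assumptions (not taken from the paper): the dataset id "TIGER-Lab/MMLU-Pro",
# the field names "question", "options", "answer", and `ask_model`, a
# placeholder for whatever LLM call you use (prompt str -> response str).
import re
import string
from datasets import load_dataset

def format_prompt(example: dict) -> str:
    """Render a question with up to 10 labelled options (A-J) and a CoT cue."""
    letters = string.ascii_uppercase
    lines = [example["question"]]
    lines += [f"{letters[i]}. {opt}" for i, opt in enumerate(example["options"])]
    lines.append("Let's think step by step, then give the answer as a single letter.")
    return "\n".join(lines)

def extract_choice(response: str) -> str | None:
    """Pull the last standalone option letter out of the model's reasoning."""
    matches = re.findall(r"\b([A-J])\b", response)
    return matches[-1] if matches else None

def evaluate(ask_model, split: str = "test", limit: int = 100) -> float:
    """Return accuracy of `ask_model` on the first `limit` questions of `split`."""
    data = load_dataset("TIGER-Lab/MMLU-Pro", split=split).select(range(limit))
    correct = 0
    for example in data:
        prediction = extract_choice(ask_model(format_prompt(example)))
        correct += int(prediction == example["answer"])
    return correct / len(data)
```

Grading by extracting the final option letter from the model's free-form reasoning, rather than scoring single-token log-probabilities, reflects how CoT outputs are typically evaluated; the answer-extraction rule is therefore one of the details that matters for reproducing reported scores.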