MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark


23 Jun 2024 | Yubo Wang*, Xueguang Ma*, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyuan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, Wenhu Chen*
MMLU-Pro is a more challenging and robust multi-task language understanding benchmark designed to address the limitations of the original MMLU. It expands the dataset by incorporating more complex reasoning tasks, increasing the number of answer choices from four to ten, and eliminating trivial and noisy questions. MMLU-Pro also improves the stability of model performance across different prompts, reducing the sensitivity of model scores to prompt variations.

Experimental results show that MMLU-Pro significantly raises the difficulty of the benchmark, causing a 16-33% drop in accuracy compared to MMLU. Models using Chain of Thought (CoT) reasoning perform better on MMLU-Pro than with direct answering, reflecting the benchmark's emphasis on complex reasoning. MMLU-Pro is also more discriminative in distinguishing model capabilities, with a larger gap between top-performing models.

The benchmark includes over 12,000 questions spanning 14 diverse domains and undergoes rigorous expert review to ensure quality. Evaluations of more than 50 LLMs show that even leading models like GPT-4o face significant challenges, highlighting the benchmark's effectiveness in testing deeper cognitive processes. By reducing dependency on prompt styles, MMLU-Pro offers improved robustness and serves as a valuable tool for assessing AI language capabilities, addressing issues such as limited question diversity, prompt sensitivity, and the need for more challenging tasks to track progress in AI development.
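As a rough illustration of the evaluation format described above (up to ten answer choices labelled A-J, with the final answer extracted from a CoT response and scored by exact match), here is a minimal Python sketch. The prompt convention "The answer is (X)", the helper names, and the fallback heuristic are assumptions for illustration, not the authors' released evaluation code.

```python
import re
from typing import List, Optional

# MMLU-Pro questions offer up to ten options, labelled A through J.
OPTION_LABELS = "ABCDEFGHIJ"


def extract_choice(cot_response: str) -> Optional[str]:
    """Pull the final answer letter out of a chain-of-thought response.

    Assumes the model was prompted to finish with a phrase such as
    'The answer is (X)'; falls back to the last standalone A-J letter.
    """
    match = re.search(r"answer is \(?([A-J])\)?", cot_response, re.IGNORECASE)
    if match:
        return match.group(1).upper()
    standalone = re.findall(r"\b([A-J])\b", cot_response)
    return standalone[-1] if standalone else None


def accuracy(predictions: List[Optional[str]], gold: List[str]) -> float:
    """Exact-match accuracy over the ten-way multiple-choice format."""
    correct = sum(pred == answer for pred, answer in zip(predictions, gold))
    return correct / len(gold)


if __name__ == "__main__":
    responses = [
        "Step 1: eliminate the distractors... The answer is (C).",
        "Comparing the remaining options... The answer is (J).",
    ]
    preds = [extract_choice(r) for r in responses]
    print(accuracy(preds, ["C", "H"]))  # -> 0.5
```

In practice the model responses would come from the evaluated LLM and the gold labels from the MMLU-Pro dataset; only the extraction and scoring logic is shown here.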