7 Jun 2024 | Aryo Pradipta Gema, Joshua Ong Jun Leang, Giwon Hong, Alessio Devoto, Alberto Carlo Maria Mancino, Rohit Saxena, Xuanli He, Yu Zhao, Xiaotang Du, Mohammad Reza Ghasemi Madani, Claire Barale, Robert McHardy, Joshua Harris, Jean Kaddour, Emile van Krieken, Pasquale Minervini
The paper investigates the MMLU benchmark, identifying numerous errors that could mislead evaluations of large language models (LLMs). Despite its popularity, the MMLU dataset contains significant inaccuracies; for example, 57% of the questions in the Virology subset contain errors. To address these issues, the authors introduce MMLU-Redux, a manually re-annotated subset of 3,000 questions across 30 MMLU subjects. Their analysis reveals that many errors stem from parsing mistakes, ambiguous questions, and incorrect ground-truth labels. Using MMLU-Redux, they demonstrate that model performance metrics differ significantly from those originally reported, highlighting the need to revise MMLU's error-prone questions to improve its reliability as a benchmark.

The study also explores automated error detection methods, including zero-shot prompting, few-shot prompting, chain-of-thought prompting, retrieval-augmented generation, and instruction fine-tuning. While these methods show promise, they still struggle to achieve high accuracy in detecting errors. The authors conclude that MMLU-Redux provides a valuable resource for improving the quality of benchmark datasets and emphasize the importance of revisiting and refining the MMLU benchmark to ensure its effectiveness in evaluating LLMs. The study underscores the need for careful annotation and systematic error detection to enhance the reliability of benchmarking in natural language processing.
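As a rough illustration of what zero-shot error detection looks like in practice, here is a minimal sketch, not the authors' actual prompt or pipeline. It assumes access to an OpenAI-compatible chat API, uses "gpt-4o" as a placeholder model name, and collapses the MMLU-Redux error taxonomy into three simplified labels for readability.

```python
# Minimal sketch of zero-shot error detection for a single MMLU item.
# Assumptions (not from the paper): the OpenAI chat API, the "gpt-4o"
# model name, and a simplified three-way label set that only loosely
# mirrors the MMLU-Redux error taxonomy.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = """You are verifying a multiple-choice exam question.

Question: {question}
Choices:
{choices}
Labelled answer: {answer}

Classify the item as exactly one of:
- ok: the question is clear and the labelled answer is correct
- bad_question: the question or choices are malformed or ambiguous
- wrong_ground_truth: the labelled answer is incorrect

Reply with the label only."""


def detect_error(question: str, choices: list[str], answer_idx: int,
                 model: str = "gpt-4o") -> str:
    """Ask the model for a verdict: 'ok', 'bad_question', or 'wrong_ground_truth'."""
    choices_block = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(choices))
    prompt = PROMPT.format(
        question=question,
        choices=choices_block,
        answer=chr(65 + answer_idx),
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower()
```

The few-shot and chain-of-thought variants discussed in the paper would modify only the prompt (adding worked examples or asking for reasoning before the label), while retrieval-augmented generation would prepend retrieved reference passages to the same template.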