Are We Done with MMLU?

7 Jun 2024 | Aryo Pradipta Gema, Joshua Ong Jun Leang, Giwon Hong, Alessio Devoto, Alberto Carlo Maria Mancino, Rohit Saxena, Xuanli He, Yu Zhao, Xiaotang Du, Mohammad Reza Ghasemi Madani, Claire Barale, Robert McHardy, Joshua Harris, Jean Kaddour, Emile van Krieken, Pasquale Minervini
The paper "Are We Done with MMLU?" by Aryo Pradipta Gema et al. critically examines the Massive Multitask Language Understanding (MMLU) benchmark, which is widely used to evaluate the capabilities of Large Language Models (LLMs). The authors identify and analyze numerous ground truth errors in the MMLU dataset, particularly in the *Virology* subset, where 57% of the analyzed questions contain errors. These errors range from simple parsing mistakes to more complex issues related to context and interpretation. To address this issue, the authors introduce a comprehensive framework for identifying dataset errors using a novel error taxonomy and create MMLU-Redux, a subset of 3,000 manually re-annotated questions across 30 MMLU subjects. Using MMLU-Redux, they demonstrate significant discrepancies in model performance metrics, highlighting the need for revising MMLU's error-ridden questions to enhance its reliability and utility as a benchmark. The paper also explores the feasibility of automatically fixing MMLU errors using various prompting techniques and fine-tuning methods, concluding that while these approaches show promise, they are still insufficient to produce a high-quality dataset. The authors open up MMLU-Redux for additional annotation to further improve the dataset's quality and reliability.The paper "Are We Done with MMLU?" by Aryo Pradipta Gema et al. critically examines the Massive Multitask Language Understanding (MMLU) benchmark, which is widely used to evaluate the capabilities of Large Language Models (LLMs). The authors identify and analyze numerous ground truth errors in the MMLU dataset, particularly in the *Virology* subset, where 57% of the analyzed questions contain errors. These errors range from simple parsing mistakes to more complex issues related to context and interpretation. To address this issue, the authors introduce a comprehensive framework for identifying dataset errors using a novel error taxonomy and create MMLU-Redux, a subset of 3,000 manually re-annotated questions across 30 MMLU subjects. Using MMLU-Redux, they demonstrate significant discrepancies in model performance metrics, highlighting the need for revising MMLU's error-ridden questions to enhance its reliability and utility as a benchmark. The paper also explores the feasibility of automatically fixing MMLU errors using various prompting techniques and fine-tuning methods, concluding that while these approaches show promise, they are still insufficient to produce a high-quality dataset. The authors open up MMLU-Redux for additional annotation to further improve the dataset's quality and reliability.