12 Jan 2021 | Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, Jacob Steinhardt
The paper introduces a new benchmark, the Massive Multitask Language Understanding (MMLU) test, designed to measure the multitask accuracy of text models. The test consists of multiple-choice questions spanning 57 diverse tasks, including mathematics, history, computer science, law, and more, ranging in difficulty from elementary to advanced professional level, with the goal of assessing models' world knowledge and problem-solving ability. While recent models like GPT-3 improve over random chance by almost 20 percentage points on average, they remain far from expert-level accuracy on all tasks. Performance is also lopsided: models score near-randomly on some subjects, and they struggle with calculation-heavy tasks and with socially important subjects like morality and law. By exposing these blind spots, the MMLU test gives researchers a comprehensive evaluation framework for analyzing and improving model capabilities.
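To make the evaluation protocol concrete, below is a minimal sketch of MMLU-style multiple-choice scoring in Python. The `model_predict` function is a hypothetical stand-in for a language model, and the two sample questions are illustrative toys, not items from the benchmark; the A-D prompt layout and per-subject accuracy reporting follow the general shape described in the paper, but this is not the authors' released evaluation harness.

```python
from collections import defaultdict

# Each MMLU-style item is a multiple-choice question with four options
# (A-D) and a gold answer; `subject` names one of the benchmark's tasks.
# These two items are made-up examples for illustration only.
QUESTIONS = [
    {
        "subject": "elementary_mathematics",
        "question": "What is 7 * 8?",
        "choices": ["54", "56", "64", "48"],
        "answer": "B",
    },
    {
        "subject": "us_history",
        "question": "Who was the first President of the United States?",
        "choices": ["John Adams", "Thomas Jefferson",
                    "George Washington", "James Madison"],
        "answer": "C",
    },
]

def format_prompt(item):
    """Render a question in the standard A/B/C/D multiple-choice layout."""
    lines = [item["question"]]
    for letter, choice in zip("ABCD", item["choices"]):
        lines.append(f"{letter}. {choice}")
    lines.append("Answer:")
    return "\n".join(lines)

def model_predict(prompt):
    """Hypothetical model stub: always guesses 'A'.
    A real run would query a language model and parse its answer letter."""
    return "A"

def evaluate(items):
    """Score predictions and report per-subject and macro-average accuracy."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        pred = model_predict(format_prompt(item))
        total[item["subject"]] += 1
        if pred == item["answer"]:
            correct[item["subject"]] += 1
    per_subject = {s: correct[s] / total[s] for s in total}
    macro = sum(per_subject.values()) / len(per_subject)
    return per_subject, macro

if __name__ == "__main__":
    per_subject, macro = evaluate(QUESTIONS)
    for subject, acc in per_subject.items():
        print(f"{subject}: {acc:.2%}")
    print(f"macro-average accuracy: {macro:.2%}")
```

In the paper's actual setup, models are also evaluated few-shot: up to five worked examples from a development set are prepended to the prompt before the test question, which this sketch omits for brevity.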