MEASURING MASSIVE MULTITASK LANGUAGE UNDERSTANDING

12 Jan 2021 | Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, Jacob Steinhardt
A new benchmark for evaluating text models' multitask understanding is introduced, covering 57 subjects across STEM, the humanities, the social sciences, and more. The test assesses both world knowledge and problem-solving ability, with tasks ranging from elementary to professional difficulty. While recent models like GPT-3 achieve accuracy well above random chance, they still fall short of expert-level performance on most tasks. The test also reveals that models are poorly calibrated: they frequently do not know when they are wrong, and their expressed confidence can be off from their actual accuracy by as much as 24 percentage points. Models also struggle with socially important subjects such as law and morality, where accuracy remains close to chance, indicating a need for improvement in these areas. The benchmark spans tasks from mathematics and history to law and ethics, and is designed to evaluate models in zero-shot and few-shot settings. It helps identify models' strengths and weaknesses, providing insights into their capabilities across domains, and is available for researchers to analyze and improve model performance.
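To make the few-shot evaluation protocol concrete, the sketch below shows how per-subject accuracy might be computed on MMLU-style multiple-choice items: a handful of answered dev-set examples are prepended to an unanswered test question, and the model's predicted letter is compared with the gold answer. This is an illustration under stated assumptions, not the authors' released evaluation code: the `question`/`choices`/`answer` record layout and the `ask_model` callable are placeholders.

```python
from typing import List, Dict, Callable

CHOICE_LABELS = ["A", "B", "C", "D"]

def format_example(item: Dict, include_answer: bool = True) -> str:
    """Render one multiple-choice question; the dev examples include the answer, the test question does not."""
    lines = [item["question"]]
    for label, choice in zip(CHOICE_LABELS, item["choices"]):
        lines.append(f"{label}. {choice}")
    lines.append("Answer:" + (f" {CHOICE_LABELS[item['answer']]}" if include_answer else ""))
    return "\n".join(lines)

def build_prompt(dev_examples: List[Dict], test_item: Dict, subject: str) -> str:
    """Prepend k answered dev-set examples (few-shot) before the unanswered test question."""
    header = f"The following are multiple choice questions (with answers) about {subject}.\n\n"
    shots = "\n\n".join(format_example(ex) for ex in dev_examples)
    return header + shots + "\n\n" + format_example(test_item, include_answer=False)

def subject_accuracy(test_set: List[Dict], dev_examples: List[Dict], subject: str,
                     ask_model: Callable[[str], str]) -> float:
    """Fraction of test questions where the model's predicted letter matches the gold answer."""
    correct = 0
    for item in test_set:
        prediction = ask_model(build_prompt(dev_examples, item, subject)).strip()
        correct += prediction.startswith(CHOICE_LABELS[item["answer"]])
    return correct / len(test_set)
```

Zero-shot evaluation is the same procedure with an empty list of dev examples; overall performance is then the average of the per-subject accuracies across all 57 subjects.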