22 Feb 2019 | Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy & Samuel R. Bowman
The paper introduces the General Language Understanding Evaluation (GLUE) benchmark, a collection of resources for evaluating natural language understanding (NLU) models across a diverse set of tasks. GLUE aims to encourage models that learn general linguistic knowledge and perform well even on tasks with limited training data. The benchmark comprises nine sentence- or sentence-pair NLU tasks, including question answering, sentiment analysis, and textual entailment, along with an online platform for model evaluation, comparison, and analysis. GLUE also features a hand-crafted diagnostic test suite for analyzing models' linguistic capabilities.

Experiments with baselines show that multi-task training performs better than training separate models for each task, but the best model still achieves low absolute performance, indicating the need for improved general NLU systems. The paper discusses related work, task descriptions, evaluation methods, and baseline results, highlighting the benefits of attention mechanisms and transfer learning methods such as ELMo.

The diagnostic dataset is designed to probe model performance on specific linguistic phenomena, and the paper provides a detailed analysis of baseline performance on it. Overall, the paper concludes that while GLUE advances the field, much work remains to develop truly general-purpose NLU models.
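To make the multi-task setup concrete, below is a minimal sketch of the kind of architecture the comparison implies: a shared sentence encoder feeding one lightweight classification head per task. This is illustrative only; the paper's actual baselines use a BiLSTM encoder with optional ELMo embeddings and attention, and the layer sizes, task names, and label counts here are placeholder assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class MultiTaskModel(nn.Module):
    """Shared encoder with one classification head per GLUE task (illustrative sketch).

    Assumptions: a plain BiLSTM encoder with max-pooling, hypothetical
    hyperparameters, and a placeholder task dictionary such as {"sst2": 2, "mnli": 3}.
    """

    def __init__(self, vocab_size, task_num_labels, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # Shared BiLSTM encoder reused across all tasks.
        self.encoder = nn.LSTM(embed_dim, hidden_dim,
                               batch_first=True, bidirectional=True)
        # One small classifier per task; only this part is task-specific.
        self.heads = nn.ModuleDict({
            task: nn.Linear(2 * hidden_dim, n_labels)
            for task, n_labels in task_num_labels.items()
        })

    def forward(self, token_ids, task):
        embedded = self.embedding(token_ids)
        outputs, _ = self.encoder(embedded)
        # Max-pool over time to obtain a fixed-size sentence representation.
        sentence_repr, _ = outputs.max(dim=1)
        return self.heads[task](sentence_repr)

# Multi-task training alternates batches from different tasks through the shared encoder.
model = MultiTaskModel(vocab_size=30000, task_num_labels={"sst2": 2, "mnli": 3})
logits = model(torch.randint(0, 30000, (8, 20)), task="mnli")
```

The design choice this sketch highlights is the one GLUE's baseline experiments test: whether sharing the encoder across tasks transfers useful linguistic knowledge to tasks with little training data, compared with training an entirely separate model per task.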