SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems

13 Feb 2020 | Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, Samuel R. Bowman
SuperGLUE is a new benchmark for evaluating general-purpose language understanding systems, designed to provide a more challenging and rigorous test of language understanding than the earlier GLUE benchmark. SuperGLUE comprises eight language understanding tasks, a software toolkit, and a public leaderboard. Its tasks are more difficult than those in GLUE and span a wider variety of formats, including coreference resolution and question answering. SuperGLUE also provides human performance estimates for all tasks, which show that there is still substantial room for improvement. The benchmark is available at super.gluebenchmark.com.

The GLUE benchmark, introduced a year earlier, has been widely used to track progress on language understanding tasks. However, recent systems have surpassed the performance of non-expert humans on GLUE, leaving limited headroom for measuring further progress. In response, SuperGLUE was developed as a more challenging test of language understanding, with harder tasks and a more diverse set of task formats. The benchmark also ships with a software toolkit for pretraining, multi-task learning, and transfer learning in NLP, built around standard tools including PyTorch and AllenNLP.

SuperGLUE is intended as a simple, hard-to-game measure of progress toward general-purpose language understanding technologies for English. Strong performance on it is expected to require substantial innovations in core areas of machine learning, including sample-efficient, transfer, multitask, and unsupervised or self-supervised learning. Tasks were selected for their difficulty and for how well they test a system's ability to understand and reason about English text.

The benchmark comprises eight tasks, each with its own format and evaluation metric: BoolQ, CB, COPA, MultiRC, ReCoRD, RTE, WiC, and WSC. Each task is scored with its designated metric, and the overall SuperGLUE score is the average of the per-task scores. The benchmark also includes a diagnostic dataset that probes models for a broad range of linguistic, commonsense, and world knowledge.
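To make the scoring concrete, the sketch below computes the overall score as an unweighted average of per-task scores. This is a minimal illustration rather than the official scoring code: the score values are hypothetical placeholders, and on the real leaderboard tasks that report two metrics (such as MultiRC and ReCoRD) have those metrics averaged into a single task score before this step.

```python
# Minimal sketch: overall SuperGLUE score as an unweighted average of
# per-task scores. The values below are hypothetical placeholders, not
# real leaderboard results.

task_scores = {
    "BoolQ": 77.4,
    "CB": 83.6,
    "COPA": 70.0,
    "MultiRC": 59.9,   # in practice, the average of its two metrics (F1a and EM)
    "ReCoRD": 72.0,    # in practice, the average of its two metrics (F1 and EM)
    "RTE": 71.6,
    "WiC": 69.5,
    "WSC": 64.3,
}

def superglue_overall(scores):
    """Unweighted mean of per-task scores, as described above."""
    return sum(scores.values()) / len(scores)

print(f"Overall SuperGLUE score: {superglue_overall(task_scores):.1f}")
```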
SuperGLUE, the software toolkit, and the human performance estimates are all available at super.gluebenchmark.com. Together they are designed to provide a rich and challenging testbed for work developing new general-purpose machine learning methods for language understanding.