SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems

13 Feb 2020 | Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, Samuel R. Bowman
SuperGLUE is a new benchmark for evaluating general-purpose language understanding systems, designed to address the limitations of the GLUE benchmark. GLUE, introduced a little over a year earlier, has become a prominent evaluation framework for research in natural language processing (NLP), offering a single-number metric that summarizes progress on a diverse set of tasks. However, model performance on GLUE has recently surpassed the level of non-expert humans, leaving limited headroom for further research on that benchmark.

SuperGLUE aims to provide a more rigorous test of language understanding through more challenging tasks, more diverse task formats, comprehensive human baselines, improved code support, and refined usage rules. The benchmark consists of eight tasks: BoolQ, CB, COPA, MultiRC, ReCoRD, RTE, WiC, and WSC, with an emphasis on diverse task formats and tasks with limited training data. Baselines based on BERT show substantial gains over simpler baselines but still trail human performance by nearly 20 points. SuperGLUE is available at super.gluebenchmark.com and is expected to drive further innovation in multi-task, transfer, and unsupervised/self-supervised learning techniques.
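For readers who want to inspect the task data directly, here is a minimal sketch (not the authors' released toolkit) that assumes the benchmark is mirrored on the Hugging Face Hub under the dataset name "super_glue", with each task available as a configuration:

```python
# Minimal sketch: loading one SuperGLUE task (BoolQ) via the Hugging Face
# `datasets` library. This is an illustration, not the paper's official code;
# the authors' toolkit and data are linked from super.gluebenchmark.com.
from datasets import load_dataset

# Each SuperGLUE task (boolq, cb, copa, multirc, record, rte, wic, wsc)
# is exposed as a configuration of the "super_glue" dataset.
boolq = load_dataset("super_glue", "boolq")

# BoolQ pairs a passage with a yes/no question; labels are 0 (no) or 1 (yes).
example = boolq["train"][0]
print(example["passage"][:100])
print(example["question"], "->", example["label"])
```

The other seven tasks follow the same pattern but differ in format (e.g., COPA is a two-choice causal reasoning task, ReCoRD is cloze-style reading comprehension), which is part of what makes the benchmark's task formats more diverse than GLUE's.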