Adversarial NLI: A New Benchmark for Natural Language Understanding

6 May 2020 | Yixin Nie*, Adina Williams†, Emily Dinan†, Mohit Bansal*, Jason Weston†, Douwe Kiela†
The paper introduces ANLI, a new large-scale NLI benchmark collected through an adversarial human-and-model-in-the-loop process. The dataset is designed to be harder than existing benchmarks and to evolve continuously, making it a moving target for NLU systems rather than a static one. Data were collected in three rounds, each progressively more difficult and more diverse in context types. In each round (sketched in code after this summary), human annotators write examples that the current model misclassifies; the verified, model-fooling examples are added to the training data and used to train a stronger model for the next round. This iterative process keeps the benchmark challenging over time.

ANLI is used to evaluate several NLI models, including BERT, RoBERTa, and XLNet. RoBERTa achieves state-of-the-art performance on several existing benchmarks, while hypothesis-only models perform poorly on ANLI, indicating that the new data are hard to solve through annotation artifacts alone. A detailed analysis of the collected examples reveals the types of inference that models struggle with, such as numerical reasoning, reference and names, and tricky inferences. Models trained on ANLI are also evaluated on existing stress tests, which probe robustness to challenging examples, and perform well on them.

ANLI is designed to address the limitations of previous benchmarks, which often fail to capture the complexity of real-world NLI tasks. By continuously evolving, the dataset provides a more realistic challenge for models and helps identify and mitigate biases in model performance. The paper also argues that dynamic datasets matter for NLU more broadly, since static benchmarks quickly become obsolete.
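The round structure can be made concrete with a short sketch. The code below is illustrative only: train_model, collect_examples, and verify_label stand in for the authors' training pipeline, annotation interface, and label verification, and are passed in as callables rather than defined here.

```python
# Illustrative sketch of adversarial human-and-model-in-the-loop collection.
# The callables train_model, collect_examples, and verify_label are hypothetical
# placeholders for the training pipeline, annotation UI, and label verification.

def adversarial_collection(base_train_data, train_model, collect_examples,
                           verify_label, num_rounds=3):
    train_data = list(base_train_data)   # e.g. existing (premise, hypothesis, label) triples
    rounds = []

    for round_idx in range(1, num_rounds + 1):
        # 1. Train the current strongest NLI model on all data collected so far.
        model = train_model(train_data)

        # 2. Annotators write hypotheses for given contexts, keeping only the
        #    examples whose gold label the model gets wrong (model-fooling examples).
        candidates = collect_examples(model, round_idx)

        # 3. Independent annotators verify the gold labels of the fooling examples.
        verified = [ex for ex in candidates if verify_label(ex)]

        # 4. The verified examples form this round's dataset and feed the next round.
        rounds.append(verified)
        train_data.extend(verified)

    return rounds   # three progressively harder rounds, as in ANLI
```

Passing the model, annotation, and verification steps in as callables keeps the sketch self-contained while making clear that each step hides substantial engineering in the actual collection effort.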
The ANLI dataset is available for further research and analysis, offering a valuable resource for improving NLU systems.
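For hands-on use, one convenient route is the Hugging Face datasets library, which hosts a mirror of the corpus. The snippet below assumes the dataset name "anli" and per-round split names such as train_r1 and test_r3; consult the official release at github.com/facebookresearch/anli if these assumptions do not match your environment.

```python
# Minimal sketch of loading ANLI with the Hugging Face `datasets` library.
# Assumes the public mirror is named "anli" and exposes per-round splits
# (train_r1/dev_r1/test_r1 ... train_r3/dev_r3/test_r3); verify before relying on it.
from datasets import load_dataset

anli = load_dataset("anli")

# Each round is a separate split; later rounds were collected against stronger models.
for split in ("test_r1", "test_r2", "test_r3"):
    print(split, len(anli[split]), "examples")

example = anli["train_r1"][0]
print(example["premise"])
print(example["hypothesis"])
print(example["label"])   # integer class id (entailment / neutral / contradiction)
```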