6 May 2020 | Yixin Nie*, Adina Williams†, Emily Dinan†, Mohit Bansal*, Jason Weston†, Douwe Kiela†
The paper introduces a new large-scale Natural Language Inference (NLI) benchmark dataset, Adversarial NLI (ANLI), collected through an iterative, adversarial human-and-model-in-the-loop procedure. In each round, human annotators write examples that the current best model misclassifies; these examples are verified by other annotators and then folded into the training data, and the process repeats, producing a dynamic and increasingly challenging dataset. The authors show that training on this data yields state-of-the-art performance on popular NLI benchmarks while also exposing the limitations of current models. Because the collection procedure can be rerun against stronger models, ANLI is designed to remain challenging longer than static benchmarks, which tend to saturate quickly. The paper also includes a detailed analysis of the collected data, categorizing examples by inference type and demonstrating the effectiveness of adversarial training. The results indicate that the dataset is more challenging than existing benchmarks and that training on adversarial data improves model robustness. The ANLI benchmark opens new opportunities for research, and the collection procedure can be applied to other classification tasks.
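The collection procedure is essentially a loop: train the current model, have annotators write examples it gets wrong, verify those examples with other annotators, and add them to the training pool for the next round. Below is a simplified sketch of that loop; the callables (`train_fn`, `write_fn`, `verify_fn`) and the data layout are illustrative assumptions, not the authors' code, and the sketch ignores details such as how verified examples are split between train, dev, and test.

```python
from typing import Callable, List, Tuple

# (context, hypothesis, gold_label) -- hypothetical representation of one NLI example
Example = Tuple[str, str, str]

def collect_adversarial_rounds(
    seed_data: List[Example],
    train_fn: Callable[[List[Example]], Callable[[Example], str]],  # returns a predict function
    write_fn: Callable[[Callable[[Example], str]], List[Example]],  # annotators write candidates
    verify_fn: Callable[[List[Example]], List[Example]],            # other annotators confirm labels
    num_rounds: int = 3,
) -> List[List[Example]]:
    """Run several rounds of adversarial collection, growing the training pool each round."""
    train_data = list(seed_data)   # e.g. start from existing NLI corpora
    rounds: List[List[Example]] = []
    for _ in range(num_rounds):
        model = train_fn(train_data)            # train/fine-tune on all data collected so far
        candidates = write_fn(model)            # humans try to write examples that fool the model
        fooling = [ex for ex in candidates
                   if model(ex) != ex[2]]       # keep only examples the model misclassifies
        verified = verify_fn(fooling)           # keep only examples whose labels humans confirm
        rounds.append(verified)
        train_data.extend(verified)             # next round's model faces harder data
    return rounds
```

The key design point the sketch captures is that each round's model is trained on the previous rounds' adversarial examples, so later rounds must target progressively harder-to-fool models.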