16 Apr 2018 | Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R. Bowman, Noah A. Smith
The paper "Annotation Artifacts in Natural Language Inference Data" by Suchin Gururangan et al. examines the impact of annotation artifacts on natural language inference (NLI) datasets, specifically SNLI and MultiNLI. The authors find that a significant portion of the data can be classified correctly based on the hypothesis alone, without observing the premise. This is achieved through specific linguistic phenomena such as negation, vagueness, and gender-neutral references. The study reveals that these artifacts are a result of the annotation strategies and heuristics used by crowd workers. The authors also re-evaluate high-performing NLI models on a subset of examples that their hypothesis-only classifier failed to classify correctly, finding that these models perform much worse on this "hard" subset. This suggests that the success of current NLI models may be overestimated, and the task remains challenging. The paper concludes by discussing the broader implications of annotation artifacts in NLP datasets and the need for more robust evaluation methods.The paper "Annotation Artifacts in Natural Language Inference Data" by Suchin Gururangan et al. examines the impact of annotation artifacts on natural language inference (NLI) datasets, specifically SNLI and MultiNLI. The authors find that a significant portion of the data can be classified correctly based on the hypothesis alone, without observing the premise. This is achieved through specific linguistic phenomena such as negation, vagueness, and gender-neutral references. The study reveals that these artifacts are a result of the annotation strategies and heuristics used by crowd workers. The authors also re-evaluate high-performing NLI models on a subset of examples that their hypothesis-only classifier failed to classify correctly, finding that these models perform much worse on this "hard" subset. This suggests that the success of current NLI models may be overestimated, and the task remains challenging. The paper concludes by discussing the broader implications of annotation artifacts in NLP datasets and the need for more robust evaluation methods.