Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference

24 Jun 2019 | R. Thomas McCoy, Ellie Pavlick, & Tal Linzen
This paper investigates the use of fallible syntactic heuristics by natural language inference (NLI) models. The authors hypothesize that statistical NLI models may rely on three such heuristics: the lexical overlap heuristic (assume a premise entails any hypothesis built only from words that appear in the premise), the subsequence heuristic (assume a premise entails any contiguous subsequence of it), and the constituent heuristic (assume a premise entails any complete constituent of its parse tree). None of these is a valid inference strategy, yet models trained on standard NLI data often appear to adopt them; a code sketch at the end of this summary makes the three heuristics concrete.

To diagnose this behavior, the authors introduce the HANS dataset, which is built from examples on which the heuristics fail: sentence pairs that the heuristics would label as entailment but whose correct label is non-entailment. The paper evaluates four NLI models (DA, ESIM, SPINN, and BERT) on HANS and finds that all of them perform very poorly on these cases, indicating that they rely on the heuristics rather than on correct rules of inference.

The authors also show that augmenting the training data with HANS-like examples improves performance on HANS, which suggests that the models are picking up biases in the training data rather than learning the underlying structure of the language. The study highlights the limitations of current NLI models, argues that there is still significant room for improvement, and presents targeted evaluation sets like HANS as essential tools for measuring progress and for assessing whether models are learning what they are intended to learn.
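To make the three heuristics concrete, the sketch below expresses each one as a simple predicate over a tokenized premise/hypothesis pair and applies it to a HANS-style case where it fires even though the correct label is non-entailment. This is purely illustrative: the function names, the hand-written constituent list, and the example sentences are assumptions made here for exposition, not code or data released with the paper, and a full implementation would derive constituents from a parser rather than listing them by hand.

```python
# Illustrative sketch of the three heuristics as simple predicates.
# Tokenization is naive whitespace splitting; constituents are supplied by hand.

def lexical_overlap(premise, hypothesis):
    """Predict entailment whenever every hypothesis word also occurs in the premise."""
    return set(hypothesis) <= set(premise)

def subsequence(premise, hypothesis):
    """Predict entailment whenever the hypothesis is a contiguous subsequence of the premise."""
    m = len(hypothesis)
    return any(premise[i:i + m] == hypothesis for i in range(len(premise) - m + 1))

def constituent(premise_constituents, hypothesis):
    """Predict entailment whenever the hypothesis is a complete constituent of the premise's parse."""
    return hypothesis in premise_constituents

# HANS-style counterexamples (sentences invented here for illustration):
# each heuristic predicts entailment, but the correct label is non-entailment.

# Lexical overlap: a passive premise whose roles are reversed in the hypothesis.
p1 = "the doctor was paid by the actor".split()
h1 = "the doctor paid the actor".split()
assert lexical_overlap(p1, h1)            # fires, but gold label is non-entailment

# Subsequence: the hypothesis is a contiguous substring with the wrong subject.
p2 = "the actor near the judge danced".split()
h2 = "the judge danced".split()
assert subsequence(p2, h2)                # fires, but gold label is non-entailment

# Constituent: the hypothesis is an embedded clause under a conditional.
# Premise: "if the actor danced, the doctor laughed"
p3_constituents = [
    "the actor danced".split(),
    "the doctor laughed".split(),
]
h3 = "the actor danced".split()
assert constituent(p3_constituents, h3)   # fires, but gold label is non-entailment
```

All three assertions pass, so a model that has internalized these shortcuts would label each of these pairs as entailment; exposing exactly this failure mode is what HANS is designed to do.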