Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference

24 Jun 2019 | R. Thomas McCoy, Ellie Pavlick, & Tal Linzen
This paper investigates the use of fallible syntactic heuristics by natural language inference (NLI) models. The authors hypothesize that statistical NLI models may rely on three such heuristics: the lexical overlap heuristic (assume a premise entails any hypothesis built only from words that appear in the premise), the subsequence heuristic (assume a premise entails any contiguous subsequence of it), and the constituent heuristic (assume a premise entails any complete constituent of its parse tree). None of these is a valid inference strategy, yet models trained on standard NLI data often appear to adopt them; a code sketch at the end of this summary makes the three heuristics concrete.

To diagnose this behavior, the authors introduce the HANS dataset, which is built from examples on which the heuristics fail: sentence pairs that the heuristics would label as entailment but whose correct label is non-entailment. The paper evaluates four NLI models (DA, ESIM, SPINN, and BERT) on HANS and finds that all of them perform very poorly on these cases, indicating that they rely on the heuristics rather than on correct rules of inference.

The authors also show that augmenting the training data with HANS-like examples improves performance on HANS, which suggests that the models are picking up biases in the training data rather than learning the underlying structure of the language. The study highlights the limitations of current NLI models, argues that there is still significant room for improvement, and presents targeted evaluation sets like HANS as essential tools for measuring progress and for assessing whether models are learning what they are intended to learn.
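To make the three heuristics concrete, the sketch below expresses each one as a simple predicate over a tokenized premise/hypothesis pair and applies it to a HANS-style case where it fires even though the correct label is non-entailment. This is purely illustrative: the function names, the hand-written constituent list, and the example sentences are assumptions made here for exposition, not code or data released with the paper, and a full implementation would derive constituents from a parser rather than listing them by hand.

```python
# Illustrative sketch of the three heuristics as simple predicates.
# Tokenization is naive whitespace splitting; constituents are supplied by hand.

def lexical_overlap(premise, hypothesis):
    """Predict entailment whenever every hypothesis word also occurs in the premise."""
    return set(hypothesis) <= set(premise)

def subsequence(premise, hypothesis):
    """Predict entailment whenever the hypothesis is a contiguous subsequence of the premise."""
    m = len(hypothesis)
    return any(premise[i:i + m] == hypothesis for i in range(len(premise) - m + 1))

def constituent(premise_constituents, hypothesis):
    """Predict entailment whenever the hypothesis is a complete constituent of the premise's parse."""
    return hypothesis in premise_constituents

# HANS-style counterexamples (sentences invented here for illustration):
# each heuristic predicts entailment, but the correct label is non-entailment.

# Lexical overlap: a passive premise whose roles are reversed in the hypothesis.
p1 = "the doctor was paid by the actor".split()
h1 = "the doctor paid the actor".split()
assert lexical_overlap(p1, h1)            # fires, but gold label is non-entailment

# Subsequence: the hypothesis is a contiguous substring with the wrong subject.
p2 = "the actor near the judge danced".split()
h2 = "the judge danced".split()
assert subsequence(p2, h2)                # fires, but gold label is non-entailment

# Constituent: the hypothesis is an embedded clause under a conditional.
# Premise: "if the actor danced, the doctor laughed"
p3_constituents = [
    "the actor danced".split(),
    "the doctor laughed".split(),
]
h3 = "the actor danced".split()
assert constituent(p3_constituents, h3)   # fires, but gold label is non-entailment
```

All three assertions pass, so a model that has internalized these shortcuts would label each of these pairs as entailment; exposing exactly this failure mode is what HANS is designed to do.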