The paper "Attention is not Explanation" by Sarthak Jain and Byron C. Wallace examines the relationship between attention weights and model outputs in neural networks, particularly in natural language processing (NLP). The authors argue that while attention mechanisms are widely used to improve predictive performance and are often presented as offering transparency, they do not necessarily provide meaningful explanations for model predictions. Through extensive experiments across NLP tasks including text classification, question answering, and natural language inference, the study finds that attention weights frequently correlate only weakly with feature-importance measures derived from gradients or leave-one-out (LOO) ablation. In addition, the authors construct counterfactual attention distributions that differ substantially from the learned attention weights yet yield essentially equivalent predictions. These findings suggest that attention weights do not reliably indicate why a model made a particular prediction and therefore should not be treated as explanations. The paper concludes by cautioning against using attention weights to highlight the input tokens responsible for model outputs and emphasizes the need for more principled attention mechanisms that improve both performance and interpretability.
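To make the two diagnostics concrete, the following is a minimal sketch, not the authors' released code, of how one might (1) measure the rank correlation between attention weights and a gradient-based importance score and (2) feed a permuted (counterfactual) attention distribution back into the model to check whether the prediction changes. The toy BiLSTM classifier, its dimensions, the gradient-times-input heuristic, and the random inputs are all illustrative assumptions, not the paper's exact experimental setup.

```python
# Sketch of the paper's two diagnostics on a toy attention classifier.
# Assumptions: toy model architecture, random inputs, gradient*input as the
# importance heuristic; the paper's own experiments use trained models on
# real datasets and several importance measures.
import torch
import torch.nn as nn
from scipy.stats import kendalltau


class AttnClassifier(nn.Module):
    """Embedding -> BiLSTM -> additive attention -> linear classifier."""

    def __init__(self, vocab_size=1000, emb_dim=64, hid_dim=64, n_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hid_dim, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hid_dim, 1)
        self.out = nn.Linear(2 * hid_dim, n_classes)

    def forward(self, tokens, alpha_override=None):
        e = self.emb(tokens)                       # (B, T, emb_dim)
        h, _ = self.rnn(e)                         # (B, T, 2*hid_dim)
        scores = self.attn(h).squeeze(-1)          # (B, T)
        alpha = torch.softmax(scores, dim=-1)      # learned attention weights
        if alpha_override is not None:             # counterfactual attention
            alpha = alpha_override
        ctx = torch.bmm(alpha.unsqueeze(1), h).squeeze(1)
        return self.out(ctx), alpha, e


torch.manual_seed(0)
model = AttnClassifier()
tokens = torch.randint(0, 1000, (1, 20))           # one toy 20-token "document"

# (1) Kendall tau between attention weights and gradient-based importance.
logits, alpha, emb = model(tokens)
emb.retain_grad()                                  # keep gradients on the embeddings
pred = logits[0].argmax().item()
logits[0, pred].backward()                         # gradient of the predicted logit
grad_importance = (emb.grad * emb).sum(-1).abs().squeeze(0)   # gradient*input heuristic
tau, _ = kendalltau(alpha.detach().squeeze(0).numpy(),
                    grad_importance.detach().numpy())
print(f"Kendall tau (attention vs. gradient importance): {tau:.3f}")

# (2) Counterfactual attention: permute the weights and compare predictions.
perm = torch.randperm(tokens.size(1))
with torch.no_grad():
    logits_perm, _, _ = model(tokens, alpha_override=alpha.detach()[:, perm])
    p = torch.softmax(logits, dim=-1)
    q = torch.softmax(logits_perm, dim=-1)
    tvd = 0.5 * (p - q).abs().sum().item()         # total variation distance
print(f"TVD between original and permuted-attention predictions: {tvd:.3f}")
```

In the paper's framing, a low rank correlation in step (1) and a small output change in step (2), observed systematically across trained models and datasets, are what motivate the conclusion that attention weights should not be read as explanations; the sketch above only shows where those quantities come from.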