This paper challenges the claim that "Attention is not Explanation" (Jain and Wallace, 2019). The authors argue that this claim depends on one's definition of explanation, and that testing it rigorously requires taking all elements of the model into account. They propose four alternative tests for determining when and whether attention can be used as explanation: a simple uniform-weights baseline; a variance calibration based on multiple runs with different random seeds; a diagnostic framework using frozen weights from pretrained models; and an end-to-end adversarial attention training protocol. Together, these tests allow for meaningful interpretation of attention mechanisms in RNN models. The authors show that even when reliable adversarial distributions can be found, they do not perform well on the simple diagnostic, indicating that prior work does not disprove the usefulness of attention mechanisms for explainability.
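To make the first of these tests concrete, here is a minimal sketch of the uniform-weights baseline, assuming a generic attention-based RNN classifier; the encoder, classifier, and tensor shapes below are illustrative stand-ins, not the authors' code. The learned attention distribution is replaced by a uniform one over the real (non-padded) tokens, and the resulting accuracy is compared with that of the fully trained model.

```python
import torch

def uniform_attention_logits(encoder, classifier, tokens, mask):
    """Classify with uniform attention weights instead of trained ones.

    encoder / classifier: stand-ins for a pretrained RNN encoder and output
    layer; tokens: LongTensor [batch, seq_len]; mask: FloatTensor
    [batch, seq_len] with 1.0 for real tokens and 0.0 for padding.
    """
    hidden = encoder(tokens)                           # [batch, seq_len, dim]
    weights = mask / mask.sum(dim=1, keepdim=True)     # uniform over real tokens
    context = torch.bmm(weights.unsqueeze(1), hidden)  # [batch, 1, dim]
    return classifier(context.squeeze(1))              # class logits
```

If this baseline scores close to the fully trained model, attention is doing little useful work on that dataset, and debating its explanatory value there is largely moot.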
The authors argue that Jain and Wallace's counterfactual attention-weight experiments do not advance their thesis, for two reasons: an attention distribution is not a primitive that can be manipulated independently of the rest of the model, and the existence of alternative distributions yielding the same prediction does not entail that the learned one is not explanatory ("existence does not entail exclusivity"). They instead take a model-driven approach to examining the properties of attention distributions and to constructing alternatives. As a first step, they test whether the divergences Jain and Wallace observed between trained attention scores and adversarially-obtained ones are unusual, comparing them against the divergence that arises from simply retraining the model with different random seeds. They find that much of the observed variation falls within this expected range, and conclude that these results do not support the claim that attention is not explanation.
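The seed-based test can be read as a calibration step: attention distributions from models retrained under different random seeds establish how much divergence arises by chance, and the adversarially obtained distributions are judged against that baseline. A hedged sketch of the comparison, using Jensen-Shannon divergence as an illustrative distance (the exact measures and aggregation are those of the paper):

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def mean_js_divergence(attn_a, attn_b):
    """Average per-instance JS divergence between two sets of attention
    distributions (each a list of 1-D arrays that sum to 1)."""
    # scipy's jensenshannon returns the JS *distance*; square it for divergence.
    return float(np.mean([jensenshannon(a, b) ** 2
                          for a, b in zip(attn_a, attn_b)]))

# base: attention maps from the ordinarily trained model.
# seed_runs: maps from retraining with different random seeds.
# adversarial: maps from an adversarially obtained model.
# If mean_js_divergence(base, adversarial) is not much larger than the
# divergences between base and the seed_runs, the adversarial maps are no
# more unusual than ordinary training variation.
```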
The authors also introduce a model-consistent training protocol for finding adversarial attention weights, correcting flaws they identify in the earlier per-instance approach. They train an adversarial model with a modified loss function that rewards distance from an ordinarily-trained base model's attention scores while keeping predictions close to the base model's, so that adversarial attention distributions are learned as model parameters rather than set by hand. They find that while plausibly adversarial distributions of this consistent kind can indeed be found for the binary classification datasets in question, they are not as extreme as those produced by the model-inconsistent manipulation, as illustrated by an example from the IMDB task. Furthermore, the resulting attention distributions do not fare well in the diagnostic MLP, calling into question the extent to which they can be treated as equally powerful for explainability.
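The modified objective can be sketched as follows, reconstructed from the description above: TVD stands in for the prediction-distance term, KL for the attention-distance term, and λ is a tradeoff hyperparameter. The adversarial model is trained to keep its predictions close to the base model's while pushing its attention distributions away from the base model's:

```latex
\mathcal{L}(\mathcal{M}_{\mathrm{adv}}, \mathcal{M}_{\mathrm{base}})
  = \mathrm{TVD}\big(\hat{y}_{\mathrm{adv}}(x),\, \hat{y}_{\mathrm{base}}(x)\big)
  - \lambda \,\mathrm{KL}\big(\alpha_{\mathrm{adv}}(x) \,\|\, \alpha_{\mathrm{base}}(x)\big)
```

Because this loss is minimized over the whole training set rather than by manipulating weights one instance at a time, the resulting adversarial distributions are consistent with an actual set of model parameters.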
The authors provide a theoretical discussion on the definitions of interpretability and explainability, grounding their findings within the accepted definitions of these concepts. They argue that the claim that attention is not explanation depends on the definition of explanation. They conclude that attention mechanisms can provide meaningful model-agnostic interpretations of tokens in an instance. They also argue that the existence of multiple different explanations is not necessarily indicative of the quality of a single one. The authors believe that the conditions under which adversarial distributions can actually be found in practice are an important direction for future work.