This paper proposes an attention-guided visualization method for the Vision Transformer (ViT) that provides high-level semantic explanations for its decisions. The method selectively aggregates the gradients propagated directly from the classification output to each self-attention layer, collecting the contribution of the image features extracted at each location of the input image. These gradients are guided by normalized self-attention scores, i.e., pairwise patch-correlation scores, to efficiently capture patch-level context information. Using only class labels, the method produces detailed high-level semantic explanations with strong localization performance. It outperforms previous ViT explainability methods on weakly-supervised localization tasks and captures full instances of the target-class object. Perturbation comparison tests further show that the visualization faithfully explains the model.
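As a rough illustration of the gradient collection described above, the sketch below captures every block's self-attention map together with the gradient that flows back into it from the target class score. It assumes a timm-style ViT, where each block's attention probabilities pass through an `attn_drop` module; it is a minimal sketch, not the authors' implementation.

```python
# Minimal sketch (PyTorch/timm): capture each block's self-attention map and
# the gradient flowing back into it from the class score.
import torch
import timm

model = timm.create_model("vit_base_patch16_224", pretrained=True).eval()

attn_maps, attn_grads = [], []

def save_attn(module, inputs, output):
    # output: (B, heads, tokens, tokens) attention probabilities
    attn_maps.append(output)
    output.register_hook(lambda g: attn_grads.append(g))

# In timm's ViT, attn_drop is applied to the softmaxed attention matrix,
# so hooking it exposes the attention scores of every block.
hooks = [blk.attn.attn_drop.register_forward_hook(save_attn)
         for blk in model.blocks]

x = torch.randn(1, 3, 224, 224)        # stand-in for a preprocessed image
logits = model(x)
logits[0, logits.argmax()].backward()  # gradient of the predicted class score

for h in hooks:
    h.remove()
```

Note that gradients collected through backward hooks arrive in reverse layer order, so `attn_grads` may need to be reversed before being paired with `attn_maps`.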
ViT, a transformer-based model adapted to images, uses self-attention to achieve high performance on various vision tasks, but it lacks explainability, making it difficult to ensure the reliability of the model. Gradient-based explanation methods developed for CNNs can provide faithful explanations, yet ViT's distinctive structure, including the [class] token and self-attention, makes it challenging to apply them directly. Attention Rollout and LRP-based methods have been developed for ViT, but they suffer from issues such as peak-intensity amplification, which leads to poor localization performance.
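For reference, Attention Rollout, one of the baselines mentioned above, composes the per-layer attention maps while accounting for residual connections. A minimal sketch follows (head averaging and row normalization follow the original formulation; the attention tensors are assumed to be in forward layer order):

```python
# Hedged sketch of Attention Rollout (Abnar & Zuidema, 2020): average heads,
# add the identity for the residual path, re-normalize rows, and multiply
# the resulting matrices across layers.
import torch

def attention_rollout(attn_maps):
    """attn_maps: list of (B, heads, tokens, tokens) attention tensors."""
    rollout = None
    for attn in attn_maps:
        a = attn.mean(dim=1)                            # fuse heads
        a = a + torch.eye(a.size(-1), device=a.device)  # residual connection
        a = a / a.sum(dim=-1, keepdim=True)             # row-normalize
        rollout = a if rollout is None else rollout @ a # compose layers
    # relevance of each patch token to the [class] token (token 0)
    return rollout[:, 0, 1:]
```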
The proposed method addresses these issues by applying a sigmoid normalization to the self-attention scores, which reduces peak-intensity effects. It combines the essential target gradients with the feedforward features of the self-attention module to improve localization. A class activation map (CAM) is generated by aggregating the gradients that are connected to the MLP head and backpropagated along the skip connections. The self-attention scores, which represent patch-correlation scores, serve as feature maps that guide the gradients toward the pattern information of the image. The result is a high-level semantic explanation with strong localization performance that outperforms previous methods on weakly-supervised localization tasks.
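The aggregation step could look roughly like the sketch below, which sigmoid-normalizes each layer's attention, gates it with the positive class gradients, and accumulates the [class]-token row into a patch-level map. The function signature and the simple layer-wise sum are illustrative assumptions, not the authors' exact formulation.

```python
# Illustrative sketch of attention-guided gradient aggregation: damp attention
# peaks with a sigmoid, keep class-positive gradient signal, and aggregate the
# [class]-token row of every layer into a patch-level CAM.
import torch
import torch.nn.functional as F

def attention_guided_cam(attn_maps, attn_grads, grid=14):
    # attn_grads is assumed aligned layer-by-layer with attn_maps
    # (reverse the backward-hook list if it was collected during backward).
    cam = None
    for attn, grad in zip(attn_maps, attn_grads):
        norm_attn = torch.sigmoid(attn)           # reduce peak intensities
        guided = F.relu(grad) * norm_attn         # gradient-guided attention
        layer_map = guided.mean(dim=1)[:, 0, 1:]  # [class]-token row, heads fused
        cam = layer_map if cam is None else cam + layer_map
    cam = cam.reshape(-1, grid, grid)             # 14x14 patch grid for 224/16
    cam = (cam - cam.amin()) / (cam.amax() - cam.amin() + 1e-8)
    return cam                                    # upsample to image size as needed
```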
The method is evaluated on the ImageNet ILSVRC 2012, Pascal VOC 2012, and CUB-200 datasets, where it achieves high pixel accuracy, IoU, Dice coefficient, and ABPC scores. In pixel-perturbation tests, it yields more faithful and reliable explanations than previous methods. By providing reliable model explanations and weakly-supervised object detection, the method makes ViT more adaptable to the many tasks in computer vision that involve object localization.
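For concreteness, the localization metrics reported above can be computed from a thresholded explanation map and a ground-truth mask as in the sketch below; the 0.5 threshold and per-image computation are common-practice assumptions rather than the paper's exact protocol.

```python
# Illustrative computation of pixel accuracy, IoU, and Dice coefficient
# between a binarized explanation map and a ground-truth mask of equal size.
import torch

def localization_metrics(cam, gt_mask, thresh=0.5):
    pred = (cam >= thresh).float()
    gt = gt_mask.float()
    tp = (pred * gt).sum()
    pixel_acc = (pred == gt).float().mean()
    iou = tp / (pred.sum() + gt.sum() - tp + 1e-8)
    dice = 2 * tp / (pred.sum() + gt.sum() + 1e-8)
    return pixel_acc.item(), iou.item(), dice.item()
```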