This paper proposes an attention-guided visualization method for the Vision Transformer (ViT) that provides high-level semantic explanations for its decisions. The method selectively aggregates the gradients propagated directly from the classification output to each self-attention layer, collecting the contribution of the image features extracted at each location of the input image. These gradients are guided by normalized self-attention scores, i.e., pairwise patch-correlation scores, to efficiently capture patch-level context information. Using only class labels, the method produces detailed high-level semantic explanations with strong localization performance. It outperforms previous ViT explainability methods on weakly-supervised localization tasks and captures full instances of the target-class object. Perturbation comparison tests further show that the visualization faithfully explains the model.
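As a rough illustration of the gradient collection described above, the sketch below captures every block's self-attention map together with the gradient that flows back into it from the target class score. It assumes a timm-style ViT, where each block's attention probabilities pass through an `attn_drop` module; it is a minimal sketch, not the authors' implementation.

```python
# Minimal sketch (PyTorch/timm): capture each block's self-attention map and
# the gradient flowing back into it from the class score.
import torch
import timm

model = timm.create_model("vit_base_patch16_224", pretrained=True).eval()

attn_maps, attn_grads = [], []

def save_attn(module, inputs, output):
    # output: (B, heads, tokens, tokens) attention probabilities
    attn_maps.append(output)
    output.register_hook(lambda g: attn_grads.append(g))

# In timm's ViT, attn_drop is applied to the softmaxed attention matrix,
# so hooking it exposes the attention scores of every block.
hooks = [blk.attn.attn_drop.register_forward_hook(save_attn)
         for blk in model.blocks]

x = torch.randn(1, 3, 224, 224)        # stand-in for a preprocessed image
logits = model(x)
logits[0, logits.argmax()].backward()  # gradient of the predicted class score

for h in hooks:
    h.remove()
```

Note that gradients collected through backward hooks arrive in reverse layer order, so `attn_grads` may need to be reversed before being paired with `attn_maps`.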
ViT, a transformer-based model adapted to images, uses self-attention to achieve high performance on various vision tasks, but it lacks explainability, making it difficult to ensure the reliability of the model. Gradient-based explanation methods developed for CNNs can provide faithful explanations, yet ViT's distinctive structure, including the [class] token and self-attention, makes it challenging to apply them directly. Attention Rollout and LRP-based methods have been developed for ViT, but they suffer from issues such as peak-intensity amplification, which leads to poor localization performance.
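For reference, Attention Rollout, one of the baselines mentioned above, composes the per-layer attention maps while accounting for residual connections. A minimal sketch follows (head averaging and row normalization follow the original formulation; the attention tensors are assumed to be in forward layer order):

```python
# Hedged sketch of Attention Rollout (Abnar & Zuidema, 2020): average heads,
# add the identity for the residual path, re-normalize rows, and multiply
# the resulting matrices across layers.
import torch

def attention_rollout(attn_maps):
    """attn_maps: list of (B, heads, tokens, tokens) attention tensors."""
    rollout = None
    for attn in attn_maps:
        a = attn.mean(dim=1)                            # fuse heads
        a = a + torch.eye(a.size(-1), device=a.device)  # residual connection
        a = a / a.sum(dim=-1, keepdim=True)             # row-normalize
        rollout = a if rollout is None else rollout @ a # compose layers
    # relevance of each patch token to the [class] token (token 0)
    return rollout[:, 0, 1:]
```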
The proposed method addresses these issues by applying a sigmoid normalization to the self-attention scores, which reduces peak-intensity effects. It combines the essential target gradients with the feedforward features of the self-attention module to improve localization. A class activation map (CAM) is generated by aggregating the gradients that are connected to the MLP head and backpropagated along the skip connections. The self-attention scores, which represent patch-correlation scores, serve as feature maps that guide the gradients toward the pattern information of the image. The result is a high-level semantic explanation with strong localization performance that outperforms previous methods on weakly-supervised localization tasks.
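The aggregation step could look roughly like the sketch below, which sigmoid-normalizes each layer's attention, gates it with the positive class gradients, and accumulates the [class]-token row into a patch-level map. The function signature and the simple layer-wise sum are illustrative assumptions, not the authors' exact formulation.

```python
# Illustrative sketch of attention-guided gradient aggregation: damp attention
# peaks with a sigmoid, keep class-positive gradient signal, and aggregate the
# [class]-token row of every layer into a patch-level CAM.
import torch
import torch.nn.functional as F

def attention_guided_cam(attn_maps, attn_grads, grid=14):
    # attn_grads is assumed aligned layer-by-layer with attn_maps
    # (reverse the backward-hook list if it was collected during backward).
    cam = None
    for attn, grad in zip(attn_maps, attn_grads):
        norm_attn = torch.sigmoid(attn)           # reduce peak intensities
        guided = F.relu(grad) * norm_attn         # gradient-guided attention
        layer_map = guided.mean(dim=1)[:, 0, 1:]  # [class]-token row, heads fused
        cam = layer_map if cam is None else cam + layer_map
    cam = cam.reshape(-1, grid, grid)             # 14x14 patch grid for 224/16
    cam = (cam - cam.amin()) / (cam.amax() - cam.amin() + 1e-8)
    return cam                                    # upsample to image size as needed
```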
The method is evaluated on the ImageNet ILSVRC 2012, Pascal VOC 2012, and CUB-200 datasets, where it achieves high pixel accuracy, IoU, Dice coefficient, and ABPC scores. In pixel-perturbation tests, it yields more faithful and reliable explanations than previous methods. By providing reliable model explanations and weakly-supervised object detection, the method makes ViT more adaptable to the many tasks in computer vision that involve object localization.
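For concreteness, the localization metrics reported above can be computed from a thresholded explanation map and a ground-truth mask as in the sketch below; the 0.5 threshold and per-image computation are common-practice assumptions rather than the paper's exact protocol.

```python
# Illustrative computation of pixel accuracy, IoU, and Dice coefficient
# between a binarized explanation map and a ground-truth mask of equal size.
import torch

def localization_metrics(cam, gt_mask, thresh=0.5):
    pred = (cam >= thresh).float()
    gt = gt_mask.float()
    tp = (pred * gt).sum()
    pixel_acc = (pred == gt).float().mean()
    iou = tp / (pred.sum() + gt.sum() - tp + 1e-8)
    dice = 2 * tp / (pred.sum() + gt.sum() + 1e-8)
    return pixel_acc.item(), iou.item(), dice.item()
```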