Top-down Neural Attention by Excitation Backprop

1 Aug 2016 | Jianming Zhang, Zhe Lin, Jonathan Brandt, Xiaohui Shen, Stan Sclaroff
This paper proposes Excitation Backprop, a method for modeling top-down attention in Convolutional Neural Networks (CNNs) that generates task-specific attention maps. Inspired by a top-down model of human visual attention, it passes top-down signals down through the network hierarchy via a probabilistic Winner-Take-All (WTA) process, and introduces the concept of contrastive attention to make the resulting attention maps more discriminative.

The method is evaluated on MS COCO, PASCAL VOC07, and ImageNet, where it outperforms existing approaches on weakly supervised localization. It is further validated on a text-to-region association task on the Flickr30k Entities dataset, achieving promising phrase-localization performance by leveraging the top-down attention of a CNN model trained on weakly labeled web images. The paper also analyzes how attention quality varies across network layers and compares the method with prior work, showing gains in both accuracy and generalizability, particularly on challenging cases such as localizing small objects and text-to-region association.

Implemented in Caffe, the method produces interpretable, highly discriminative attention maps and localizes dominant objects in images effectively. The authors conclude that Excitation Backprop offers a novel and effective approach to modeling top-down attention in CNNs, with potential applications across a wide range of computer vision tasks.
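The probabilistic WTA pass summarized above can be illustrated one layer at a time: each output neuron distributes its winning probability to its inputs in proportion to the input activation times the positive (excitatory) part of the connecting weight, so total probability is conserved as the signal moves downward. The following NumPy sketch is an illustrative reading of that rule, not the authors' Caffe implementation; function name and shapes are assumptions, and the paper's contrastive-attention step (subtracting the map obtained with negated top-layer weights) is omitted:

```python
import numpy as np

def excitation_backprop_layer(p_out, weights, a_in):
    """One backward step of the probabilistic WTA process (illustrative sketch).

    p_out   : winning probabilities of the output neurons, shape (m,)
    weights : layer weight matrix, shape (m, n)
    a_in    : input activations, shape (n,)
    returns : marginal winning probabilities of the input neurons, shape (n,)
    """
    w_pos = np.maximum(weights, 0.0)            # keep only excitatory connections
    contrib = w_pos * a_in[None, :]             # a_j * w_ij^+ for each (i, j)
    z = contrib.sum(axis=1, keepdims=True)      # per-output normalizer Z_i
    z[z == 0.0] = 1.0                           # guard against dead outputs
    cond = contrib / z                          # conditional probability P(a_j | a_i)
    return cond.T @ p_out                       # p_j = sum_i P(a_j | a_i) * p_i
```

Starting from a one-hot probability on the target output unit and applying this step layer by layer down to an early convolutional layer yields an attention map whose values sum to one, which matches the probability-conserving behavior the paper attributes to the WTA formulation.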