14 Dec 2015 | Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, Antonio Torralba
This paper revisits the global average pooling (GAP) layer in convolutional neural networks (CNNs) and shows how it gives the network strong localization ability despite being trained only on image-level labels. The authors demonstrate that GAP not only acts as a regularizer but also builds a generic, localizable deep representation that transfers to a variety of tasks. They achieve a top-5 error of 37.1% on ILSVRC 2014 for object localization, close to the 34.2% error of a fully supervised CNN approach. The network localizes discriminative image regions for many tasks it was never explicitly trained on.
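To make the GAP operation concrete, here is a minimal NumPy sketch. The array shapes (512 channels, 14×14 spatial maps) are illustrative assumptions, not taken from the paper:

```python
import numpy as np

# Hypothetical output of the last convolutional layer:
# (channels, height, width).
feature_maps = np.random.rand(512, 14, 14)

# Global average pooling collapses each spatial feature map
# to a single scalar, yielding one value per channel.
gap = feature_maps.mean(axis=(1, 2))
print(gap.shape)  # (512,)
```

Because each channel is reduced to its spatial average, the fully connected layer that follows sees a per-channel score, which is what later allows the classifier weights to be mapped back onto spatial locations.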
The paper introduces a technique called Class Activation Mapping (CAM) that exploits GAP to generate class-specific activation maps. These maps highlight the image regions most responsible for the network's prediction of a given class. The authors show that this enables weakly-supervised object localization and that the localization ability is generic, transferring to other recognition tasks.
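The core of CAM is a weighted sum of the last convolutional feature maps, where the weights are the classifier weights for the class of interest: M_c(x, y) = Σ_k w_k^c · f_k(x, y). A hedged sketch in NumPy follows; the shapes and the min–max normalization step are illustrative choices, not prescribed by the paper:

```python
import numpy as np

def class_activation_map(feature_maps, fc_weights, class_idx):
    """Compute a class activation map as the weighted sum of the
    last conv layer's feature maps, using the fully connected
    (softmax) weights for one class."""
    w = fc_weights[class_idx]                    # (channels,)
    # Contract the channel axis: (channels,) x (channels, H, W) -> (H, W)
    cam = np.tensordot(w, feature_maps, axes=1)
    # Rescale to [0, 1] for visualization (an illustrative choice).
    cam -= cam.min()
    if cam.max() > 0:
        cam /= cam.max()
    return cam

# Toy shapes: 512 channels, 14x14 maps, 1000 classes.
fmaps = np.random.rand(512, 14, 14)
weights = np.random.rand(1000, 512)
cam = class_activation_map(fmaps, weights, class_idx=281)
print(cam.shape)  # (14, 14)
```

In practice the resulting low-resolution map is upsampled to the input image size, and the paper derives bounding boxes by thresholding it for weakly-supervised localization.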
The authors also demonstrate that their approach can be used for fine-grained recognition of bird species and for discovering common elements or patterns in images. They show that their technique can be used to identify important regions in images for tasks such as visual question answering.
The paper concludes that using GAP in CNNs enables the network to learn object localization without any bounding box annotations. Class activation maps make the predicted class scores visible on the input image, highlighting the discriminative object parts the CNN detects. The authors show that the approach applies to a range of visual recognition tasks and that the resulting deep features help explain the basis of discrimination a CNN uses for a given task.