12 May 2014 | Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, Stefan Carlsson
This paper explores how well generic descriptors extracted from convolutional neural networks (CNNs) transfer to a range of recognition tasks. The authors take the OverFeat network, which was trained for object classification on the ILSVRC13 dataset, and use its activations as off-the-shelf features for object image classification, scene recognition, fine-grained recognition, attribute detection, and image retrieval. These tasks are ordered to move progressively further from the original task and data on which OverFeat was trained. The results show that the features extracted from OverFeat consistently outperform highly tuned state-of-the-art systems across all the visual classification tasks and datasets. In image retrieval, the CNN features likewise consistently outperform other low-memory-footprint methods, except on the sculptures dataset. The findings suggest that features obtained from deep learning with convolutional nets should be the primary candidate in most visual recognition tasks. The paper also highlights the importance of adapting CNN architectures for specific tasks when computational resources are limited.
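The pipeline the paper advocates is simple: run images through a frozen, pretrained CNN, take an intermediate activation as a generic descriptor, and train only a lightweight linear classifier on top. The sketch below illustrates that shape with NumPy; the "frozen network" here is just a fixed random projection with a ReLU (a stand-in, since OverFeat itself is not reproduced), and the data is synthetic, so only the structure of the method carries over, not its results.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a frozen, pretrained CNN: a fixed random projection + ReLU.
# In the paper this role is played by an OverFeat layer's activations;
# the weights here are random placeholders so the sketch is runnable.
W_frozen = rng.normal(size=(64, 16))

def extract_features(images):
    """Map raw inputs (n, 64) to generic descriptors (n, 16); never updated."""
    return np.maximum(images @ W_frozen, 0.0)

# Two toy "classes": inputs drawn around different means (synthetic data).
X0 = rng.normal(loc=-0.5, size=(100, 64))
X1 = rng.normal(loc=+0.5, size=(100, 64))
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

# Descriptors are computed once and then held fixed.
F = extract_features(X)

# Train only a linear classifier on top (logistic regression via gradient descent).
w = np.zeros(F.shape[1])
b = 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(F @ w + b)))   # predicted P(y=1)
    grad = p - y                              # gradient of the log loss
    w -= 0.1 * (F.T @ grad) / len(y)
    b -= 0.1 * grad.mean()

acc = ((F @ w + b > 0).astype(int) == y).mean()
print(f"train accuracy: {acc:.2f}")
```

The key property demonstrated is that the feature extractor receives no gradient updates: only `w` and `b` are learned, which is why the approach is cheap enough to serve as a strong baseline across many tasks.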