Multi-Scale Orderless Pooling of Deep Convolutional Activation Features


8 Sep 2014 | Yunchao Gong, Liwei Wang, Ruiqi Guo, Svetlana Lazebnik
This paper introduces multi-scale orderless pooling (MOP-CNN), a method that improves the invariance of deep convolutional neural network (CNN) activations without degrading their discriminative power. MOP-CNN extracts CNN activations for local image patches at multiple scales, performs orderless VLAD pooling at each scale level, and concatenates the results with the global CNN activation. By aggregating patch-level activations in an orderless fashion, the representation preserves local detail while reducing sensitivity to global geometric transformations, and is therefore more robust to deformations than a single whole-image activation.

The representation requires no joint training for specific datasets and works for both supervised tasks (image classification) and unsupervised ones (instance-level retrieval). The paper evaluates MOP-CNN on four benchmark datasets: SUN397, MIT Indoor Scenes, ILSVRC2012/2013, and INRIA Holidays. It achieves state-of-the-art classification accuracy on SUN397 and MIT Indoor Scenes and competitive results on ILSVRC2012/2013, with the largest gains on datasets with high spatial variability. On INRIA Holidays it outperforms global CNN activations in retrieval mAP even when compressed to a compact descriptor, showing the method remains effective where labeled data may not be available.

The paper concludes that MOP-CNN provides a generic, robust representation that can be used for both supervised and unsupervised tasks.
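The pipeline described above could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `mop_cnn`, `vlad_pool`, and `sliding_patches` are hypothetical names, the feature extractor is a stand-in for a real CNN, and in practice the codebook centers would come from k-means over held-out patch activations.

```python
import numpy as np

def sliding_patches(image, size, stride=None):
    """Crop square patches of side `size` on a regular grid (illustrative)."""
    stride = stride or size // 2
    h, w = image.shape[:2]
    return [image[i:i + size, j:j + size]
            for i in range(0, h - size + 1, stride)
            for j in range(0, w - size + 1, stride)]

def vlad_pool(descriptors, centers):
    """Orderless VLAD pooling: accumulate each descriptor's residual to its
    nearest codebook center, flatten, then signed-sqrt and L2 normalize."""
    k, d = centers.shape
    dists = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    assign = dists.argmin(axis=1)        # hard assignment to nearest center
    vlad = np.zeros((k, d))
    for i, c in enumerate(assign):
        vlad[c] += descriptors[i] - centers[c]
    vlad = vlad.ravel()
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))   # power normalization
    norm = np.linalg.norm(vlad)
    return vlad / norm if norm > 0 else vlad

def mop_cnn(extract_fn, image, patch_sizes, codebooks):
    """Sketch of MOP-CNN: global activation at level 1, plus VLAD-pooled
    patch activations at each finer scale, all concatenated."""
    parts = [extract_fn(image)]          # level 1: whole-image activation
    for size, centers in zip(patch_sizes, codebooks):
        acts = np.stack([extract_fn(p) for p in sliding_patches(image, size)])
        parts.append(vlad_pool(acts, centers))
    return np.concatenate(parts)
```

Because VLAD sums residuals over patches regardless of their positions, the per-scale features are orderless, which is what buys the robustness to geometric deformation.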
It is a simple yet effective method that improves the invariance of CNN activations without degrading their discriminative power. Future work includes exploring more sophisticated ways to incorporate orderless information into CNNs and optimizing the feature extraction process.