23 Apr 2015 | Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun
This paper introduces a new pooling strategy called spatial pyramid pooling (SPP) to enhance deep convolutional networks (CNNs) for visual recognition tasks. The SPP layer allows CNNs to process images of arbitrary sizes and scales, removing the fixed-size input constraint imposed by the fully-connected layers. This flexibility improves the accuracy of both image classification and object detection. The proposed network, called SPP-net, generates a fixed-length representation regardless of input size or scale, making it more robust to object deformations and variations in scale.
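The fixed-length property can be sketched in a few lines of NumPy. This is an illustrative sketch, not the paper's implementation: the pyramid levels (1, 2, 4) and the 256-channel feature maps are assumed values chosen for the example (the paper uses other pyramid configurations, e.g. a 4-level pyramid).

```python
import numpy as np

def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool a C x H x W feature map into a fixed-length vector.

    For each pyramid level n, the spatial extent is divided into an
    n x n grid and each cell is max-pooled, so the output length is
    C * sum(n * n for n in levels) regardless of H and W.
    """
    c, h, w = feature_map.shape
    pooled = []
    for n in levels:
        for i in range(n):
            for j in range(n):
                # floor for the start, ceil for the end, so the n x n
                # bins tile the whole map even when n does not divide h, w
                y0, y1 = (i * h) // n, -((-(i + 1) * h) // n)
                x0, x1 = (j * w) // n, -((-(j + 1) * w) // n)
                pooled.append(feature_map[:, y0:y1, x0:x1].max(axis=(1, 2)))
    return np.concatenate(pooled)

# Feature maps of different spatial sizes yield the same output length:
fm_small = np.random.rand(256, 10, 10)
fm_large = np.random.rand(256, 13, 17)
v1 = spatial_pyramid_pool(fm_small)
v2 = spatial_pyramid_pool(fm_large)
print(v1.shape, v2.shape)  # both (5376,) = 256 * (1 + 4 + 16)
```

Because the bin boundaries scale with the input, the fully-connected layers downstream always see the same vector length, which is what lets the network accept arbitrary input sizes.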
SPP-net is tested on the ImageNet 2012 dataset, where it improves the accuracy of various CNN architectures. It also achieves state-of-the-art results on the Pascal VOC 2007 and Caltech101 datasets using a single full-image representation without fine-tuning. In object detection, SPP-net significantly speeds up processing by computing feature maps once and then pooling features in arbitrary regions. This method is 24-102× faster than R-CNN while maintaining or improving accuracy on Pascal VOC 2007.
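The detection speed-up comes from reordering the pipeline: instead of running the CNN once per region proposal (as R-CNN does), the convolutional features are computed once and each proposal is pooled directly from that shared map. A minimal sketch, assuming a total convolutional stride of 16 and a single 4 x 4 pooling grid rather than the full pyramid (both are simplifications for the example):

```python
import numpy as np

def pool_region(feature_map, box, stride=16, n=4):
    """Pool one image-space region from precomputed conv features.

    box is (x0, y0, x1, y1) in image pixels; dividing by the network's
    total stride projects it onto the feature map, so thousands of
    proposals can reuse a single convolutional forward pass.
    """
    c = feature_map.shape[0]
    x0, y0, x1, y1 = (v // stride for v in box)
    # clamp to at least one feature-map cell in each dimension
    crop = feature_map[:, y0:max(y1, y0 + 1), x0:max(x1, x0 + 1)]
    h, w = crop.shape[1:]
    out = np.empty((n, n, c))
    for i in range(n):
        for j in range(n):
            ys = slice((i * h) // n, -((-(i + 1) * h) // n))
            xs = slice((j * w) // n, -((-(j + 1) * w) // n))
            out[i, j] = crop[:, ys, xs].max(axis=(1, 2))
    return out.ravel()  # fixed length n * n * c for any box size

features = np.random.rand(256, 14, 14)        # conv feature map, computed once
desc = pool_region(features, (32, 48, 160, 200))
print(desc.shape)  # (4096,) = 4 * 4 * 256
```

Pooling a region this way costs only a few slicing operations, versus a full CNN forward pass per proposal in R-CNN, which is the source of the reported 24-102× speed-up.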
In the ILSVRC 2014 competition, SPP-net ranks #2 in object detection and #3 in image classification among 38 teams. The paper also introduces improvements for this competition. SPP-net's advantages are orthogonal to specific CNN designs, and it enhances performance across various architectures. The method is efficient and practical for real-world applications, with a fast processing speed and high accuracy. The paper also discusses multi-scale training, full-image representations, and multi-view testing on feature maps, which further improve classification accuracy. Overall, SPP-net demonstrates significant improvements in both image classification and object detection tasks.