OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks

24 Feb 2014 | Pierre Sermanet, David Eigen, Xiang Zhang, Michael Mathieu, Rob Fergus, Yann LeCun
This paper presents an integrated framework for using Convolutional Networks (ConvNets) for classification, localization, and detection. The framework efficiently implements a multiscale, sliding-window approach within a ConvNet. A novel deep-learning approach to localization is introduced that learns to predict object boundaries; bounding boxes are then accumulated rather than suppressed in order to increase detection confidence. A single shared network learns the different tasks simultaneously. The framework won the localization task of the ImageNet Large Scale Visual Recognition Challenge 2013 (ILSVRC2013) and achieved competitive results on the detection and classification tasks; post-competition work established a new state of the art for detection. A feature extractor named OverFeat, derived from the best model, is publicly released.

The paper addresses three computer vision tasks, each a sub-task of the next: classification, localization, and detection, all evaluated on ILSVRC2013. In the classification task, each image is assigned a single label corresponding to its main object. In the localization task, five guesses are allowed per image, but each guess must include a bounding box for the predicted object. In the detection task, any number of objects may be present, and false positives are penalized by the mean average precision (mAP) measure. Localization thus serves as an intermediate step between classification and detection, allowing the localization method to be evaluated independently of detection-specific challenges.

The classification architecture is similar to the best ILSVRC12 architecture of Krizhevsky et al., with improvements to the network design and the inference step. The model is trained on the ImageNet 2012 training set with a fixed input size, but inference for classification is performed at multiple scales.
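The efficiency of the multiscale sliding-window approach comes from a property of ConvNets: applying the network convolutionally over a larger image is equivalent to evaluating it independently on every fixed-size window, while sharing all overlapping computation. A minimal NumPy sketch of this equivalence, using a toy two-layer network with illustrative random weights (not the paper's architecture):

```python
import numpy as np

def conv_valid(x, k):
    """'Valid' 2-D cross-correlation of a single-channel map x with kernel k."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

# Toy "network": one 3x3 conv layer + ReLU, then a "fully-connected"
# layer for a 6x6 input window, expressed as a 4x4 convolution.
rng = np.random.default_rng(0)
w_conv = rng.standard_normal((3, 3))
w_fc = rng.standard_normal((4, 4))

def net(x):
    """Apply the toy network convolutionally; a 6x6 input yields a 1x1 output."""
    return conv_valid(np.maximum(conv_valid(x, w_conv), 0.0), w_fc)

# Dense evaluation on a larger image...
img = rng.standard_normal((10, 10))
dense = net(img)                  # 5x5 grid of window scores in one pass

# ...matches explicitly classifying the 6x6 window at offset (2, 3).
window_score = net(img[2:8, 3:9])
assert np.allclose(dense[2, 3], window_score[0, 0])
```

Because the convolutions share computation across overlapping windows, the dense pass is far cheaper than evaluating each window from scratch, which is what makes applying the network at six scales affordable.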
Two versions of the model are provided, a fast one and an accurate one; the accurate model yields lower error but requires more connections. Combining six scales, the classifier achieves a top-5 error rate of 13.6%. For localization, a regression network is trained to predict object bounding boxes at each spatial location and scale. It takes pooled feature maps from layer 5 as input and has two fully-connected hidden layers; its final output layer has 4 units specifying the coordinates of the bounding box edges. Individual predictions are combined via a greedy merge strategy, yielding a final prediction with maximal class scores.

The detection task resembles classification but is carried out spatially across the image, and the network is additionally trained to predict a background class when no object is present. Using the same multi-scale approach, the post-competition model achieves a mean average precision (mAP) of 24.3%, ranking first among detection results. Because ConvNets are inherently efficient when applied over larger images, the framework remains computationally cheap, and its integrated design allows the different tasks to share a common feature-extraction base.
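The greedy merge strategy can be sketched as follows. This is a simplified, hypothetical rendering: boxes are `(x1, y1, x2, y2, score)` tuples, IoU stands in for the paper's match score (which combines center distance and intersection area), and the threshold `t` is an illustrative parameter. The key point it illustrates is accumulation rather than suppression: overlapping predictions reinforce each other by summing their confidences instead of being discarded.

```python
def iou(a, b):
    """Intersection-over-union of two boxes (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def merge_boxes(boxes, t=0.5):
    """Greedily fuse overlapping boxes, accumulating their confidences."""
    boxes = [list(b) for b in boxes]
    while True:
        # Find the closest pair (highest IoU here; the paper uses a
        # match score based on center distance and intersection area).
        best, pair = 0.0, None
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                s = iou(boxes[i], boxes[j])
                if s > best:
                    best, pair = s, (i, j)
        if pair is None or best < t:
            return [tuple(b) for b in boxes]
        i, j = pair
        a, b = boxes[i], boxes[j]
        # Average the coordinates, sum the confidences.
        fused = [(a[k] + b[k]) / 2 for k in range(4)] + [a[4] + b[4]]
        boxes = [x for k, x in enumerate(boxes) if k not in (i, j)] + [fused]

# Two overlapping predictions fuse into one stronger box;
# the distant box is left untouched.
merged = merge_boxes([(0, 0, 10, 10, 0.6), (1, 1, 11, 11, 0.5),
                      (50, 50, 60, 60, 0.9)])
```

Summing confidences is what lets many weak, slightly offset predictions across locations and scales add up to a single high-confidence detection, in contrast to non-maximum suppression, which would keep only the single best-scoring box.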