22 Oct 2014 | Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik
R-CNN is an object detection method that uses convolutional neural networks (CNNs) to extract features from region proposals. It improves mean average precision (mAP) by more than 30% relative to the previous best result on the PASCAL VOC 2012 dataset, achieving an mAP of 53.3%. R-CNN combines two key insights: (1) applying high-capacity CNNs to bottom-up region proposals in order to localize and segment objects, and (2) using supervised pre-training on an auxiliary task followed by domain-specific fine-tuning when labeled training data is scarce.

On the ILSVRC2013 detection dataset, R-CNN outperforms OverFeat, achieving an mAP of 31.4% versus OverFeat's 24.3%. It also performs well on semantic segmentation, reaching an average accuracy of 47.9% on the VOC 2011 test set. R-CNN is efficient: the only class-specific computations are matrix-vector products and non-maximum suppression, which makes the method practical to scale to thousands of object classes. Fine-tuning the CNN on the target detection task raises mAP by 8 percentage points. Compared with other approaches, including deformable part models and recent feature learning methods, R-CNN shows significant performance improvements. Its success is attributed to the use of CNNs for feature extraction and the ability to exploit large-scale data. The results demonstrate the potential of CNNs for object detection, semantic segmentation, and vision tasks more broadly.
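The claim that the only class-specific work is a matrix-vector product plus non-maximum suppression can be sketched as follows. This is a minimal NumPy illustration, not the paper's released code: the feature matrix, SVM weight vector, boxes, and IoU threshold below are made-up stand-ins for the real pooled CNN features and learned per-class detectors.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.3):
    """Greedy non-maximum suppression over [x1, y1, x2, y2] boxes."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]          # highest-scoring box first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # Intersection of the kept box with every remaining box
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_threshold]  # drop heavily overlapping boxes
    return keep

# Toy example: 3 region proposals with 5-dim "CNN features" (illustrative values).
features = np.array([[1.0, 0.2, 0.0, 0.3, 0.1],
                     [0.9, 0.3, 0.1, 0.2, 0.0],
                     [0.1, 0.0, 0.9, 0.8, 0.2]])
w = np.array([1.0, 0.5, -0.2, 0.1, 0.0])    # per-class SVM weights (illustrative)
scores = features @ w                        # the class-specific matrix-vector product
boxes = np.array([[0, 0, 10, 10],
                  [1, 1, 10, 10],            # heavily overlaps the first box
                  [20, 20, 30, 30]], dtype=float)
kept = nms(boxes, scores)                    # suppress duplicate detections
```

Here the second proposal is suppressed because it overlaps the higher-scoring first proposal beyond the IoU threshold, leaving one detection per object.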