22 Oct 2014 | Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik
R-CNN is an object detection method that uses convolutional neural networks (CNNs) to extract features from region proposals. It improves mean average precision (mAP) by more than 30% relative to the previous best result on the PASCAL VOC 2012 dataset, achieving an mAP of 53.3%. R-CNN combines two key insights: (1) applying high-capacity CNNs to bottom-up region proposals in order to localize and segment objects, and (2) using supervised pre-training on an auxiliary task followed by domain-specific fine-tuning when labeled training data is scarce.

On the ILSVRC2013 detection dataset, R-CNN outperforms OverFeat, achieving an mAP of 31.4% versus OverFeat's 24.3%. It also performs well on semantic segmentation, reaching an average accuracy of 47.9% on the VOC 2011 test set. R-CNN is efficient: the only class-specific computations are matrix-vector products and non-maximum suppression, which makes the method practical to scale to thousands of object classes. Fine-tuning the CNN on the target detection task raises mAP by 8 percentage points. Compared with other approaches, including deformable part models and recent feature learning methods, R-CNN shows significant performance improvements. Its success is attributed to the use of CNNs for feature extraction and the ability to exploit large-scale data. The results demonstrate the potential of CNNs for object detection, semantic segmentation, and vision tasks more broadly.
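The claim that the only class-specific work is a matrix-vector product plus non-maximum suppression can be sketched as follows. This is a minimal NumPy illustration, not the paper's released code: the feature matrix, SVM weight vector, boxes, and IoU threshold below are made-up stand-ins for the real pooled CNN features and learned per-class detectors.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.3):
    """Greedy non-maximum suppression over [x1, y1, x2, y2] boxes."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]          # highest-scoring box first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # Intersection of the kept box with every remaining box
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_threshold]  # drop heavily overlapping boxes
    return keep

# Toy example: 3 region proposals with 5-dim "CNN features" (illustrative values).
features = np.array([[1.0, 0.2, 0.0, 0.3, 0.1],
                     [0.9, 0.3, 0.1, 0.2, 0.0],
                     [0.1, 0.0, 0.9, 0.8, 0.2]])
w = np.array([1.0, 0.5, -0.2, 0.1, 0.0])    # per-class SVM weights (illustrative)
scores = features @ w                        # the class-specific matrix-vector product
boxes = np.array([[0, 0, 10, 10],
                  [1, 1, 10, 10],            # heavily overlaps the first box
                  [20, 20, 30, 30]], dtype=float)
kept = nms(boxes, scores)                    # suppress duplicate detections
```

Here the second proposal is suppressed because it overlaps the higher-scoring first proposal beyond the IoU threshold, leaving one detection per object.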