17 Sep 2014 | Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich
This paper introduces a deep convolutional neural network architecture called Inception, which achieved state-of-the-art results in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The architecture is designed to use computational resources efficiently, increasing the depth and width of the network while keeping the computational budget constant. Its design draws on the Hebbian principle and the intuition of multi-scale processing: "Inception modules" run several convolutions (1×1, 3×3, and 5×5) in parallel and concatenate their outputs, letting the network process visual information at multiple scales simultaneously.

The Inception architecture underlies the GoogLeNet model, a 22-layer deep network that significantly outperforms previous state-of-the-art models in both the classification and detection tasks. At test time, the authors combined an ensemble of 7 independently trained models with aggressive multi-crop evaluation, achieving a top-5 error rate of 6.67% on both the validation and test data and ranking first among the participants.

In the detection task, GoogLeNet was used within an R-CNN-style pipeline, with improvements to the region-proposal step and an ensemble of 6 ConvNets for classifying proposals, reaching a mean average precision (mAP) of 43.9%.

The paper also discusses the motivation behind the Inception architecture, including the need for efficient computation and the benefits of approximating sparsely connected architectures with dense building blocks. The results show that the Inception approach is a viable method for improving neural networks for computer vision tasks.
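The parallel-branch idea can be sketched in plain numpy. This is a minimal illustration, not the paper's implementation: biases, ReLU nonlinearities, and stride handling are omitted, and the parameter names (`w1`, `w3r`, etc.) are our own labels for the branch weights, not terminology from the paper. It does include the two structural points of the "with dimension reduction" module: 1×1 convolutions that shrink the channel count before the expensive 3×3 and 5×5 convolutions, and a pooling branch, with all branch outputs concatenated along the channel axis.

```python
import numpy as np

def conv2d(x, w):
    """'Same'-padded 2D convolution, stride 1.
    x: (C_in, H, W) input; w: (C_out, C_in, k, k) filters with odd k."""
    c_out, c_in, k, _ = w.shape
    pad = k // 2
    _, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((c_out, h, wd))
    for o in range(c_out):
        for i in range(h):
            for j in range(wd):
                out[o, i, j] = np.sum(xp[:, i:i + k, j:j + k] * w[o])
    return out

def maxpool3x3(x):
    """3x3 max pooling with stride 1 and 'same' padding."""
    c, h, w = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)), constant_values=-np.inf)
    out = np.empty_like(x)
    for i in range(h):
        for j in range(w):
            out[:, i, j] = xp[:, i:i + 3, j:j + 3].max(axis=(1, 2))
    return out

def inception_module(x, p):
    """Inception module with dimension reduction (illustrative names):
      w1      - 1x1 branch
      w3r, w3 - 1x1 channel reduction, then 3x3
      w5r, w5 - 1x1 channel reduction, then 5x5
      wp      - 1x1 projection after 3x3 max pooling
    Branch outputs share the spatial size ('same' padding) and are
    concatenated along the channel axis."""
    b1 = conv2d(x, p["w1"])
    b3 = conv2d(conv2d(x, p["w3r"]), p["w3"])
    b5 = conv2d(conv2d(x, p["w5r"]), p["w5"])
    bp = conv2d(maxpool3x3(x), p["wp"])
    return np.concatenate([b1, b3, b5, bp], axis=0)
```

Because every branch preserves the spatial resolution, the only dimension that grows is the channel count, which is what lets modules be stacked while the 1×1 reductions keep the compute of the 3×3 and 5×5 branches in check.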