Deep scene CNNs have recently emerged as a powerful tool for visual recognition. This paper shows that object detectors emerge naturally when a CNN is trained for scene classification: because scenes are composed of objects, the network automatically discovers meaningful detectors for the objects representative of the learned scene categories. A single network can thus perform both scene recognition and object localization in a single forward pass, without ever having been explicitly taught the notion of objects.
The paper investigates the internal representation learned by CNNs trained for scene classification on the Places dataset, and shows that it differs from the representation learned by a CNN trained for object classification on ImageNet. For example, units in the earlier layers of both networks prefer similar images, while units in the later layers become increasingly specialized to the respective task of scene or object categorization.
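One simple way to make such a comparison concrete is to check how often the two networks agree on each image's nearest neighbour in feature space: high agreement suggests similar representations. This is a hypothetical analysis sketch, not the paper's actual procedure; `top1_neighbour_agreement` and the random features below are invented for illustration.

```python
import numpy as np

def top1_neighbour_agreement(feats_a, feats_b):
    """Fraction of images whose nearest neighbour (cosine similarity,
    self excluded) is the same under both feature spaces."""
    def nn(F):
        Fn = F / np.linalg.norm(F, axis=1, keepdims=True)
        S = Fn @ Fn.T                 # pairwise cosine similarities
        np.fill_diagonal(S, -np.inf)  # exclude self-matches
        return S.argmax(axis=1)
    return float((nn(feats_a) == nn(feats_b)).mean())

# Identical features must agree perfectly; unrelated random features mostly won't.
rng = np.random.default_rng(0)
F = rng.normal(size=(20, 8))
agree_same = top1_neighbour_agreement(F, F.copy())
agree_rand = top1_neighbour_agreement(F, rng.normal(size=(20, 8)))
```

Run on features extracted layer by layer, a metric like this would be expected to stay high for early layers of the two networks and drop for later, task-specific layers.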
The paper also analyzes the receptive fields (RFs) of individual units in the CNNs. As the layers go deeper, the RF size gradually increases and the activation regions become more semantically meaningful. The estimated RFs are then used to segment images via the units' feature maps. The analysis shows that the empirical RF size is much smaller than the theoretical one, especially in the later layers.
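The segmentation step can be sketched as follows: upsample a unit's feature map to image resolution and keep only the most strongly activated pixels. This is a simplified stand-in for the paper's RF-based procedure; the function name and the quantile thresholding scheme are assumptions for illustration.

```python
import numpy as np

def segment_from_feature_map(fmap, image_hw, keep_frac=0.05):
    """Upsample one unit's 2-D feature map to image size (nearest
    neighbour) and keep the top `keep_frac` of pixels as a binary mask."""
    H, W = image_hw
    h, w = fmap.shape
    # nearest-neighbour upsampling via integer index mapping
    rows = np.arange(H) * h // H
    cols = np.arange(W) * w // W
    up = fmap[np.ix_(rows, cols)]
    thresh = np.quantile(up, 1.0 - keep_frac)
    return up >= thresh

# Toy unit that fires on a small central region of a 6x6 feature map.
fmap = np.zeros((6, 6))
fmap[2:4, 2:4] = 1.0
mask = segment_from_feature_map(fmap, (60, 60))
```

In the toy example the mask covers exactly the upsampled 20x20 block where the unit fired, illustrating how a single unit's activations already localize a region of the input image.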
The paper further investigates the semantics of individual units. Units in the later layers of the Places-CNN cover a higher proportion of high-level semantic concepts than those in the ImageNet-CNN. The Places-CNN units detect a wide range of object classes, including buildings, lamps, and dinner tables, and some of these detected classes have no counterpart among the ImageNet-CNN units.
The paper concludes that object detectors emerge as a result of learning to classify scene categories, showing that a single network can support recognition at several levels of abstraction (e.g., edges, textures, objects, and scenes) without needing multiple outputs or networks, and that the network can perform both scene recognition and object localization in a single forward pass.
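Assuming the activations from one forward pass are already available, the combined scene-recognition-plus-localization idea can be sketched as below. The names (`recognize_and_localize`, `conv5`) and the peak-location heuristic are illustrative assumptions, not the paper's code.

```python
import numpy as np

def recognize_and_localize(scene_scores, conv5, scene_names):
    """From a single forward pass: argmax over the final-layer scores
    gives the scene label; the peak location of the most strongly
    activated conv5 unit gives a coarse object localization."""
    scene = scene_names[int(np.argmax(scene_scores))]
    unit = int(conv5.reshape(conv5.shape[0], -1).max(axis=1).argmax())
    y, x = np.unravel_index(np.argmax(conv5[unit]), conv5[unit].shape)
    return scene, unit, (int(y), int(x))

# Toy activations: 3 scene classes, 3 conv5 units with 6x6 feature maps.
scene_scores = np.array([0.1, 0.7, 0.2])
conv5 = np.zeros((3, 6, 6))
conv5[1, 4, 2] = 5.0  # unit 1 fires at feature-map position (4, 2)
scene, unit, loc = recognize_and_localize(
    scene_scores, conv5, ["beach", "bedroom", "kitchen"])
```

Both outputs come from activations computed in the same pass, which is the sense in which one network serves both tasks at once.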