24 Nov 2015 | Justin Johnson*, Andrej Karpathy*, Li Fei-Fei
DenseCap introduces a fully convolutional localization network (FCLN) for dense captioning, a task that requires a computer vision system to both localize and describe salient regions of an image in natural language. The architecture processes an image in a single forward pass, requires no external region proposals, and can be trained end-to-end. It consists of three components: a convolutional network, a novel dense localization layer, and a recurrent neural network language model that generates the label sequences.

The model is evaluated on the Visual Genome dataset, which contains 94,000 images and 4,100,000 region-grounded captions. It outperforms baselines in both the generation and retrieval settings, improving on existing methods in both speed and accuracy, and it is efficient at test time: a 720x600 image is processed in 240 ms. Qualitatively, the model detects animal parts, captures some object attributes and interactions between objects, and correctly retrieves and localizes people, animals, and parts of both natural and man-made objects.

Beyond captioning, the model supports image retrieval with natural-language queries and can localize those queries in the retrieved images, outperforming baselines on both ranking and localization. It can also be used for open-world object detection, where object classes are specified in natural language at test time, and it can detect arbitrary pieces of text in images.
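To make the three-component pipeline concrete, here is a minimal PyTorch-style sketch of a DenseCap-like forward pass. It assumes a VGG-16 backbone, a heavily simplified localization layer (top-scoring grid cells with fixed-size windows standing in for the paper's differentiable box prediction and bilinear interpolation), and a single-layer LSTM language model. All module and variable names are illustrative placeholders, not the authors' implementation.

```python
# Sketch of a dense-captioning forward pass: conv features -> region proposals
# with pooled features -> per-region caption logits. Simplified for illustration.
import torch
import torch.nn as nn
import torchvision


class DenseLocalizationLayer(nn.Module):
    """Selects B regions per image and pools a fixed-size feature for each."""
    def __init__(self, in_channels, num_regions=32, pooled_size=7):
        super().__init__()
        self.num_regions = num_regions
        self.pooled_size = pooled_size
        self.score = nn.Conv2d(in_channels, 1, kernel_size=1)  # region confidence

    def forward(self, feats):
        # feats: (1, C, H, W) conv features for a single image.
        _, C, H, W = feats.shape
        scores = self.score(feats).flatten()
        top = scores.topk(self.num_regions).indices
        pooled = []
        for idx in top:
            y, x = int(idx) // W, int(idx) % W
            # Fixed window around the cell, standing in for the paper's
            # predicted boxes and bilinear ROI interpolation.
            y0, y1 = max(0, y - 3), min(H, y + 4)
            x0, x1 = max(0, x - 3), min(W, x + 4)
            patch = feats[:, :, y0:y1, x0:x1]
            pooled.append(nn.functional.adaptive_avg_pool2d(patch, self.pooled_size))
        return torch.cat(pooled, dim=0)                       # (B, C, 7, 7)


class CaptionRNN(nn.Module):
    """Generates a word sequence conditioned on a region feature."""
    def __init__(self, feat_dim, vocab_size, hidden=512):
        super().__init__()
        self.project = nn.Linear(feat_dim, hidden)
        self.embed = nn.Embedding(vocab_size, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, region_feats, captions):
        # region_feats: (B, feat_dim); captions: (B, T) token ids (teacher forcing).
        img = self.project(region_feats).unsqueeze(1)         # image token first
        words = self.embed(captions)
        hidden, _ = self.lstm(torch.cat([img, words], dim=1))
        return self.out(hidden)                               # (B, T+1, vocab)


# Wire the three components together for one image.
backbone = torchvision.models.vgg16(weights=None).features[:-1]
localize = DenseLocalizationLayer(in_channels=512)
caption = CaptionRNN(feat_dim=512 * 7 * 7, vocab_size=10000)

image = torch.randn(1, 3, 600, 720)
feats = backbone(image)                                       # conv feature map
regions = localize(feats)                                     # (B, 512, 7, 7)
dummy_captions = torch.zeros(regions.size(0), 10, dtype=torch.long)
logits = caption(regions.flatten(1), dummy_captions)          # per-region word scores
```

Because every stage is differentiable (in the paper, the region features come from bilinear interpolation over predicted boxes), the captioning loss can be backpropagated through the localization layer into the convolutional backbone, which is what makes end-to-end training possible.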
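The retrieval setting can be sketched in a similar spirit: score each image by how probable the query is under the region language model, and localize the query at the best-scoring region. The snippet below shows this general idea under the same assumed components as the sketch above; it is an illustration of one plausible scoring rule, not the paper's exact procedure.

```python
# Hedged sketch of natural-language image retrieval with a dense-captioning
# model: rank images by the log-probability the region language model assigns
# to the query, and localize the query at the best-scoring region.
import torch
import torch.nn.functional as F


def query_log_prob(caption_rnn, region_feat, query_ids):
    """Sum of log P(word_t | region, previous words) over the query tokens."""
    logits = caption_rnn(region_feat.unsqueeze(0), query_ids.unsqueeze(0))
    log_probs = F.log_softmax(logits[0, :-1], dim=-1)        # one row per query token
    return log_probs.gather(1, query_ids.unsqueeze(1)).sum()


def retrieve(images, query_ids, backbone, localize, caption_rnn, top_k=5):
    """Return (image index, best region index) pairs for the top_k images."""
    scores = []
    for img in images:
        feats = backbone(img.unsqueeze(0))
        regions = localize(feats).flatten(1)                  # (B, feat_dim)
        region_scores = torch.stack(
            [query_log_prob(caption_rnn, r, query_ids) for r in regions])
        best = region_scores.max(dim=0)
        scores.append((best.values.item(), best.indices.item()))
    ranked = sorted(range(len(images)), key=lambda i: scores[i][0], reverse=True)
    return [(i, scores[i][1]) for i in ranked[:top_k]]
```

The same machinery covers the open-world detection use case: since the "class" is just a natural-language query scored against every region, new object categories can be specified at test time without retraining.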