DenseCap: Fully Convolutional Localization Networks for Dense Captioning


24 Nov 2015 | Justin Johnson*, Andrej Karpathy*, Li Fei-Fei
The paper introduces the dense captioning task, which requires a system to localize and describe salient regions in images using natural language. To address this task, the authors propose the Fully Convolutional Localization Network (FCLN) architecture, which processes images with a single forward pass, requires no external region proposals, and can be trained end-to-end with a single round of optimization. The FCLN consists of a Convolutional Network, a novel dense localization layer, and a Recurrent Neural Network (RNN) language model. The dense localization layer predicts regions of interest and uses bilinear interpolation to extract features from these regions. The RNN language model generates label sequences based on the extracted features. The model is evaluated on the Visual Genome dataset, which contains 94,000 images and 4,100,000 region-grounded captions. The results show both speed and accuracy improvements over baselines in generation and retrieval settings. The authors also discuss related work in object detection, image captioning, and soft spatial attention, and provide a detailed description of the model architecture, loss function, and training process.
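To make the bilinear-interpolation step of the localization layer concrete, here is a minimal NumPy sketch (not the authors' Torch implementation) of sampling a fixed-size grid of features from one proposed region of a convolutional feature map. The function name, the (x0, y0, x1, y1) box format, and the 7x7 grid size are illustrative assumptions.

```python
import numpy as np

def bilinear_region_features(feat, box, out_size=7):
    """Sample an out_size x out_size grid of features from `box`
    on a C x H x W feature map using bilinear interpolation.
    `box` is (x0, y0, x1, y1) in feature-map coordinates (hypothetical format)."""
    C, H, W = feat.shape
    x0, y0, x1, y1 = box
    # Evenly spaced sample points spanning the region.
    xs = np.linspace(x0, x1, out_size)
    ys = np.linspace(y0, y1, out_size)
    out = np.zeros((C, out_size, out_size), dtype=feat.dtype)
    for i, y in enumerate(ys):
        for j, x in enumerate(xs):
            # Clamp the sample point to the feature map, then blend the
            # four surrounding cells with bilinear weights.
            x = min(max(x, 0.0), W - 1.0)
            y = min(max(y, 0.0), H - 1.0)
            x_lo, y_lo = int(np.floor(x)), int(np.floor(y))
            x_hi, y_hi = min(x_lo + 1, W - 1), min(y_lo + 1, H - 1)
            wx, wy = x - x_lo, y - y_lo
            out[:, i, j] = (
                feat[:, y_lo, x_lo] * (1 - wy) * (1 - wx)
                + feat[:, y_lo, x_hi] * (1 - wy) * wx
                + feat[:, y_hi, x_lo] * wy * (1 - wx)
                + feat[:, y_hi, x_hi] * wy * wx
            )
    return out

# Example: crop a 7x7 feature patch for one proposed region.
feature_map = np.random.randn(512, 38, 50).astype(np.float32)
region = (10.3, 5.8, 24.9, 19.2)  # hypothetical box in feature-map coordinates
patch = bilinear_region_features(feature_map, region)
print(patch.shape)  # (512, 7, 7)
```

Because every step of this sampling is differentiable with respect to the feature map, gradients can flow back through the extracted region features, which is what lets the full pipeline be trained end-to-end.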