Show and Tell: A Neural Image Caption Generator


20 Apr 2015 | Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan
This paper presents the Neural Image Caption generator (NIC), which combines a deep recurrent architecture with recent advances in computer vision and machine translation to generate natural language descriptions of images. The model is trained to maximize the likelihood of the target description sentence given the input image, and experiments on several datasets show that NIC outperforms previous approaches in both accuracy and fluency.

NIC is a single neural network consisting of a vision CNN followed by a language-generating RNN. The CNN encodes the image into a fixed-length vector, which is fed to the RNN as its starting input; the RNN then generates the sentence word by word. The whole model is end to end and fully trainable with stochastic gradient descent, with the objective of maximizing the likelihood of the correct description given the image.

On the Pascal dataset, NIC achieves a BLEU-1 score of 59, compared with the previous state of the art of 25 and a human score of 69. On Flickr30k, BLEU-1 improves from 56 to 66, and on SBU from 19 to 28. On the newly released COCO dataset, NIC reaches a BLEU-4 score of 27.7, the state of the art at the time.

Beyond BLEU, the model is evaluated with human judgments, ranking metrics (ranking descriptions given an image and ranking images given a description), and analyses of transfer learning, training-set size, label quality, and generation diversity. These experiments show that NIC transfers knowledge from one dataset to another, remains effective even with weakly labeled data, compares favorably to human-written descriptions in human evaluations, performs well on both ranking tasks, and generates diverse, high-quality sentences.

Overall, the experiments demonstrate the robustness of NIC both qualitatively (the generated sentences are very reasonable) and quantitatively, using either ranking metrics or BLEU, a metric from machine translation that evaluates the quality of generated sentences. The results suggest that the performance of approaches like NIC will keep improving as image-description datasets grow, and that using unsupervised data, both from images alone and from text alone, is a promising direction for further improving image description.
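To make the encoder-decoder scheme concrete, here is a minimal PyTorch-style sketch of a CNN-plus-RNN captioner in the spirit of NIC. It is an illustrative reconstruction, not the authors' code: the module names, layer sizes, the ResNet backbone, and the choice of an LSTM decoder are assumptions made for the example.

```python
# Minimal sketch of a NIC-style CNN encoder + RNN decoder (illustrative only;
# names, sizes, and the ResNet/LSTM choices are assumptions, not the paper's setup).
import torch
import torch.nn as nn
import torchvision.models as models

class CaptionModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512):
        super().__init__()
        # Vision CNN: a pretrained classifier whose head is replaced so it
        # emits a fixed-length image embedding.
        cnn = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        cnn.fc = nn.Linear(cnn.fc.in_features, embed_dim)
        self.cnn = cnn
        # Language-generating RNN over word embeddings.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)   # next-word logits

    def forward(self, images, captions):
        # images: (batch, 3, H, W); captions: (batch, seq_len) token ids
        img_vec = self.cnn(images).unsqueeze(1)        # (batch, 1, embed_dim)
        words = self.embed(captions)                   # (batch, seq_len, embed_dim)
        # The image embedding is fed to the RNN first, then the caption tokens.
        inputs = torch.cat([img_vec, words], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.fc(hidden)                         # (batch, seq_len + 1, vocab_size)

# Maximum-likelihood training: cross-entropy against the next word at each step,
# optimized end to end with stochastic gradient descent.
```

A training step would compute the cross-entropy between the predicted logits and the next-word targets and back-propagate through the decoder and, optionally, the CNN, which corresponds to maximizing the likelihood of the description given the image; at inference time the caption is generated word by word from the image vector.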
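As a side note on the metric, scores like the BLEU-1 numbers above can be computed with standard toolkits. The snippet below is a small example using NLTK's sentence-level BLEU; the captions are made up for illustration, and the paper's reported numbers are corpus-level scores, so this only shows what BLEU-1 measures.

```python
# Toy illustration of BLEU-1 (unigram precision with a brevity penalty) using NLTK.
# The reference and candidate captions are invented for this example.
from nltk.translate.bleu_score import sentence_bleu

references = [["a", "dog", "is", "running", "on", "the", "beach"]]
candidate = ["a", "dog", "runs", "on", "the", "beach"]

# weights=(1.0,) restricts the score to unigram matches, i.e. BLEU-1.
bleu1 = sentence_bleu(references, candidate, weights=(1.0,))
print(f"BLEU-1: {bleu1:.2f}")
```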