Show and Tell: A Neural Image Caption Generator


20 Apr 2015 | Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan
This paper presents the Neural Image Caption generator (NIC), which combines a deep recurrent architecture with recent advances in computer vision and machine translation to generate natural language descriptions of images. The model is trained to maximize the likelihood of the target description sentence given the input image, and experiments on several datasets show that NIC outperforms previous approaches in both accuracy and fluency.

NIC is a single neural network consisting of a vision CNN followed by a language-generating RNN. The CNN encodes the image into a fixed-length vector, which is fed to the RNN as its starting input; the RNN then generates the sentence word by word. The whole model is end to end and fully trainable with stochastic gradient descent, with the objective of maximizing the likelihood of the correct description given the image.

On the Pascal dataset, NIC achieves a BLEU-1 score of 59, compared with the previous state of the art of 25 and a human score of 69. On Flickr30k, BLEU-1 improves from 56 to 66, and on SBU from 19 to 28. On the newly released COCO dataset, NIC reaches a BLEU-4 score of 27.7, the state of the art at the time.

Beyond BLEU, the model is evaluated with human judgments, ranking metrics (ranking descriptions given an image and ranking images given a description), and analyses of transfer learning, training-set size, label quality, and generation diversity. These experiments show that NIC transfers knowledge from one dataset to another, remains effective even with weakly labeled data, compares favorably to human-written descriptions in human evaluations, performs well on both ranking tasks, and generates diverse, high-quality sentences.

Overall, the experiments demonstrate the robustness of NIC both qualitatively (the generated sentences are very reasonable) and quantitatively, using either ranking metrics or BLEU, a metric from machine translation that evaluates the quality of generated sentences. The results suggest that the performance of approaches like NIC will keep improving as image-description datasets grow, and that using unsupervised data, both from images alone and from text alone, is a promising direction for further improving image description.
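To make the encoder-decoder scheme concrete, here is a minimal PyTorch-style sketch of a CNN-plus-RNN captioner in the spirit of NIC. It is an illustrative reconstruction, not the authors' code: the module names, layer sizes, the ResNet backbone, and the choice of an LSTM decoder are assumptions made for the example.

```python
# Minimal sketch of a NIC-style CNN encoder + RNN decoder (illustrative only;
# names, sizes, and the ResNet/LSTM choices are assumptions, not the paper's setup).
import torch
import torch.nn as nn
import torchvision.models as models

class CaptionModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512):
        super().__init__()
        # Vision CNN: a pretrained classifier whose head is replaced so it
        # emits a fixed-length image embedding.
        cnn = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        cnn.fc = nn.Linear(cnn.fc.in_features, embed_dim)
        self.cnn = cnn
        # Language-generating RNN over word embeddings.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)   # next-word logits

    def forward(self, images, captions):
        # images: (batch, 3, H, W); captions: (batch, seq_len) token ids
        img_vec = self.cnn(images).unsqueeze(1)        # (batch, 1, embed_dim)
        words = self.embed(captions)                   # (batch, seq_len, embed_dim)
        # The image embedding is fed to the RNN first, then the caption tokens.
        inputs = torch.cat([img_vec, words], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.fc(hidden)                         # (batch, seq_len + 1, vocab_size)

# Maximum-likelihood training: cross-entropy against the next word at each step,
# optimized end to end with stochastic gradient descent.
```

A training step would compute the cross-entropy between the predicted logits and the next-word targets and back-propagate through the decoder and, optionally, the CNN, which corresponds to maximizing the likelihood of the description given the image; at inference time the caption is generated word by word from the image vector.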
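As a side note on the metric, scores like the BLEU-1 numbers above can be computed with standard toolkits. The snippet below is a small example using NLTK's sentence-level BLEU; the captions are made up for illustration, and the paper's reported numbers are corpus-level scores, so this only shows what BLEU-1 measures.

```python
# Toy illustration of BLEU-1 (unigram precision with a brevity penalty) using NLTK.
# The reference and candidate captions are invented for this example.
from nltk.translate.bleu_score import sentence_bleu

references = [["a", "dog", "is", "running", "on", "the", "beach"]]
candidate = ["a", "dog", "runs", "on", "the", "beach"]

# weights=(1.0,) restricts the score to unigram matches, i.e. BLEU-1.
bleu1 = sentence_bleu(references, candidate, weights=(1.0,))
print(f"BLEU-1: {bleu1:.2f}")
```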