Deep Visual-Semantic Alignments for Generating Image Descriptions

14 Apr 2015 | Andrej Karpathy, Li Fei-Fei
The paper presents a model that generates natural language descriptions of images and their regions. The approach leverages datasets of images and their sentence descriptions to learn inter-modal correspondences between language and visual data. The model combines Convolutional Neural Networks (CNNs) over image regions, bidirectional Recurrent Neural Networks (RNNs) over sentences, and a structured objective that aligns the two modalities through a multimodal embedding. The authors then describe a Multimodal RNN architecture that uses the inferred alignments to generate novel descriptions of image regions. The model is evaluated in retrieval experiments on the Flickr8K, Flickr30K, and MSCOCO datasets, where it achieves state-of-the-art results, and its generated descriptions significantly outperform retrieval baselines on both full images and a new dataset of region-level annotations.
The paper also discusses related work and limitations, and provides additional experimental details and visualizations.
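The alignment component pairs CNN region embeddings with bidirectional-RNN word embeddings in a shared multimodal space, scoring an image-sentence pair by letting each word "pick" its best-matching region, and training with a max-margin ranking objective so that matched pairs outscore mismatched ones. Below is a minimal NumPy sketch of that simplified score and ranking loss; the toy dimensions and array names are illustrative assumptions, not the paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (assumed): 2 images and 2 sentences, shared embedding dim h = 4.
# regions[k]: embedded CNN region vectors v_i for image k, shape (n_regions, h)
# words[l]:   embedded BRNN word vectors s_t for sentence l, shape (n_words, h)
h = 4
regions = [rng.standard_normal((3, h)), rng.standard_normal((5, h))]
words = [rng.standard_normal((6, h)), rng.standard_normal((4, h))]

def image_sentence_score(V, S):
    """Simplified alignment score: sum over words t of the best
    region match, S_kl = sum_t max_i (v_i . s_t)."""
    sims = S @ V.T            # (n_words, n_regions) inner products
    return sims.max(axis=1).sum()

# Score matrix over all image-sentence pairs; diagonal entries are
# the ground-truth (matched) pairs.
scores = np.array([[image_sentence_score(V, S) for S in words]
                   for V in regions])

def rank_loss(scores, margin=1.0):
    """Max-margin ranking objective: each matched pair (k == k) should
    outscore mismatched pairs by the margin, in both directions
    (rank sentences given an image, and images given a sentence)."""
    n = scores.shape[0]
    loss = 0.0
    for k in range(n):
        for l in range(n):
            if k == l:
                continue
            loss += max(0.0, scores[k, l] - scores[k, k] + margin)
            loss += max(0.0, scores[l, k] - scores[k, k] + margin)
    return loss
```

In the paper, minimizing this objective is what induces the region-to-word alignments that the Multimodal RNN is subsequently trained on; here the loss is only evaluated, not optimized.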