Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models


10 Nov 2014 | Ryan Kiros, Ruslan Salakhutdinov, Richard S. Zemel
This paper introduces an encoder-decoder pipeline that learns a joint multimodal embedding space for images and text, together with a novel language model for decoding distributed representations from that space. The encoder supports ranking of images and sentences, while the decoder generates novel descriptions. Using an LSTM to encode sentences, the model achieves state-of-the-art retrieval performance on Flickr8K and Flickr30K without relying on object detections, and sets new best results when image features from the 19-layer Oxford (VGG) convolutional network are used.

The learned embedding space captures multimodal regularities that can be expressed as vector arithmetic: for example, *image of a blue car* - "blue" + "red" lies near images of red cars. The paper also proposes the structure-content neural language model (SC-NLM), which disentangles the structure of a sentence from its content, conditioned on representations from the encoder. The SC-NLM generates realistic image captions and improves over previous models.

The model is evaluated on image caption generation, image-sentence ranking, and multimodal linguistic regularities, and is compared against template-based, composition-based, and neural network methods. The authors relate their approach to encoder-decoder methods in machine translation, viewing image caption generation as a translation problem. Trained on a large collection of image descriptions, the model achieves state-of-the-art results on several benchmarks; the paper also reports additional experiments and further details on the model's performance.
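To make the encoder side concrete, below is a minimal PyTorch sketch of a joint visual-semantic embedding trained with a pairwise ranking loss: an LSTM encodes sentences and a linear layer projects precomputed CNN image features into the same unit-norm space. This is not the authors' implementation; the class and function names, dimensions, and margin value are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    """Sketch of a visual-semantic embedding: an LSTM maps word sequences
    and a linear layer maps precomputed CNN features into a shared space."""

    def __init__(self, vocab_size, word_dim=300, embed_dim=1024, cnn_dim=4096):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)
        self.lstm = nn.LSTM(word_dim, embed_dim, batch_first=True)
        self.img_proj = nn.Linear(cnn_dim, embed_dim)

    def encode_sentences(self, tokens):
        # tokens: (batch, seq_len) word indices
        _, (h, _) = self.lstm(self.embed(tokens))
        return F.normalize(h[-1], dim=-1)          # unit-norm sentence vectors

    def encode_images(self, cnn_features):
        # cnn_features: (batch, cnn_dim) from a pretrained CNN
        return F.normalize(self.img_proj(cnn_features), dim=-1)

def pairwise_ranking_loss(img, sent, margin=0.2):
    """Hinge loss over all mismatched image-sentence pairs in the batch."""
    scores = img @ sent.t()                            # cosine similarities
    pos = scores.diag().unsqueeze(1)                   # matching pairs
    cost_im = (margin + scores - pos).clamp(min=0)     # wrong sentence for an image
    cost_s = (margin + scores - pos.t()).clamp(min=0)  # wrong image for a sentence
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    return cost_im.masked_fill(mask, 0).sum() + cost_s.masked_fill(mask, 0).sum()
```

In this setup, minimizing the loss pushes each image closer to its own caption than to other captions in the batch (and vice versa), which is what enables the cross-modal ranking results reported in the paper.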
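The multimodal vector arithmetic can likewise be probed with a simple nearest-neighbor query once image and word embeddings live in the same normalized space. The sketch below is an assumption-laden illustration (the function and variable names are hypothetical, not from the paper's code) of a query of the form *image* - "blue" + "red".

```python
import torch
import torch.nn.functional as F

def analogy_query(image_vec, minus_word_vec, plus_word_vec, image_bank, k=5):
    """Return indices of images nearest to (image - word_a + word_b).

    All vectors are assumed to be unit-normalized embeddings from the same
    joint space; image_bank is an (N, dim) matrix of candidate image vectors.
    """
    query = F.normalize(image_vec - minus_word_vec + plus_word_vec, dim=-1)
    sims = image_bank @ query          # cosine similarity to each candidate
    return sims.topk(k=k).indices      # top-k nearest images

# e.g. analogy_query(emb_blue_car, word_emb["blue"], word_emb["red"], all_image_embeddings)
# should rank images of red cars highly if the regularity holds.
```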