Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models

10 Nov 2014 | Ryan Kiros, Ruslan Salakhutdinov, Richard S. Zemel
This paper introduces an encoder-decoder pipeline for image caption generation that learns a multimodal joint embedding space for images and text, together with a novel language model for decoding distributed representations. The pipeline unifies joint image-text embedding models with multimodal neural language models. The structure-content neural language model (SC-NLM) disentangles the structure of a sentence from its content, conditioned on representations produced by the encoder. The encoder allows ranking of images and sentences, while the decoder generates novel descriptions. Using an LSTM to encode sentences, the model matches state-of-the-art performance on Flickr8K and Flickr30K without using object detections, and it sets new best results when using features from the 19-layer Oxford convolutional network. The learned embedding space captures multimodal regularities in terms of vector-space arithmetic, e.g. the embedding of an image of a blue car, minus "blue", plus "red", lies near images of red cars. Sample captions generated for 800 images are made available for comparison. The paper also discusses the effectiveness of the LSTM sentence encoder for ranking images and sentences and explores multimodal linguistic regularities in the learned vector space.
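To make the encoder side concrete, the sketch below shows a joint visual-semantic embedding in the spirit of the paper: an LSTM sentence encoder and a linear projection of CNN image features map both modalities into a shared space, trained with a pairwise ranking (hinge) loss over cosine scores. This is a minimal PyTorch-style sketch under assumed names and dimensions (word_dim, embed_dim, cnn_dim, margin), not the authors' implementation.

# Minimal sketch (illustrative, not the authors' code) of a joint
# visual-semantic embedding with a pairwise ranking loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    def __init__(self, vocab_size, word_dim=300, embed_dim=1024, cnn_dim=4096):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        self.lstm = nn.LSTM(word_dim, embed_dim, batch_first=True)
        self.img_proj = nn.Linear(cnn_dim, embed_dim)   # projects CNN image features

    def encode_text(self, tokens):                      # tokens: (batch, seq_len) word ids
        _, (h, _) = self.lstm(self.word_emb(tokens))    # final LSTM hidden state
        return F.normalize(h[-1], dim=-1)               # unit-norm sentence embedding

    def encode_image(self, cnn_feats):                  # cnn_feats: (batch, cnn_dim)
        return F.normalize(self.img_proj(cnn_feats), dim=-1)

def pairwise_ranking_loss(img, txt, margin=0.2):
    """Hinge loss on cosine scores: each matched image-sentence pair should
    outscore mismatched sentences (for the image) and mismatched images
    (for the sentence) by at least the margin."""
    scores = img @ txt.t()                              # (batch, batch) cosine scores
    diag = scores.diag().view(-1, 1)                    # matched-pair scores
    cost_txt = (margin + scores - diag).clamp(min=0)    # contrastive sentences per image
    cost_img = (margin + scores - diag.t()).clamp(min=0)  # contrastive images per sentence
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    return cost_txt.masked_fill(mask, 0).sum() + cost_img.masked_fill(mask, 0).sum()

With both modalities in one space, the multimodal regularity mentioned above amounts to a nearest-neighbour search over image embeddings for a query vector such as encode_text("image of a blue car") - encode_text("blue") + encode_text("red").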