Deep Fragment Embeddings for Bidirectional Image Sentence Mapping

22 Jun 2014 | Andrej Karpathy, Armand Joulin, Li Fei-Fei
The paper introduces a model for bidirectional retrieval of images and sentences by embedding visual and natural language data into a common multi-modal space. Unlike previous models that directly map whole images and full sentences into a common embedding space, this model works at a finer level by embedding fragments of images (objects) and fragments of sentences (typed dependency tree relations) into a common space. The model includes a ranking objective and a new fragment alignment objective to learn the direct association between these fragments across modalities. Extensive experimental evaluation shows that reasoning on both the global level of images and sentences and the finer level of their respective fragments significantly improves performance on image-sentence retrieval tasks. The model also provides interpretable predictions since the inferred inter-modal fragment alignment is explicit. The authors report significant improvements over state-of-the-art methods on datasets such as Pascal1K, Flickr8K, and Flickr30K.
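To make the fragment-level idea concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes hypothetical names and dimensions (W_img, W_sent, embed_image_fragments, embed_sentence_fragments, D_IMG, D_SENT, D_EMBED): image fragments stand in for CNN descriptors of detected objects, sentence fragments for dependency-relation features, both are projected into a shared space, pairwise inner products give fragment alignment scores, and a simplified margin-based ranking loss stands in for the paper's global ranking objective (the separate fragment alignment objective is not reproduced here).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature sizes: CNN descriptors for detected objects,
# word-vector features for dependency relations, and the shared space.
D_IMG, D_SENT, D_EMBED = 4096, 600, 1000

# Learned projections into the common multi-modal space
# (random placeholders standing in for trained weights).
W_img = rng.normal(scale=0.01, size=(D_EMBED, D_IMG))
W_sent = rng.normal(scale=0.01, size=(D_EMBED, D_SENT))


def embed_image_fragments(cnn_feats):
    """Map object-detection descriptors into the shared space."""
    return cnn_feats @ W_img.T                      # (n_objects, D_EMBED)


def embed_sentence_fragments(rel_feats):
    """Map dependency-relation features into the shared space."""
    return np.maximum(rel_feats @ W_sent.T, 0)      # (n_relations, D_EMBED)


def image_sentence_score(img_frags, sent_frags):
    """Global image-sentence score aggregated from fragment alignments.

    Simplification: average the positive part of all pairwise inner
    products instead of the paper's learned fragment alignment.
    """
    pairwise = img_frags @ sent_frags.T             # (n_objects, n_relations)
    return np.maximum(pairwise, 0).mean()


def ranking_loss(scores, margin=1.0):
    """Max-margin ranking loss on a score matrix whose diagonal holds the
    correct image-sentence pairs (a simplified global ranking objective)."""
    n = scores.shape[0]
    loss = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                # Rank sentences for a fixed image and images for a fixed sentence.
                loss += max(0.0, margin + scores[i, j] - scores[i, i])
                loss += max(0.0, margin + scores[j, i] - scores[i, i])
    return loss / n


# Toy usage: score a small batch of image-sentence pairs with random features.
images = [rng.normal(size=(3, D_IMG)) for _ in range(4)]      # 3 detected objects each
sentences = [rng.normal(size=(5, D_SENT)) for _ in range(4)]  # 5 dependency relations each
S = np.array([[image_sentence_score(embed_image_fragments(im),
                                    embed_sentence_fragments(se))
               for se in sentences] for im in images])
print("image-sentence score matrix:\n", S)
print("ranking loss:", ranking_loss(S))
```

In the paper itself, the two sets of fragments come from an object detector over the image and from typed dependency relations over the sentence, and training couples this kind of global ranking term with the fragment alignment objective so that individual objects end up aligned with the phrases that describe them.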