22 Jun 2014 | Andrej Karpathy, Armand Joulin, Li Fei-Fei
This paper introduces a model for bidirectional retrieval of images and sentences through a multi-modal embedding of visual and natural language data. Unlike previous models that map whole images or whole sentences directly into a common embedding space, this model embeds fragments of images (object detections) and fragments of sentences (typed dependency tree relations) into a common space. A new fragment alignment objective learns to directly associate these fragments across modalities. Extensive experiments show that reasoning on both the global level of images and sentences and the finer level of their respective fragments significantly improves performance on image-sentence retrieval tasks. The model also produces interpretable predictions, since the inferred inter-modal fragment alignments are explicit.
The model breaks both images and sentences into fragments and reasons about their latent, inter-modal correspondences. Image fragments correspond to object detections plus the scene context; sentence fragments are typed dependency tree relations. A fragment-level loss function complements a traditional sentence-image ranking loss. The model is trained on a set of images paired with their natural language descriptions; at test time it must rank a fixed set of withheld sentences given an image query, and vice versa. A small sketch of the shared fragment embedding follows below.
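As a concrete illustration of the shared space, the sketch below embeds one image fragment (a CNN feature of a detected object) and one sentence fragment (the two word vectors of a typed dependency relation) with learned linear projections and scores them with an inner product. The function names, dimensions, and the ReLU nonlinearity are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

# Illustrative dimensions; these sizes are assumptions, not the paper's settings.
D_IMG, D_WORD, D_EMBED = 4096, 200, 1000

rng = np.random.default_rng(0)
W_img = rng.normal(scale=0.01, size=(D_EMBED, D_IMG))        # image-fragment projection
W_sent = rng.normal(scale=0.01, size=(D_EMBED, 2 * D_WORD))  # sentence-fragment projection
b_sent = np.zeros(D_EMBED)

def embed_image_fragment(cnn_feature):
    """Project the CNN feature of one detected object (or the whole scene) into the shared space."""
    return W_img @ cnn_feature

def embed_sentence_fragment(head_vec, dependent_vec):
    """Project one typed dependency relation (head word, dependent word) into the shared space."""
    x = np.concatenate([head_vec, dependent_vec])
    return np.maximum(0.0, W_sent @ x + b_sent)   # ReLU nonlinearity (an assumption)

def fragment_similarity(v_img, v_sent):
    """Fragment-level similarity is an inner product in the shared embedding space."""
    return float(v_img @ v_sent)

# Usage with random stand-in features:
v = embed_image_fragment(rng.normal(size=D_IMG))
s = embed_sentence_fragment(rng.normal(size=D_WORD), rng.normal(size=D_WORD))
print(fragment_similarity(v, s))
```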
The architecture is a neural network connected to raw image pixels on one side and 1-of-k word representations on the other: a Convolutional Neural Network (CNN) encodes the image fragments, and a neural network over dependency relations encodes the sentence fragments. The objective function combines a Global Ranking Objective with a Fragment Alignment Objective. The Global Ranking Objective ensures that the computed image-sentence similarities are consistent with the ground-truth pairings, while the Fragment Alignment Objective explicitly learns the visual appearance of sentence fragments.
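A minimal sketch of the two objectives, under simplifying assumptions: the image-sentence score pools thresholded fragment similarities, the global ranking term is a bidirectional hinge loss on those scores, and the fragment alignment term is a hinge loss on individual fragment pairs labeled by whether they come from a corresponding image-sentence pair. The pooling rule, margins, and normalizations here are assumptions, not the paper's exact formulation.

```python
import numpy as np

def image_sentence_score(img_frags, sent_frags):
    """Pool fragment similarities into a single image-sentence score.
    img_frags: (n_img, d) image-fragment embeddings; sent_frags: (n_sent, d) sentence-fragment
    embeddings. Averaging thresholded similarities is a simplifying assumption."""
    sims = img_frags @ sent_frags.T          # fragment similarity matrix
    return float(np.maximum(sims, 0.0).mean())

def global_ranking_loss(scores, margin=1.0):
    """Global Ranking Objective: scores[k, l] is the similarity of image k and sentence l,
    with (k, k) the ground-truth pair. Each correct pair must outscore incorrect pairs
    by a margin, in both retrieval directions."""
    n = scores.shape[0]
    loss = 0.0
    for k in range(n):
        for l in range(n):
            if l != k:
                loss += max(0.0, margin + scores[k, l] - scores[k, k])  # sentence retrieval
                loss += max(0.0, margin + scores[l, k] - scores[k, k])  # image retrieval
    return loss

def fragment_alignment_loss(frag_sims, y):
    """Fragment Alignment Objective: frag_sims[i, j] is the inner product of image fragment i
    and sentence fragment j; y[i, j] is +1 if they may align (same image-sentence pair),
    -1 otherwise. A plain hinge loss stands in for the paper's alignment scheme."""
    return float(np.maximum(0.0, 1.0 - y * frag_sims).sum())
```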
The model is evaluated on the Pascal1K, Flickr8K, and Flickr30K datasets, where it outperforms previous methods on image-sentence retrieval. It also produces interpretable predictions, since the inferred inter-modal fragment alignments are explicit, and it generalizes to more fine-grained subcategories and out-of-sample classes. Limitations remain: the failure cases show that the model cannot count or reason about the spatial positions of objects. Future work includes extending the model to support counting, to reason about spatial positions, and to move beyond bags of fragments.
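To make the retrieval protocol concrete, the sketch below computes the usual ranking metrics (Recall@K and median rank) from a score matrix produced by a model like the one sketched above; the metric choices mirror standard practice on these benchmarks rather than quoting the paper's tables.

```python
import numpy as np

def retrieval_metrics(scores, ks=(1, 5, 10)):
    """scores[k, l] = model similarity of image k and sentence l, with (k, k) the true pair.
    Returns Recall@K and the median rank for sentence retrieval given an image query;
    transpose `scores` to evaluate image retrieval given a sentence query."""
    n = scores.shape[0]
    ranks = np.empty(n, dtype=int)
    for k in range(n):
        order = np.argsort(-scores[k])                    # sentences sorted by descending score
        ranks[k] = int(np.where(order == k)[0][0]) + 1    # 1-based rank of the true sentence
    recall_at = {K: float(np.mean(ranks <= K)) for K in ks}
    return recall_at, float(np.median(ranks))
```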