20 Mar 2013 | Richard Socher, Milind Ganjoo, Hamsa Sridhar, Osbert Bastani, Christopher D. Manning, Andrew Y. Ng
This paper introduces a zero-shot learning model that can recognize objects in images even when no training images are available for those object classes. Its only knowledge about unseen categories comes from semantic word representations learned from unsupervised text corpora. Unlike previous zero-shot models, which can only differentiate between unseen classes, this model achieves state-of-the-art performance on seen classes while still performing reasonably on unseen classes. It does so by first detecting whether an image is an outlier in the semantic space and then routing it to one of two recognition models. Crucially, the model does not require manually defined semantic features for either words or images.
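A rough sketch of this two-stage procedure is below. This is not the authors' code: the function and argument names are placeholders, and nearest-word-vector assignment stands in for the paper's likelihood over unseen classes.

```python
import numpy as np

def classify(image_vec, unseen_word_vectors, seen_classifier, outlier_score, threshold=0.5):
    """Two-stage zero-shot decision rule (illustrative sketch).

    image_vec           : image already mapped into the word-vector space
    unseen_word_vectors : dict of unseen class name -> word vector
    seen_classifier     : any standard classifier trained on the seen classes
    outlier_score       : function scoring how far image_vec is from the seen classes
    threshold           : cutoff above which the image is treated as unseen
    """
    if outlier_score(image_vec) < threshold:
        # Looks like a known category: defer to the standard seen-class classifier.
        return seen_classifier.predict(image_vec.reshape(1, -1))[0]
    # Otherwise pick the unseen class whose word vector is closest in the
    # semantic space (a stand-in for the paper's likelihood over unseen classes).
    return min(unseen_word_vectors,
               key=lambda c: np.linalg.norm(image_vec - unseen_word_vectors[c]))
```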
The model maps images into a semantic space of words that is learned by a neural network. The word vectors capture distributional similarities from a large, unsupervised text corpus. By learning an image mapping into this space, the word vectors become grounded in the visual modality, providing prototypical visual instances for various words. The model also computes an outlier detection probability, which determines whether a new image lies on the manifold of known categories. If the image is of a known category, a standard classifier assigns its label. Otherwise, the image is assigned to the unseen class whose word vector it most likely corresponds to.
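One simple way to realize the mapping and the outlier check described above is sketched here, purely as an illustration: a linear least-squares map stands in for the paper's neural-network mapping, and an isotropic-Gaussian score (with an assumed bandwidth `sigma`) stands in for the paper's outlier probability.

```python
import numpy as np

def learn_image_to_word_mapping(image_features, labels, word_vectors, reg=1e-3):
    """Fit a linear map W so that W @ x lands near the word vector of x's label.

    image_features: (n, d_img) array; labels: length-n list of class names;
    word_vectors:   class name -> (d_word,) vector.
    """
    X = np.asarray(image_features)                   # (n, d_img)
    Y = np.stack([word_vectors[y] for y in labels])  # (n, d_word)
    d = X.shape[1]
    # Ridge-regularized normal equations.
    W = np.linalg.solve(X.T @ X + reg * np.eye(d), X.T @ Y).T  # (d_word, d_img)
    return W

def map_to_semantic_space(W, x):
    """Project a single image-feature vector into the word-vector space."""
    return W @ x

def outlier_score(mapped_vec, seen_word_vectors, sigma=1.0):
    """Crude score for 'is this off the manifold of known categories?'.

    Places an isotropic Gaussian (illustrative bandwidth sigma) around each
    seen class's word vector; higher values suggest an unseen class.
    """
    dists = [np.linalg.norm(mapped_vec - w) for w in seen_word_vectors.values()]
    return 1.0 - np.exp(-min(dists) ** 2 / (2.0 * sigma ** 2))
```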
The model is tested on the CIFAR-10 dataset, where it achieves high accuracy on seen classes and reasonable accuracy on unseen classes. It improves on previous work by not requiring manually defined semantic features and by transferring knowledge across modalities from natural language. Figure 1 of the paper illustrates the approach: images are mapped into the semantic space, and outlier detection decides whether to classify them among seen or unseen classes. Compared with related work on knowledge transfer, the model needs no manually defined semantic or visual attributes for the zero-shot classes, since its semantic word representations are learned from large, unsupervised text corpora.
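The word representations themselves can come from any distributional model trained on unlabeled text. As an assumption for illustration only (these are not the vectors or corpus the authors trained on), off-the-shelf embeddings could serve as the class prototypes for the CIFAR-10 label names:

```python
# Hypothetical stand-in for the paper's word vectors: load pretrained
# distributional embeddings via gensim's downloader.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")  # returns KeyedVectors
cifar10_classes = ["airplane", "automobile", "bird", "cat", "deer",
                   "dog", "frog", "horse", "ship", "truck"]
word_vectors = {c: vectors[c] for c in cifar10_classes}
```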