21 Mar 2014 | Mohammad Norouzi*, Tomas Mikolov, Samy Bengio, Yoram Singer, Jonathon Shlens, Andrea Frome, Greg S. Corrado, Jeffrey Dean
This paper proposes a simple method for constructing an image embedding system from an existing n-way image classifier and a set of semantic word embeddings. The method maps images into the semantic embedding space via a convex combination of class label embedding vectors, without any additional training. Concretely, it uses the probabilistic predictions of the existing classifier to compute a weighted average of the corresponding label embeddings, yielding a continuous embedding vector for each image. This vector is then used to extrapolate the classifier's predictions beyond the training labels to unseen test labels.
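As a concrete illustration, here is a minimal NumPy sketch of the two steps under the paper's setup (function names such as `conse_embedding` and `zero_shot_scores` are illustrative, not from the paper; unseen labels are ranked by cosine similarity in the embedding space):

```python
import numpy as np

def conse_embedding(probs, label_embeddings, top_t=10):
    """Map an image into the semantic space as a convex combination
    of the embeddings of its top-T predicted training labels.

    probs:            (n_labels,) softmax output of the trained classifier
    label_embeddings: (n_labels, d) word embedding for each training label
    top_t:            T, the number of top predictions to combine
    """
    top = np.argsort(probs)[::-1][:top_t]       # indices of the T most probable labels
    weights = probs[top] / probs[top].sum()     # renormalize so the weights sum to 1
    return weights @ label_embeddings[top]      # (d,) convex combination

def zero_shot_scores(image_vec, unseen_embeddings):
    """Score unseen labels by cosine similarity to the image's embedding."""
    a = image_vec / np.linalg.norm(image_vec)
    b = unseen_embeddings / np.linalg.norm(unseen_embeddings, axis=1, keepdims=True)
    return b @ a                                # (n_unseen,) similarity per unseen label
```

Because the weights are nonnegative and sum to one, the image embedding always lies inside the convex hull of the training label embeddings, which is what lets an unchanged classifier generalize to labels it never saw.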
The method, called "convex combination of semantic embeddings" (ConSE), is evaluated on the ImageNet zero-shot learning task. Using a convolutional neural network trained on 1000 object categories from ImageNet, ConSE achieves 9.4% hit@1 and 24.7% hit@5 on 1600 unseen object categories, outperforming the previous state of the art on the same task. ConSE is compared with the DeViSE model, which uses the same convolutional network but replaces its Softmax layer with a linear transformation into the embedding space. ConSE outperforms DeViSE on all zero-shot test sets and for all values of T (the number of top classifier predictions combined), with ConSE(10) performing best.
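The flat hit@k metric reported here simply asks whether the correct label appears among the k highest-scoring predictions; a minimal sketch:

```python
import numpy as np

def hit_at_k(scores, true_label, k=5):
    """1 if the true label is among the k highest-scoring labels, else 0.

    scores:     (n_labels,) similarity score for each candidate label
    true_label: index of the ground-truth label
    """
    topk = np.argsort(scores)[::-1][:k]
    return int(true_label in topk)
```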
The paper also examines ConSE on the standard classification task over the original 1000 training labels. ConSE(10) improves upon the Softmax baseline in hierarchical precision at 5, 10, and 20, suggesting that the mistakes ConSE makes are more semantically consistent with the correct labels. However, ConSE underperforms DeViSE on the 1000-class task itself, which is expected because DeViSE is trained with a k-nearest-neighbor retrieval objective on that specific set of 1000 labels.
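Hierarchical precision@k, introduced in the DeViSE evaluation, softens flat accuracy by also crediting predictions that are semantically close to the true label. A rough sketch of one plausible formulation, assuming a caller-supplied helper `hierarchy_neighbors(y, k)` (hypothetical; the papers derive this neighbor set from the WordNet hierarchy, and the exact protocol may differ):

```python
import numpy as np

def hierarchical_precision_at_k(scores, true_label, k, hierarchy_neighbors):
    """Fraction of the top-k predictions falling inside the set of labels
    semantically closest to the true label.

    hierarchy_neighbors(y, k): returns a set of at least k labels nearest
    to y in the semantic hierarchy (e.g., WordNet); supplied by the caller.
    """
    topk = np.argsort(scores)[::-1][:k]
    correct_set = hierarchy_neighbors(true_label, k)
    return sum(1 for y in topk if y in correct_set) / k
```

Under a metric like this, confusing a dog breed with a closely related breed costs far less than confusing it with, say, a vehicle, which is why ConSE's semantically coherent mistakes score well.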
The paper concludes that although ConSE is simple, it outperforms more elaborate joint-training approaches on zero-shot learning and on performance metrics that weight errors by semantic quality. ConSE leverages the strengths of a state-of-the-art image classifier and text embedding system, generalizes naturally to other visual and text models, and provides a natural representation of prediction confidence that could be useful wherever such a signal matters.