28 Aug 2015 | Zeynep Akata*, Scott Reed†, Daniel Walter†, Honglak Lee† and Bernt Schiele*
This paper addresses the challenge of fine-grained image classification, particularly in scenarios where labeled training data is scarce or unavailable. The authors propose a Structured Joint Embedding (SJE) framework that leverages both input and output embeddings to improve zero-shot classification performance. The framework learns a compatibility function that measures the similarity between input image features and output embeddings, allowing for accurate classification without labeled training data.

The study evaluates various supervised and unsupervised output embeddings, including human-annotated attributes, unsupervised word embeddings from text corpora, and hierarchical embeddings derived from taxonomies. The results show that unsupervised output embeddings, especially those learned from fine-grained text, can achieve competitive or superior performance compared to supervised methods. By combining different output embeddings, the authors further enhance the classification accuracy, demonstrating the complementary nature of these embeddings. The paper also introduces a novel weakly-supervised Word2Vec variant that improves accuracy when combined with other output embeddings. Overall, the SJE framework significantly improves state-of-the-art results on datasets such as Animals with Attributes and Caltech-UCSD Birds, highlighting the potential of unsupervised and weakly-supervised methods in fine-grained image classification.
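The zero-shot inference step described above can be sketched in a few lines: a bilinear compatibility function scores an image feature against each class's output embedding, and the highest-scoring class wins. The sketch below is illustrative only; the dimensions, the random matrix `W` (which the paper learns from data, here left untrained), and all variable names are assumptions, not the authors' exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)

D, E, C = 4, 3, 5  # assumed dims: image features, output embedding, classes

# W would be learned by SJE; here it is a random stand-in for illustration.
W = rng.standard_normal((D, E))
# One output embedding per class (e.g. an attribute or word-embedding vector).
phi = rng.standard_normal((C, E))

def predict(theta_x, W, phi):
    """Pick the class whose output embedding is most compatible with the
    image feature under a bilinear score theta(x)^T W phi(y)."""
    scores = theta_x @ W @ phi.T  # compatibility with every class at once
    return int(np.argmax(scores))

x = rng.standard_normal(D)  # a (random) image feature vector
label = predict(x, W, phi)
assert 0 <= label < C
```

Because the class is chosen purely by comparing against output embeddings, classes unseen during training can be recognized simply by supplying their embeddings at test time; combining embeddings amounts to concatenating the per-class vectors (and widening `W` accordingly).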