Improved Zero-Shot Classification by Adapting VLMs with Text Descriptions


3 Apr 2024 | Oindrila Saha, Grant Van Horn, Subhransu Maji
This paper presents a method to improve the zero-shot classification performance of Vision-Language Models (VLMs) by adapting them with text descriptions generated by Large Language Models (LLMs) together with fine-grained image classification datasets. The approach leverages the ability of LLMs to generate structured and accurate descriptions of categories, which are paired with existing image datasets to create coarsely aligned image-text pairs for training. The method is tested on several fine-grained domains, including birds, flowers, and aircraft, and yields significant improvements in zero-shot classification accuracy. The paper also demonstrates that geographic priors can be as effective as visual appearance cues for zero-shot classification. The method outperforms prior work on prompt-based tuning of VLMs and is evaluated on a benchmark of 14 datasets, where it improves performance across multiple domains and is robust to different training strategies. The paper also discusses limitations, including the need to verify the correctness of the generated text descriptions. Overall, the study shows that adapting VLMs with LLM-generated text descriptions can significantly improve their zero-shot classification performance in fine-grained domains.