Improved Zero-Shot Classification by Adapting VLMs with Text Descriptions


3 Apr 2024 | Oindrila Saha, Grant Van Horn, Subhransu Maji
This paper presents a method to improve the zero-shot classification performance of Vision-Language Models (VLMs) by adapting them with text descriptions generated by Large Language Models (LLMs) together with fine-grained image classification datasets. The approach leverages the ability of LLMs to generate structured and accurate descriptions of categories, which are paired with existing image datasets to create coarsely aligned image-text pairs for training. The method is tested on several fine-grained domains, including birds, flowers, and aircraft, and yields significant improvements in zero-shot classification accuracy. The paper also demonstrates that geographic priors can be as effective as visual appearance cues for zero-shot classification. The method outperforms prior work on prompt-based tuning of VLMs and is evaluated on a benchmark of 14 datasets, where it improves performance across multiple domains and is robust to different training strategies. The paper also discusses limitations, including the need to verify the correctness of the generated text descriptions. Overall, the study shows that adapting VLMs with LLM-generated text descriptions can significantly improve their zero-shot classification performance in fine-grained domains.