7 Aug 2024 | M. Jehanzeb Mirza, Leonid Karlinsky, Wei Lin, Sivan Doveh, Jakub Micorek, Mateusz Kozinski, Hilde Kuehne, Horst Possegger
This paper introduces Meta-Prompting for Visual Recognition (MPVR), a framework that automates the generation of category-specific prompts for zero-shot visual recognition using Large Language Models (LLMs). MPVR requires only minimal information about the target task: a short natural-language description and the list of class labels. From these, it prompts an LLM to produce diverse, task-specific query templates, which are then populated with class names to yield category-specific prompts for a Vision-Language Model (VLM). The resulting prompts are ensembled into a robust zero-shot classifier.

MPVR generalizes well across zero-shot image recognition benchmarks and outperforms existing methods, achieving improvements over CLIP of up to 19.8% with GPT and 18.2% with Mixtral. Evaluated on 20 diverse datasets, it shows significant gains on most of them, demonstrating that the visual knowledge encoded in LLMs can enhance zero-shot recognition without manual prompt engineering. The framework is effective with both closed- and open-source LLMs and scales to visual domains where visual data may be unavailable. The authors also open-source a dataset of 2.5 million unique class descriptions generated with the framework.

The paper additionally discusses related work on zero-shot classification and prompt engineering, and highlights MPVR's advantages in automation, generalization, and performance.
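To make the pipeline concrete, below is a minimal sketch of an MPVR-style flow under stated assumptions; it is not the authors' released code. The `query_llm` function, the example task description, and the aircraft class names are hypothetical placeholders, and the VLM side assumes the `open_clip` package with an OpenAI CLIP checkpoint.

```python
# Minimal sketch of an MPVR-style pipeline (assumptions, not the authors' code).
import torch
import open_clip


def query_llm(meta_prompt: str) -> list[str]:
    """Hypothetical stand-in for the LLM call. A real run would send
    `meta_prompt` to GPT or Mixtral; fixed templates are returned here
    so the sketch executes end-to-end."""
    return [
        "A photo of a {}.",
        "A close-up photograph of a {}, a type of aircraft.",
        "An image of a {} on an airport tarmac.",
    ]


def build_zero_shot_classifier(task_description, class_names, model, tokenizer):
    # Step 1: meta-prompt the LLM with a short task description so that it
    # emits task-specific query templates containing a '{}' placeholder.
    meta_prompt = (
        f"The task is: {task_description}. "
        "Write diverse prompt templates for describing images from this "
        "domain, using '{}' as a placeholder for the category name."
    )
    templates = query_llm(meta_prompt)

    # Step 2: populate every template with every class name, embed the
    # resulting category-specific prompts, and ensemble them per class by
    # averaging the L2-normalized text embeddings.
    class_weights = []
    with torch.no_grad():
        for name in class_names:
            prompts = [t.format(name) for t in templates]
            text_emb = model.encode_text(tokenizer(prompts))
            text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
            class_weights.append(text_emb.mean(dim=0))
    # One column per class; re-normalize the ensembled embeddings.
    W = torch.stack(class_weights, dim=1)
    return W / W.norm(dim=0, keepdim=True)


# Step 3: zero-shot classification = cosine similarity of the image
# embedding against the ensembled per-class text embeddings.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
W = build_zero_shot_classifier(
    "fine-grained classification of aircraft models",  # hypothetical task
    ["Boeing 747", "Airbus A380"], model, tokenizer)
# image = preprocess(Image.open("plane.jpg")).unsqueeze(0)
# img_emb = model.encode_image(image)
# img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
# prediction = (img_emb @ W).argmax(dim=-1)
```

Averaging L2-normalized text embeddings per class is the standard CLIP prompt-ensembling recipe; MPVR's contribution is that the templates being ensembled are generated by the LLM for the specific task rather than hand-written.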