Democratizing Fine-grained Visual Recognition with Large Language Models


2024 | Mingxuan Liu, Subhankar Roy, Wenjing Li, Zhun Zhong, Nicu Sebe, Elisa Ricci
This paper introduces Fine-grained Semantic Category Reasoning (FineR), a novel approach to fine-grained visual recognition (FGVR). FGVR is the task of identifying subordinate-level categories in images, such as species of birds or mushrooms, and is challenging because the differences between similar objects are subtle. Traditional FGVR systems depend on expert annotations, which are costly and time-consuming to obtain. FineR sidesteps this requirement by leveraging large language models (LLMs) to reason about category names without expert knowledge.

FineR first extracts visual attributes from images with a visual question answering (VQA) model, then feeds these attributes to an LLM, which uses its internal world knowledge to propose candidate class names. These candidate names are used to build a semantic classifier, which a vision-language model (VLM) then applies to classify test images. The whole pipeline is training-free and can operate in new domains where expert annotations are hard to obtain.
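The sketch below illustrates this three-stage pipeline in Python. It is a minimal illustration, not the authors' implementation: the VQA and LLM stages are left as hypothetical stubs (extract_attributes and reason_category_names are placeholder names), while the final zero-shot classification step uses OpenAI's CLIP as the VLM.

```python
# Minimal sketch of a FineR-style pipeline. The VQA and LLM calls are
# hypothetical stubs; CLIP (https://github.com/openai/CLIP) serves as the
# vision-language model for the final classification step.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def extract_attributes(image_path: str) -> list[str]:
    """Hypothetical VQA stage: ask attribute questions about the image
    (e.g. 'What color are the bird's wings?') and collect the answers."""
    raise NotImplementedError  # plug in a VQA model such as BLIP-2

def reason_category_names(attributes: list[str]) -> list[str]:
    """Hypothetical LLM stage: prompt an LLM with the visual attributes
    and ask it to propose plausible fine-grained category names."""
    raise NotImplementedError  # plug in an LLM via your chat API of choice

def classify(image_path: str, candidate_names: list[str]) -> str:
    """Zero-shot classification: score the image against a semantic
    classifier built from the LLM-proposed category names."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    prompts = clip.tokenize(
        [f"a photo of a {name}" for name in candidate_names]
    ).to(device)
    with torch.no_grad():
        image_feat = model.encode_image(image)
        text_feat = model.encode_text(prompts)
        image_feat /= image_feat.norm(dim=-1, keepdim=True)
        text_feat /= text_feat.norm(dim=-1, keepdim=True)
        sims = (image_feat @ text_feat.T).squeeze(0)
    return candidate_names[sims.argmax().item()]
```

Because every stage is an off-the-shelf model queried at inference time, no component is fine-tuned, which is what makes the approach training-free and easy to swap parts in and out of.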
FineR is evaluated on several fine-grained datasets: Caltech-UCSD Bird-200, Stanford Car-196, Stanford Dog-120, Flower-102, and Oxford-IIIT Pet-37. It outperforms existing methods in both clustering accuracy (cACC) and semantic accuracy (sACC). On a newly collected Pokemon dataset, it correctly identifies 7 of the 10 ground-truth categories, again surpassing the other methods. In a human study on identifying specific car models and pet breeds, FineR also outperforms the machine-based baselines.

These results show that FineR effectively captures fine-grained visual details and leverages them for reasoning, making it a promising step toward democratizing FGVR. Because the method is training-free, modular, and interpretable, it is well suited to real-world applications where expert annotations are unavailable.
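For reference, clustering accuracy is commonly computed by matching predicted clusters to ground-truth classes with the Hungarian algorithm and measuring accuracy under the optimal matching. The snippet below is a sketch of that standard computation; the paper's exact evaluation protocol may differ in details.

```python
# Standard clustering-accuracy (cACC) computation: find the optimal
# one-to-one matching between predicted clusters and ground-truth classes,
# then score accuracy under that matching.
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    n_true = y_true.max() + 1
    n_pred = y_pred.max() + 1
    # Contingency matrix: rows = predicted clusters, cols = true classes.
    counts = np.zeros((n_pred, n_true), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        counts[p, t] += 1
    # Hungarian matching that maximizes total agreement.
    rows, cols = linear_sum_assignment(counts, maximize=True)
    return counts[rows, cols].sum() / len(y_true)

# Example: four samples whose cluster ids are a relabeling of the truth.
print(clustering_accuracy(np.array([0, 0, 1, 1]), np.array([1, 1, 0, 0])))  # 1.0
```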