Democratizing Fine-grained Visual Recognition with Large Language Models


2024 | Mingxuan Liu, Subhankar Roy, Wenjing Li, Zhun Zhong, Nicu Sebe, Elisa Ricci
This paper introduces Fine-grained Semantic Category Reasoning (FineR), a novel approach to fine-grained visual recognition (FGVR). FGVR is the task of identifying subordinate-level categories in images, such as species of birds or mushrooms, and is challenging because the differences between similar objects are subtle. Traditional FGVR systems depend on expert annotations, which are costly and time-consuming to obtain. FineR sidesteps this requirement by leveraging large language models (LLMs) to reason about category names without expert knowledge.

FineR first extracts visual attributes from images with a visual question answering (VQA) model, then feeds these attributes to an LLM, which uses its internal world knowledge to propose candidate class names. These candidate names are used to build a semantic classifier, which a vision-language model (VLM) then applies to classify test images. The whole pipeline is training-free and can operate in new domains where expert annotations are hard to obtain.
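The sketch below illustrates this three-stage pipeline in Python. It is a minimal illustration, not the authors' implementation: the VQA and LLM stages are left as hypothetical stubs (extract_attributes and reason_category_names are placeholder names), while the final zero-shot classification step uses OpenAI's CLIP as the VLM.

```python
# Minimal sketch of a FineR-style pipeline. The VQA and LLM calls are
# hypothetical stubs; CLIP (https://github.com/openai/CLIP) serves as the
# vision-language model for the final classification step.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def extract_attributes(image_path: str) -> list[str]:
    """Hypothetical VQA stage: ask attribute questions about the image
    (e.g. 'What color are the bird's wings?') and collect the answers."""
    raise NotImplementedError  # plug in a VQA model such as BLIP-2

def reason_category_names(attributes: list[str]) -> list[str]:
    """Hypothetical LLM stage: prompt an LLM with the visual attributes
    and ask it to propose plausible fine-grained category names."""
    raise NotImplementedError  # plug in an LLM via your chat API of choice

def classify(image_path: str, candidate_names: list[str]) -> str:
    """Zero-shot classification: score the image against a semantic
    classifier built from the LLM-proposed category names."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    prompts = clip.tokenize(
        [f"a photo of a {name}" for name in candidate_names]
    ).to(device)
    with torch.no_grad():
        image_feat = model.encode_image(image)
        text_feat = model.encode_text(prompts)
        image_feat /= image_feat.norm(dim=-1, keepdim=True)
        text_feat /= text_feat.norm(dim=-1, keepdim=True)
        sims = (image_feat @ text_feat.T).squeeze(0)
    return candidate_names[sims.argmax().item()]
```

Because every stage is an off-the-shelf model queried at inference time, no component is fine-tuned, which is what makes the approach training-free and easy to swap parts in and out of.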
FineR is evaluated on several fine-grained datasets: Caltech-UCSD Bird-200, Stanford Car-196, Stanford Dog-120, Flower-102, and Oxford-IIIT Pet-37. It outperforms existing methods in both clustering accuracy (cACC) and semantic accuracy (sACC). On a newly collected Pokemon dataset, it correctly identifies 7 of the 10 ground-truth categories, again surpassing the other methods. In a human study on identifying specific car models and pet breeds, FineR also outperforms the machine-based baselines.

These results show that FineR effectively captures fine-grained visual details and leverages them for reasoning, making it a promising step toward democratizing FGVR. Because the method is training-free, modular, and interpretable, it is well suited to real-world applications where expert annotations are unavailable.
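For reference, clustering accuracy is commonly computed by matching predicted clusters to ground-truth classes with the Hungarian algorithm and measuring accuracy under the optimal matching. The snippet below is a sketch of that standard computation; the paper's exact evaluation protocol may differ in details.

```python
# Standard clustering-accuracy (cACC) computation: find the optimal
# one-to-one matching between predicted clusters and ground-truth classes,
# then score accuracy under that matching.
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    n_true = y_true.max() + 1
    n_pred = y_pred.max() + 1
    # Contingency matrix: rows = predicted clusters, cols = true classes.
    counts = np.zeros((n_pred, n_true), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        counts[p, t] += 1
    # Hungarian matching that maximizes total agreement.
    rows, cols = linear_sum_assignment(counts, maximize=True)
    return counts[rows, cols].sum() / len(y_true)

# Example: four samples whose cluster ids are a relabeling of the truth.
print(clustering_accuracy(np.array([0, 0, 1, 1]), np.array([1, 1, 0, 0])))  # 1.0
```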