RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition

20 Mar 2024 | Ziyu Liu*, Zeyi Sun*, Yuhang Zang, Wei Li, Pan Zhang, Xiaoyi Dong, Yuanjun Xiong, Dahua Lin, Jiaqi Wang
The paper introduces RAR (Retrieving And Ranking Augmented MLLMs), a method to enhance the few-shot and zero-shot recognition abilities of Multimodal Large Language Models (MLLMs) in visual recognition tasks. RAR combines a multi-modal retriever based on CLIP, which creates and stores explicit memory for different categories, with an MLLM that ranks the retrieved results during inference. This approach addresses the limitations of CLIP in handling fine-grained categories and the constraints of MLLMs in managing large context windows. The method is evaluated on five fine-grained visual recognition benchmarks, 11 few-shot image recognition datasets, and two object detection datasets with vast vocabularies. RAR demonstrates significant improvements in accuracy, achieving an average improvement of 6.2% over 11 image classification datasets under the 4-shot setting and a 6.4% improvement on the LVIS dataset. The paper also discusses the integration of RAR into various MLLMs and provides ablation studies to validate the effectiveness of different design choices.
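The retrieve-then-rank pipeline described above can be illustrated with a minimal sketch. The CLIP calls below use the Hugging Face `transformers` API; the category memory, the `rank_with_mllm` placeholder, and the specific checkpoint name are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of the retrieve-then-rank idea: CLIP retrieves a shortlist
# of candidate categories from an explicit memory, and an MLLM re-ranks them.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def build_memory(category_names):
    """Encode category names into a normalized text-embedding 'memory'."""
    inputs = processor(text=category_names, return_tensors="pt", padding=True)
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

def retrieve_top_k(image, memory, category_names, k=5):
    """Retrieve the k categories whose memory embeddings are closest to the image."""
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        img_feat = model.get_image_features(**inputs)
    img_feat = torch.nn.functional.normalize(img_feat, dim=-1)
    sims = (img_feat @ memory.T).squeeze(0)
    topk = sims.topk(k)
    return [category_names[i] for i in topk.indices.tolist()]

def rank_with_mllm(image, candidates):
    """Hypothetical stand-in: prompt an MLLM to pick/rank among the candidates."""
    # e.g. list `candidates` in a prompt, feed image + prompt to the MLLM,
    # and parse the preferred category from the generated answer.
    raise NotImplementedError

# Usage: coarse retrieval narrows the large vocabulary to a small candidate set
# that fits in the MLLM's context window; the MLLM then makes the final call.
# image = Image.open("query.jpg")
# memory = build_memory(all_category_names)
# candidates = retrieve_top_k(image, memory, all_category_names, k=5)
# prediction = rank_with_mllm(image, candidates)
```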