RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition


20 Mar 2024 | Ziyu Liu, Zeyi Sun, Yuhang Zang, Wei Li, Pan Zhang, Xiaoyi Dong, Yuanjun Xiong, Dahua Lin, Jiaqi Wang
This paper introduces RAR, a Retrieving And Ranking augmented method for Multimodal Large Language Models (MLLMs) that improves zero-shot and few-shot recognition on datasets with large, fine-grained vocabularies. RAR combines the strengths of CLIP and MLLMs: a multi-modal retriever stores explicit memory for each category beyond the model's immediate context window. At inference time, RAR retrieves the top-k most similar entries from this memory and uses the MLLM to rank them and produce the final prediction. This design addresses MLLMs' inherent limitations in fine-grained recognition while preserving their comprehensive knowledge base, significantly boosting accuracy across a range of vision-language recognition tasks. RAR delivers consistent gains on 5 fine-grained visual recognition benchmarks, 11 few-shot image classification datasets, and 2 object detection datasets under the zero-shot recognition setting, and it integrates seamlessly into various MLLMs. Ablation studies on hyperparameters and fine-tuning data sources further demonstrate the robustness and effectiveness of the RAR approach.
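To make the retrieve-then-rank flow concrete, here is a minimal Python sketch of the pipeline described above. It assumes CLIP-style embedding functions and an MLLM ranking callable; all names (embed_text, embed_image, mllm_rank) are hypothetical stand-ins, not the authors' released API.

```python
# Sketch of a retrieve-then-rank pipeline in the spirit of RAR.
# embed_text / embed_image / mllm_rank are assumed, hypothetical callables.
import numpy as np

def build_memory(category_names, embed_text):
    """Store one normalized text embedding per category as explicit memory."""
    vecs = np.stack([embed_text(name) for name in category_names])
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
    return category_names, vecs

def retrieve_top_k(image, memory, embed_image, k=5):
    """CLIP-style retrieval: cosine similarity between the query image
    and every category embedding held in memory."""
    names, vecs = memory
    q = embed_image(image)
    q /= np.linalg.norm(q)
    scores = vecs @ q                       # cosine similarity (unit vectors)
    top = np.argsort(scores)[::-1][:k]      # indices of the k best matches
    return [names[i] for i in top]

def predict(image, memory, embed_image, mllm_rank, k=5):
    """Retrieve k candidate labels, then let the MLLM rank them
    and take the top-ranked candidate as the final prediction."""
    candidates = retrieve_top_k(image, memory, embed_image, k)
    # mllm_rank is assumed to prompt the MLLM with the image plus the
    # candidate list and return the candidates reordered by preference.
    ranked = mllm_rank(image, candidates)
    return ranked[0]
```

The split of labor is the point: the retriever narrows a huge vocabulary down to a handful of plausible labels that fit in the MLLM's context window, and the MLLM's stronger reasoning resolves the fine-grained distinctions among them.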