5 Mar 2024 | Gen Luo, Yiyi Zhou, Yuxin Zhang, Xiaowu Zheng, Xiaoshuai Sun, Rongrong Ji
This paper introduces Mixture-of-Resolution Adaptation (MRA), a novel and efficient method for improving the visual recognition capabilities of multimodal large language models (MLLMs). MRA tackles fine-grained visual recognition by combining low- and high-resolution visual features. It employs a dual visual pathway design in which high-resolution information is embedded into the low-resolution pathway via novel mixture-of-resolution adapters (MR-Adapters). Because the language model only consumes tokens from the low-resolution pathway, this design keeps the MLLM's input sequence short and improves its efficiency.
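To make the dual-pathway idea concrete, here is a minimal PyTorch sketch of a mixture-of-resolution adapter. The module name MRAdapter, the gated residual fusion rule, and all dimensions are illustrative assumptions drawn from the description above, not the paper's exact implementation:

```python
# Illustrative sketch only: the gated residual fusion and all names/dimensions
# are assumptions, not the paper's verbatim design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MRAdapter(nn.Module):
    """Injects high-resolution conv features into a low-resolution ViT stage."""

    def __init__(self, low_dim: int, high_dim: int):
        super().__init__()
        # Project high-res channels to the low-res token width.
        self.mapper = nn.Conv2d(high_dim, low_dim, kernel_size=1)
        # A learnable gate decides how much high-res detail to inject.
        self.gate = nn.Sequential(
            nn.Linear(2 * low_dim, low_dim // 4),
            nn.GELU(),
            nn.Linear(low_dim // 4, 1),
            nn.Tanh(),
        )

    def forward(self, low: torch.Tensor, high: torch.Tensor) -> torch.Tensor:
        # low:  (B, N, C_l) tokens from the low-resolution ViT pathway
        # high: (B, C_h, H, W) feature map from the high-resolution conv pathway
        b, n, c = low.shape
        side = int(n ** 0.5)  # assume a square N-token grid, as in ViTs
        mapped = self.mapper(high)                     # (B, C_l, H, W)
        mapped = F.adaptive_avg_pool2d(mapped, side)   # align to the token grid
        mapped = mapped.flatten(2).transpose(1, 2)     # (B, N, C_l)
        # Pool both streams to compute one scalar gate per sample.
        g = self.gate(torch.cat([low.mean(1), mapped.mean(1)], dim=-1))
        # Gated residual: the token count N is unchanged, so the LLM's input
        # sequence length does not grow with image resolution.
        return low + g.unsqueeze(1) * mapped
```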
The proposed method is applied to the LLaVA model, resulting in the LLaVA-HR model. Extensive experiments on 11 vision-language tasks show that LLaVA-HR outperforms existing MLLMs on 8 of them, including a +9.4% gain on TextVQA. LLaVA-HR also remains efficient: training completes in 20 hours, and inference is three times faster than LLaVA-1.5. Its performance is further validated on benchmarks such as MME, POPE, and MM-VET, where it achieves significant improvements.
The MRA method effectively addresses the visual shortcomings of MLLMs by leveraging high-resolution images without a significant increase in computational cost. The dual visual pathways and MR-Adapters let the model process high- and low-resolution images jointly, capturing fine-grained visual detail while preserving efficiency. The results demonstrate that MRA is a promising approach for enhancing the visual recognition capabilities of MLLMs.
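As a quick sanity check of the efficiency claim, the hypothetical adapter sketched above can be driven with a high-resolution feature map while the token sequence handed to the LLM keeps its original length. The shapes below are illustrative, loosely mirroring a CLIP-ViT-L low-resolution pathway paired with a convolutional high-resolution pathway:

```python
# Hypothetical shapes for illustration only.
low_tokens = torch.randn(1, 576, 1024)     # 24x24 ViT tokens (e.g., 336px input)
high_feats = torch.randn(1, 1536, 32, 32)  # conv features from a 1024px input
adapter = MRAdapter(low_dim=1024, high_dim=1536)
fused = adapter(low_tokens, high_feats)
print(fused.shape)  # torch.Size([1, 576, 1024]) -- sequence length unchanged
```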