Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models

5 Mar 2024 | Gen Luo, Yiyi Zhou, Yuxin Zhang, Xiawu Zheng, Xiaoshuai Sun, Rongrong Ji
The paper addresses the visual recognition shortcomings of multimodal large language models (MLLMs) from the perspective of image resolution. It proposes *Mixture-of-Resolution Adaptation* (MRA), which combines low- and high-resolution visual features to improve fine-grained perception. MRA runs two visual pathways, one per resolution, and embeds high-resolution information into the low-resolution pathway via *mixture-of-resolution adapters* (MR-Adapters). Because the language model still receives only the low-resolution token sequence, this design keeps the input sequence length of MLLMs short while improving performance. Applying MRA to the LLaVA model yields LLaVA-HR, which outperforms existing MLLMs on 8 of 11 vision-language tasks, including a significant +9.4% gain on TextVQA. LLaVA-HR also remains efficient in training and inference, requiring only 20 training hours and running inference about 3× faster than LLaVA-1.5. Extensive experiments and ablation studies validate the effectiveness and efficiency of MRA.
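To make the fusion idea concrete, the sketch below shows how an MR-Adapter-style module might inject gated high-resolution detail into the low-resolution token stream without lengthening it. This is a minimal illustration under assumed shapes: the 1×1 projection, the scalar tanh gate, and the pooling to the low-resolution token grid are our assumptions for exposition, not the paper's exact design.

```python
# Minimal PyTorch sketch of mixing high-res features into the low-res pathway.
# Layer choices, gating, and shapes are illustrative assumptions.
import torch
import torch.nn as nn


class MRAdapter(nn.Module):
    """Embeds high-resolution features into the low-resolution pathway."""

    def __init__(self, low_dim: int, high_dim: int):
        super().__init__()
        # Project the high-res (e.g., convolutional) feature map to the low-res width.
        self.high_proj = nn.Conv2d(high_dim, low_dim, kernel_size=1)
        # Light transform of the low-res (e.g., ViT) token features.
        self.low_proj = nn.Sequential(nn.Linear(low_dim, low_dim), nn.GELU())
        # Scalar gate controlling how much high-res detail is mixed in.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, f_low: torch.Tensor, f_high: torch.Tensor) -> torch.Tensor:
        # f_low:  (B, N, C)      low-resolution token features
        # f_high: (B, C_h, H, W) high-resolution feature map
        b, n, c = f_low.shape
        side = int(n ** 0.5)  # assumes a square low-res token grid, N = side * side
        f_high = self.high_proj(f_high)
        # Pool the high-res map down to the low-res token grid, then flatten to tokens.
        f_high = nn.functional.adaptive_avg_pool2d(f_high, side)
        f_high = f_high.flatten(2).transpose(1, 2)  # (B, N, C)
        # Residual fusion: keep the low-res pathway, add gated high-res detail.
        return f_low + self.low_proj(f_low) + torch.tanh(self.gate) * f_high


# Usage: fuse a high-res conv feature map into 576 (24x24) low-res tokens.
adapter = MRAdapter(low_dim=1024, high_dim=1536)
f_low = torch.randn(2, 576, 1024)       # tokens from the low-res pathway
f_high = torch.randn(2, 1536, 48, 48)   # feature map from the high-res pathway
fused = adapter(f_low, f_high)          # (2, 576, 1024): same length as f_low
```

The point of the sketch is the output shape: the fused features have exactly as many tokens as the low-resolution pathway, which is why the LLM's input sequence stays short even though high-resolution information has been absorbed.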