MoVA: Adapting Mixture of Vision Experts to Multimodal Context
**Abstract:**
This paper addresses the challenge of adapting vision encoders to diverse image content in multimodal large language models (MLLMs). While large-scale pre-trained vision encoders such as CLIP and DINOv2 achieve promising performance, no single encoder dominates across all image content understanding tasks. To mitigate this inherent bias, MoVA (Mixture of Vision Experts) is proposed: a novel MLLM that adaptively routes and fuses task-specific vision experts with a coarse-to-fine mechanism. In the coarse-grained stage, a context-aware expert routing strategy dynamically selects the most suitable vision experts according to the user instruction, the input image, and the expertise of each expert; this is realized by integrating expert-routing low-rank adaptation (LoRA) into the large language model (LLM). In the fine-grained stage, the mixture-of-vision-expert adapter (MoV-Adapter) extracts and fuses task-specific knowledge from the selected experts, enhancing the model's generalization ability. Extensive experiments show that MoVA achieves significant performance gains over state-of-the-art methods on a wide range of challenging multimodal benchmarks without any additional modifications.
**Keywords:**
Multimodal large language model · Vision encoder · Mixture-of-expert
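To make the coarse-to-fine mechanism described above concrete, the following is a minimal conceptual sketch in PyTorch: a coarse-grained router selects a subset of vision experts from a context embedding, and a fine-grained adapter fuses the selected experts' features with context-dependent weights. All class names, dimensions, and the gating scheme here are hypothetical illustrations, not MoVA's actual implementation.

```python
# Hypothetical sketch of coarse-to-fine vision-expert routing and fusion.
# Not the authors' code; names and shapes are illustrative only.
import torch
import torch.nn as nn


class CoarseRouter(nn.Module):
    """Coarse-grained stage: score each vision expert from a context embedding
    (e.g. pooled instruction + image features) and keep the top-k experts."""

    def __init__(self, context_dim: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.scorer = nn.Linear(context_dim, num_experts)
        self.top_k = top_k

    def forward(self, context: torch.Tensor) -> torch.Tensor:
        scores = self.scorer(context)                      # (B, num_experts)
        topk = scores.topk(self.top_k, dim=-1)
        mask = torch.zeros_like(scores).scatter_(-1, topk.indices, 1.0)
        return mask                                        # hard expert selection


class MoVAdapterSketch(nn.Module):
    """Fine-grained stage: fuse features of the selected experts with soft,
    context-dependent weights (a stand-in for the MoV-Adapter)."""

    def __init__(self, context_dim: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(context_dim, num_experts)

    def forward(self, expert_feats: torch.Tensor, context: torch.Tensor,
                mask: torch.Tensor) -> torch.Tensor:
        # expert_feats: (B, num_experts, tokens, feat_dim)
        logits = self.gate(context).masked_fill(mask == 0, float("-inf"))
        weights = logits.softmax(dim=-1)                   # (B, num_experts)
        return (weights[..., None, None] * expert_feats).sum(dim=1)


if __name__ == "__main__":
    B, E, T, C, D = 2, 4, 16, 256, 128                    # toy sizes
    context = torch.randn(B, C)
    expert_feats = torch.randn(B, E, T, D)
    mask = CoarseRouter(C, E)(context)
    fused = MoVAdapterSketch(C, E)(expert_feats, context, mask)
    print(fused.shape)                                     # torch.Size([2, 16, 128])
```

In this sketch the hard selection mask from the coarse stage restricts which experts contribute, while the soft gate in the fine stage decides how much each selected expert contributes; in the paper the coarse routing decision is produced by the LLM itself via expert-routing LoRA rather than by a standalone linear scorer.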