MoVA: Adapting Mixture of Vision Experts to Multimodal Context
**Abstract:**
This paper addresses the challenge of adapting vision encoders to diverse image content in multimodal large language models (MLLMs). While large-scale pre-trained vision encoders such as CLIP and DINOv2 achieve promising performance, no single encoder dominates across all image content understanding tasks. To mitigate this inherent bias, MoVA (Mixture of Vision Experts) is proposed: a novel MLLM that adaptively routes and fuses task-specific vision experts with a coarse-to-fine mechanism. In the coarse-grained stage, a context-aware expert routing strategy dynamically selects the most suitable vision experts according to the user instruction, the input image, and the expertise of each expert; this is realized by integrating expert-routing low-rank adaptation (LoRA) into the large language model (LLM). In the fine-grained stage, the mixture-of-vision-expert adapter (MoV-Adapter) extracts and fuses task-specific knowledge from the selected experts, enhancing the model's generalization ability. Extensive experiments show that MoVA achieves significant performance gains over state-of-the-art methods on a wide range of challenging multimodal benchmarks without any additional modifications.
**Keywords:**
Multimodal large language model · Vision encoder · Mixture-of-expert
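To make the coarse-to-fine mechanism described above concrete, the following is a minimal conceptual sketch in PyTorch: a coarse-grained router selects a subset of vision experts from a context embedding, and a fine-grained adapter fuses the selected experts' features with context-dependent weights. All class names, dimensions, and the gating scheme here are hypothetical illustrations, not MoVA's actual implementation.

```python
# Hypothetical sketch of coarse-to-fine vision-expert routing and fusion.
# Not the authors' code; names and shapes are illustrative only.
import torch
import torch.nn as nn


class CoarseRouter(nn.Module):
    """Coarse-grained stage: score each vision expert from a context embedding
    (e.g. pooled instruction + image features) and keep the top-k experts."""

    def __init__(self, context_dim: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.scorer = nn.Linear(context_dim, num_experts)
        self.top_k = top_k

    def forward(self, context: torch.Tensor) -> torch.Tensor:
        scores = self.scorer(context)                      # (B, num_experts)
        topk = scores.topk(self.top_k, dim=-1)
        mask = torch.zeros_like(scores).scatter_(-1, topk.indices, 1.0)
        return mask                                        # hard expert selection


class MoVAdapterSketch(nn.Module):
    """Fine-grained stage: fuse features of the selected experts with soft,
    context-dependent weights (a stand-in for the MoV-Adapter)."""

    def __init__(self, context_dim: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(context_dim, num_experts)

    def forward(self, expert_feats: torch.Tensor, context: torch.Tensor,
                mask: torch.Tensor) -> torch.Tensor:
        # expert_feats: (B, num_experts, tokens, feat_dim)
        logits = self.gate(context).masked_fill(mask == 0, float("-inf"))
        weights = logits.softmax(dim=-1)                   # (B, num_experts)
        return (weights[..., None, None] * expert_feats).sum(dim=1)


if __name__ == "__main__":
    B, E, T, C, D = 2, 4, 16, 256, 128                    # toy sizes
    context = torch.randn(B, C)
    expert_feats = torch.randn(B, E, T, D)
    mask = CoarseRouter(C, E)(context)
    fused = MoVAdapterSketch(C, E)(expert_feats, context, mask)
    print(fused.shape)                                     # torch.Size([2, 16, 128])
```

In this sketch the hard selection mask from the coarse stage restricts which experts contribute, while the soft gate in the fine stage decides how much each selected expert contributes; in the paper the coarse routing decision is produced by the LLM itself via expert-routing LoRA rather than by a standalone linear scorer.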