MoVA: Adapting Mixture of Vision Experts to Multimodal Context

2024-04-19 | Zhuofan Zong*, Bingqi Ma*, Dazhong Shen, Guanglu Song, Hao Shao, Dongzhi Jiang, Hongsheng Li*, and Yu Liu*
This paper proposes MoVA, a powerful multimodal large language model (MLLM) that adaptively routes and fuses task-specific vision experts using a coarse-to-fine mechanism. The key idea is to leverage the strengths of different vision encoders by dynamically selecting and combining the most relevant ones based on the input context. MoVA consists of a pre-trained large language model (LLM), a base vision encoder, and multiple task-specific vision experts. The LLM selects the most appropriate vision experts for the given task, while the MoV-Adapter module performs fine-grained expert fusion based on the multimodal context.

The coarse-grained stage performs context-aware expert routing, in which the LLM selects the most relevant vision experts based on the input image and instruction. This is achieved through expert-routing low-rank adaptation (LoRA), which improves the efficiency and effectiveness of the routing. The fine-grained stage relies on the MoV-Adapter, which extracts and fuses task-specific knowledge from the selected experts using a mixture-of-experts (MoE) cross-attention mechanism; a dynamic gating network assigns soft weights to the extracted knowledge based on the input image and instruction.

MoVA is evaluated on a range of benchmarks, including general MLLM benchmarks, visual question answering (VQA), visual grounding, image segmentation, and biomedical understanding. The results show that MoVA achieves significant performance gains over current state-of-the-art methods across a wide range of challenging benchmarks.
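The coarse-grained routing stage can be pictured as asking the LLM, given the instruction and a menu of expert descriptions, which experts to enable, then parsing its answer into a subset of the expert pool. The sketch below is a minimal illustration of that idea only; the expert names, prompt wording, and the `route_experts`/`generate` helpers are assumptions for illustration, not the paper's implementation, which fine-tunes the LLM with an expert-routing LoRA and also conditions on the image.

```python
# Hypothetical sketch of coarse-grained, context-aware expert routing.
# `generate` is any callable that maps a prompt string to the LLM's reply.
from typing import Callable, List

# Illustrative expert pool; the concrete experts are assumptions, not the paper's exact list.
EXPERT_DESCRIPTIONS = {
    "dino": "object-centric features, helpful for grounding and detection",
    "sam": "segmentation-oriented features for pixel-level tasks",
    "pix2struct": "document/chart understanding, OCR-heavy images",
    "biomed": "biomedical image understanding",
}

def route_experts(instruction: str, generate: Callable[[str], str],
                  max_experts: int = 3) -> List[str]:
    """Ask the LLM which experts to enable for this instruction (coarse stage)."""
    menu = "\n".join(f"- {name}: {desc}" for name, desc in EXPERT_DESCRIPTIONS.items())
    prompt = (
        "Given the user instruction below, list the vision experts (by name, "
        "comma-separated) that would help answer it.\n"
        f"Available experts:\n{menu}\n"
        f"Instruction: {instruction}\nSelected experts:"
    )
    reply = generate(prompt)
    # Keep only names the LLM actually mentioned, in pool order.
    selected = [name for name in EXPERT_DESCRIPTIONS if name in reply.lower()]
    return selected[:max_experts]

# Example with a stubbed LLM that always picks the grounding expert.
if __name__ == "__main__":
    print(route_experts("Where is the red cup in the image?", lambda p: "dino"))
```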
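The fine-grained stage can likewise be sketched as a set of cross-attention branches, one per selected expert, whose outputs are mixed by soft weights produced from the multimodal context. The PyTorch module below is a minimal sketch under those assumptions; the class name, dimensions, and the way the context is pooled are illustrative choices, not the paper's exact MoV-Adapter design.

```python
# Minimal PyTorch sketch of MoE-style cross-attention fusion with a dynamic
# gating network, in the spirit of the MoV-Adapter described above.
import torch
import torch.nn as nn

class MoECrossAttentionFusion(nn.Module):
    def __init__(self, dim: int, num_experts: int, num_heads: int = 8):
        super().__init__()
        # One cross-attention "expert branch" per vision expert.
        self.cross_attns = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(num_experts)
        )
        # Dynamic gating: soft weights over experts from pooled multimodal context.
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.GELU(),
                                  nn.Linear(dim, num_experts))

    def forward(self, base_tokens, expert_tokens, text_tokens):
        # base_tokens:   (B, N, D) features from the base vision encoder
        # expert_tokens: list of tensors, each (B, M_i, D), one per expert
        # text_tokens:   (B, T, D) embedded instruction tokens
        context = torch.cat([base_tokens, text_tokens], dim=1).mean(dim=1)  # (B, D)
        weights = torch.softmax(self.gate(context), dim=-1)                 # (B, E)

        fused = base_tokens
        for i, (attn, feats) in enumerate(zip(self.cross_attns, expert_tokens)):
            # Base tokens query each expert's features for task-specific knowledge.
            knowledge, _ = attn(query=base_tokens, key=feats, value=feats)
            fused = fused + weights[:, i].view(-1, 1, 1) * knowledge
        return fused

# Toy usage with random tensors.
if __name__ == "__main__":
    B, N, T, D, E = 2, 16, 8, 256, 3
    fusion = MoECrossAttentionFusion(dim=D, num_experts=E)
    out = fusion(torch.randn(B, N, D),
                 [torch.randn(B, 12, D) for _ in range(E)],
                 torch.randn(B, T, D))
    print(out.shape)  # torch.Size([2, 16, 256])
```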
The key contributions of this paper include the analysis of the inherent bias of individual vision encoders, the proposal of MoVA with coarse-grained context-aware expert routing and fine-grained expert fusion, and the demonstration of the effectiveness of each component through ablation studies.