MoE-LLaVA: Mixture of Experts for Large Vision-Language Models

6 Jul 2024 | Bin Lin, Zhenyu Tang, Yang Ye, Jiaxi Cui, Bin Zhu, Peng Jin, Jinfa Huang, Junwu Zhang, Yatian Pang, Munan Ning, Li Yuan
The paper introduces MoE-LLaVA, a sparse Large Vision-Language Model (LVLM) architecture that leverages Mixture of Experts (MoE) to reduce computational costs while maintaining or improving performance. The authors propose MoE-Tuning, a three-stage training strategy that adapts MoE to LVLMs and prevents the performance degradation typically caused by sparsity. MoE-LLaVA uses learnable routers to activate only the top-k experts for each token, keeping the remaining experts inactive. Extensive experiments show that MoE-LLaVA achieves performance comparable to dense models with significantly fewer parameters, and it outperforms state-of-the-art models on various visual understanding and object hallucination benchmarks. The method is intended to serve as a baseline for sparse LVLMs and to inspire future research on more efficient and effective multi-modal learning systems.
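To make the routing idea concrete, below is a minimal sketch of a top-k sparse MoE feed-forward layer in PyTorch. This is not the authors' implementation: the layer sizes, the two-layer GELU expert MLPs, and the renormalized top-k weighting are illustrative assumptions. Only the general mechanism follows the description above: a learnable linear router scores the experts for each token, and only the top-k experts process that token while the rest stay inactive.

```python
# Minimal sketch of a top-k sparse MoE feed-forward layer (illustrative only;
# dimensions and expert structure are assumptions, not the paper's exact setup).
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMoE(nn.Module):
    def __init__(self, d_model=512, d_hidden=2048, num_experts=4, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Learnable router: one score per expert for each token.
        self.router = nn.Linear(d_model, num_experts, bias=False)
        # Each expert is a small feed-forward network (assumed structure).
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_hidden),
                nn.GELU(),
                nn.Linear(d_hidden, d_model),
            )
            for _ in range(num_experts)
        )

    def forward(self, x):  # x: (batch, seq_len, d_model)
        logits = self.router(x)                                   # (B, S, E)
        probs = F.softmax(logits, dim=-1)
        topk_probs, topk_idx = probs.topk(self.top_k, dim=-1)     # (B, S, k)
        # Renormalize so the selected experts' weights sum to 1 per token.
        topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)

        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            # Tokens for which expert e is among the top-k choices.
            mask = (topk_idx == e)                                 # (B, S, k)
            if not mask.any():
                continue  # this expert stays inactive for the whole batch
            weight = (topk_probs * mask).sum(dim=-1, keepdim=True)  # (B, S, 1)
            token_mask = mask.any(dim=-1)                          # (B, S)
            # Only routed tokens are processed by this expert.
            out[token_mask] += weight[token_mask] * expert(x[token_mask])
        return out


if __name__ == "__main__":
    layer = TopKMoE()
    tokens = torch.randn(2, 16, 512)   # (batch, seq_len, d_model)
    print(layer(tokens).shape)         # torch.Size([2, 16, 512])
```

Because each token is processed by only top_k of the experts, the number of activated parameters per forward pass stays close to that of a much smaller dense model, which is the efficiency argument the paper builds on.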