CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts

9 May 2024 | Jiachen Li, Xinyao Wang, Sijie Zhu, Chia-Wen Kuo, Lu Xu, Fan Chen, Jitesh Jain, Humphrey Shi, Longyin Wen
CuMo is a method for scaling multimodal large language models (LLMs) by integrating co-upcycled, sparsely-gated mixture-of-experts (MoE) blocks into both the vision encoder and the MLP connector, enhancing the model's capabilities with minimal additional activated parameters during inference. CuMo first pre-trains the MLP blocks, then initializes each expert in the MoE block from the pre-trained MLP block during the visual instruction tuning stage, using auxiliary losses to ensure balanced expert loading. CuMo outperforms state-of-the-art multimodal LLMs across various VQA and visual-instruction-following benchmarks within each model size group, all while training exclusively on open-sourced datasets. The code and model weights for CuMo are open-sourced at https://github.com/SHI-Labs/CuMo.
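To make the co-upcycling idea concrete, the sketch below shows a sparsely-gated MoE block whose experts are all initialized from a single pre-trained MLP. This is a minimal PyTorch illustration under assumptions, not the released CuMo code: the module names, expert count, hidden sizes, and Top-K routing shown here are choices made for clarity.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    """A standard two-layer MLP, e.g. a vision-language connector block."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, dim)

    def forward(self, x):
        return self.fc2(F.gelu(self.fc1(x)))

class UpcycledMoE(nn.Module):
    """Sparsely-gated MoE block whose experts are all initialized
    (co-upcycled) from one pre-trained MLP block."""
    def __init__(self, pretrained_mlp: MLP, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        dim = pretrained_mlp.fc1.in_features
        # Each expert starts as a copy of the pre-trained MLP weights.
        self.experts = nn.ModuleList(
            [copy.deepcopy(pretrained_mlp) for _ in range(num_experts)]
        )
        self.router = nn.Linear(dim, num_experts)
        self.top_k = top_k

    def forward(self, x):                       # x: (num_tokens, dim)
        logits = self.router(x)                 # (num_tokens, num_experts)
        probs = logits.softmax(dim=-1)
        topk_probs, topk_idx = probs.topk(self.top_k, dim=-1)
        topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)

        out = torch.zeros_like(x)
        # Dispatch each token to its Top-K experts and mix the outputs.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e
                if mask.any():
                    out[mask] += topk_probs[mask, slot, None] * expert(x[mask])
        return out, logits                      # logits reused for auxiliary losses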
The paper integrates co-upcycled, sparsely-gated MoE blocks into both the CLIP vision encoder and the vision-language MLP connector, strengthening the multimodal LLM from the vision side while adding only a small number of activated parameters at inference. Training proceeds in three stages: pre-training the MLP connector, pre-finetuning to warm up the whole model, and visual instruction fine-tuning with the upcycled MoE blocks, where auxiliary losses stabilize training and keep expert loading balanced. Trained exclusively on open-sourced datasets and pre-trained models, CuMo outperforms state-of-the-art open-sourced and private multimodal LLMs across multiple competitive benchmarks within each model size group. Ablation studies and qualitative analysis further show that the co-upcycled MoE blocks help the model handle complex scenes and give accurate answers while mitigating hallucinations. The conclusion highlights the introduction of the sparse MoE design into multimodal LLMs and its effectiveness in improving both performance and training stability.
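The paper's exact auxiliary-loss formulation is not reproduced here; the sketch below uses two terms commonly adopted for sparse MoE training, a load-balancing loss and a router z-loss, as an assumed stand-in that illustrates how balanced expert loading and training stability are typically encouraged.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits, num_experts):
    """Assumed Switch-Transformer-style balancing term: the product of the
    fraction of tokens routed to each expert (top-1) and the mean router
    probability for that expert, summed over experts and scaled."""
    probs = router_logits.softmax(dim=-1)            # (num_tokens, num_experts)
    top1_idx = probs.argmax(dim=-1)                  # (num_tokens,)
    dispatch = F.one_hot(top1_idx, num_experts).float()
    tokens_per_expert = dispatch.mean(dim=0)         # f_i
    prob_per_expert = probs.mean(dim=0)              # P_i
    return num_experts * torch.sum(tokens_per_expert * prob_per_expert)

def router_z_loss(router_logits):
    """Assumed z-loss: penalizes large router logits to keep routing stable."""
    return torch.logsumexp(router_logits, dim=-1).pow(2).mean()
```

In practice these terms would be added, with small weights, to the visual instruction tuning objective computed from the router logits returned by each MoE block.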