9 May 2024
**CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts**
**Authors:** Jiachen Li, Xinyao Wang, Sijie Zhu, Chia-Wen Kuo, Lu Xu, Fan Chen, Jitesh Jain, Humphrey Shi, Longyin Wen
**Institution:** SHI Labs (Georgia Tech & UIUC), ByteDance Inc.
**GitHub:** https://github.com/SHI-Labs/CuMo
Recent advancements in multimodal large language models (LLMs) have focused on scaling by increasing text-image pair data and enhancing LLMs for multimodal tasks. However, these approaches are computationally expensive and overlook the importance of efficiently improving model capabilities from the vision side. Inspired by the successful application of Mixture-of-Experts (MoE) in LLMs, which improves model scalability during training while keeping inference costs similar to smaller models, the authors propose CuMo. CuMo incorporates Co-upcycled Top-K sparsely-gated MoE blocks into both the vision encoder and the MLP connector, enhancing multimodal LLMs with negligible additional activated parameters during inference.
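To make the core building block concrete, below is a minimal PyTorch sketch of a Top-K sparsely-gated MoE layer of the kind described above. The class name, hidden sizes, expert count, and value of K are illustrative assumptions, not CuMo's actual configuration.

```python
import torch
import torch.nn as nn


class TopKMoEMLP(nn.Module):
    """Top-K sparsely-gated MoE block that stands in for a dense MLP.

    Illustrative sketch only: hidden sizes, expert count, and K are
    placeholder values, not CuMo's actual configuration.
    """

    def __init__(self, dim=1024, hidden_dim=4096, num_experts=4, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts)  # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, dim)
            )
            for _ in range(num_experts)
        )

    def forward(self, x):
        # x: (num_tokens, dim). Route every token to its Top-K experts and
        # combine their outputs using the renormalized router weights.
        gate_probs = self.router(x).softmax(dim=-1)          # (tokens, experts)
        weights, idx = torch.topk(gate_probs, self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                     # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out
```

Because only the Top-K experts run for each token, the number of activated parameters at inference stays close to that of the original dense MLP, which is the property the summary above refers to.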
**Key Contributions:**
- Introduces CuMo, which integrates co-upcycled sparsely-gated MoE layers into the MLP connector and vision encoder.
- Outperforms state-of-the-art multimodal LLMs across various VQA and visual-instruction-following benchmarks within the same model size group.
- Trains exclusively on open-sourced datasets and pre-trained models.
**Methodology:**
- **Sparse MoE Structure:** Replaces each dense MLP block with a sparse MoE block in which a router network selects the Top-K experts per token.
- **Co-Upcycling:** Initializes every expert in an MoE block from the corresponding pre-trained dense MLP block, so training starts from the dense checkpoint rather than from randomly initialized experts (see the sketch after this list).
- **Training Recipe:** A three-stage process (MLP pre-training, pre-finetuning, and visual instruction tuning) with auxiliary losses to stabilize training and keep the load across experts balanced.
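The two steps most specific to this recipe, upcycling the experts from a pre-trained dense MLP and regularizing the router with an auxiliary loss, can be sketched as follows. Both helpers are hypothetical: the function names, the zero-initialized router, and the Switch-Transformer-style balance loss are assumptions drawn from common sparse-MoE practice, and the paper's exact auxiliary losses may differ in detail.

```python
import copy

import torch


def upcycle_from_mlp(dense_mlp, moe_block):
    """Seed every expert of a sparse MoE block with a pre-trained dense MLP.

    Hypothetical helper: assumes each expert has the same module layout as
    `dense_mlp` (e.g. the TopKMoEMLP sketch above). Zero-initializing the
    router so routing starts near-uniform is an assumption, not necessarily
    CuMo's exact choice.
    """
    for expert in moe_block.experts:
        expert.load_state_dict(copy.deepcopy(dense_mlp.state_dict()))
    torch.nn.init.zeros_(moe_block.router.weight)
    torch.nn.init.zeros_(moe_block.router.bias)
    return moe_block


def load_balancing_loss(gate_probs, top_k_idx, num_experts):
    """Switch-Transformer-style auxiliary loss that pushes the router toward
    using all experts evenly; the paper's exact formulation may differ."""
    # Fraction of routing slots assigned to each expert.
    dispatch = torch.bincount(top_k_idx.reshape(-1), minlength=num_experts).float()
    dispatch = dispatch / top_k_idx.numel()
    # Mean router probability per expert.
    importance = gate_probs.mean(dim=0)
    return num_experts * torch.sum(dispatch * importance)
```

In this setup, each dense MLP in the vision encoder and the connector would be replaced by a sparse block seeded with `upcycle_from_mlp` before instruction tuning, and `load_balancing_loss` would be added to the training objective with a small weight.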
**Experiments:**
- Trains CuMo models on open-sourced datasets converted into visual instruction tuning formats.
- Conducts comprehensive evaluations on various VQA and instruction-following benchmarks.
- Performs ablation studies on each module that receives upcycled MoE blocks.
**Results:**
- CuMo outperforms other state-of-the-art multimodal LLMs of comparable size across multiple benchmarks.
- The 7B CuMo model reaches performance comparable to 13B-based multimodal LLMs despite the limited amount of training data.
**Conclusion:**
CuMo effectively scales multimodal LLMs by integrating co-upcycled MoE blocks into the vision encoder and MLP connector, improving model capabilities with negligible additional activated parameters at inference.