20 Jul 2024 | Dengchun Li, Yingzi Ma, Naizheng Wang, Zhengmao Ye, Zhiyuan Cheng, Yinghao Tang, Yan Zhang, Lei Duan, Jie Zuo, Cal Yang, and Mingjie Tang
MixLoRA is a parameter-efficient mixture-of-experts (MoE) method that enhances large language model (LLM) fine-tuning with LoRA-based experts. It constructs a sparse MoE model on top of a frozen pre-trained dense model by inserting multiple LoRA-based experts into the feed-forward network (FFN) block and using a top-k router to assign tokens to experts, which keeps both training and inference efficient. Unlike other LoRA-based MoE methods, MixLoRA further improves performance with independent LoRA adapters on the attention layers and an auxiliary load-balance loss that counteracts router imbalance. In multi-task learning scenarios, MixLoRA achieves a 9% accuracy improvement over state-of-the-art PEFT methods. A high-throughput framework is also introduced that reduces GPU memory consumption by 40% and token computation latency by 30% during training and inference. Compared with LoRA and DoRA on LLaMA-2 7B, MixLoRA improves average accuracy by 5.8% in single-task learning and 9.8% in multi-task learning, and it demonstrates superior downstream-task performance across a range of benchmarks. By combining LoRA's resource efficiency with MoE's versatility, MixLoRA enables efficient and effective fine-tuning of LLMs.
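To make the architecture concrete, below is a minimal PyTorch sketch of a MixLoRA-style FFN block: experts share the frozen dense FFN weights and differ only in their trainable LoRA deltas, a top-k router dispatches tokens, and a Switch-Transformer-style auxiliary loss penalizes router imbalance. This is an illustration under stated assumptions, not the authors' implementation; the class names (`LoRAExpert`, `MixLoRAFFN`), the SiLU activation, the rank/alpha defaults, and the exact form of the load-balance loss are all assumptions.

```python
# Minimal sketch of a MixLoRA-style sparse MoE FFN block (illustrative, not the
# reference implementation). Assumes PyTorch and a generic two-projection FFN.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRAExpert(nn.Module):
    """One expert: the shared frozen FFN plus a trainable low-rank (LoRA) delta."""

    def __init__(self, base_up: nn.Linear, base_down: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base_up, self.base_down = base_up, base_down  # frozen, shared across experts
        d_model, d_ff = base_up.in_features, base_up.out_features
        self.scaling = alpha / r
        # Trainable LoRA A/B pairs for the up- and down-projections (per expert)
        self.up_a = nn.Linear(d_model, r, bias=False)
        self.up_b = nn.Linear(r, d_ff, bias=False)
        self.down_a = nn.Linear(d_ff, r, bias=False)
        self.down_b = nn.Linear(r, d_model, bias=False)
        nn.init.zeros_(self.up_b.weight)    # start as an identity delta
        nn.init.zeros_(self.down_b.weight)

    def forward(self, x):
        h = self.base_up(x) + self.scaling * self.up_b(self.up_a(x))
        h = F.silu(h)
        return self.base_down(h) + self.scaling * self.down_b(self.down_a(h))


class MixLoRAFFN(nn.Module):
    """Sparse MoE FFN: a top-k router sends each token to k LoRA experts."""

    def __init__(self, base_up: nn.Linear, base_down: nn.Linear,
                 num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(base_up.in_features, num_experts, bias=False)
        self.experts = nn.ModuleList(
            [LoRAExpert(base_up, base_down) for _ in range(num_experts)]
        )

    def forward(self, x):                      # x: (tokens, d_model)
        logits = self.router(x)                # (tokens, num_experts)
        probs = logits.softmax(dim=-1)
        topk_p, topk_i = probs.topk(self.top_k, dim=-1)
        topk_p = topk_p / topk_p.sum(dim=-1, keepdim=True)  # renormalize over chosen experts

        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_idx, slot = (topk_i == e).nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue  # no tokens routed to this expert
            out[token_idx] += topk_p[token_idx, slot].unsqueeze(-1) * expert(x[token_idx])

        # Auxiliary load-balance loss: product of the fraction of tokens routed to
        # each expert and the mean router probability, summed over experts.
        num_experts = len(self.experts)
        frac_tokens = F.one_hot(topk_i[:, 0], num_experts).float().mean(dim=0)
        frac_probs = probs.mean(dim=0)
        aux_loss = num_experts * (frac_tokens * frac_probs).sum()
        return out, aux_loss
```

In use, `aux_loss` would be scaled by a small coefficient and added to the task loss so that only the router and LoRA parameters receive gradients while the shared dense FFN stays frozen, mirroring the parameter-efficiency argument in the summary above.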