Higher Layers Need More LoRA Experts

13 Feb 2024 | Chongyang Gao, Kezhen Chen, Jinmeng Rao, Baochen Sun, Ruibo Liu, Daiyi Peng, Yawen Zhang, Xiaoyuan Guo, Jie Yang, VS Subrahmanian
The paper introduces a novel parameter-efficient tuning method called *MoE-LoRA with Layer-wise Expert Allocation (MoLA)* for Transformer-based models. MoLA combines Low-Rank Adaptation (LoRA) and Mixture-of-Experts (MoE) to improve the performance of parameter-efficient fine-tuning (PEFT) methods. The key idea is to allocate a varying number of LoRA experts to each layer of the Transformer, allowing for a more efficient and effective integration of MoE. The authors investigate several architectures with different layer-wise expert configurations: MoLA Triangle, MoLA Inverted-Triangle, MoLA Hourglass, and MoLA Rectangle.

Experiments on six well-known NLP and commonsense QA benchmarks demonstrate that MoLA achieves equal or superior performance compared to other baselines. In particular, allocating more LoRA experts to higher layers further enhances the models' effectiveness, even with a fixed total number of experts. A comprehensive analysis of layer-wise expert redundancy shows that experts in lower layers are more similar to one another and therefore more redundant, supporting the intuition that higher layers should have more experts to handle fine-grained, task-specific patterns.

MoLA also shows promising continual learning capabilities, outperforming other methods on domain-forgetting tasks. Overall, MoLA is a plug-and-play parameter-efficient tuning approach that can be widely applied, offering both improved performance and reduced training costs.
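To make the mechanism concrete, the sketch below shows what a MoLA-style layer might look like in PyTorch: a frozen base projection is augmented with a layer-specific pool of LoRA experts selected by a top-k router, and a helper returns per-layer expert counts for the four shapes. This is a minimal illustration, not the authors' released implementation; the class names, rank/alpha/top-k values, and the group counts (e.g., 2-4-6-8 for the inverted triangle) are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRAExpert(nn.Module):
    """One low-rank adapter: x -> (x @ A^T @ B^T) * (alpha / r)."""

    def __init__(self, in_dim: int, out_dim: int, rank: int = 8, alpha: int = 16):
        super().__init__()
        self.A = nn.Parameter(torch.randn(rank, in_dim) * 0.01)  # down-projection
        self.B = nn.Parameter(torch.zeros(out_dim, rank))        # up-projection, zero-init
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return (x @ self.A.T @ self.B.T) * self.scale


class MoLALayer(nn.Module):
    """Frozen base projection plus a router over a layer-specific pool of LoRA experts.

    Hypothetical sketch: only the adapters and the router are trainable.
    """

    def __init__(self, base: nn.Linear, num_experts: int, top_k: int = 2):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the pretrained weight stays frozen
        self.experts = nn.ModuleList(
            [LoRAExpert(base.in_features, base.out_features) for _ in range(num_experts)]
        )
        self.router = nn.Linear(base.in_features, num_experts)
        self.top_k = min(top_k, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.base(x)
        gate = F.softmax(self.router(x), dim=-1)          # (..., num_experts)
        weights, idx = gate.topk(self.top_k, dim=-1)      # route each token to its top-k experts
        weights = weights / weights.sum(dim=-1, keepdim=True)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., k] == e).unsqueeze(-1).to(x.dtype)  # tokens assigned to expert e
                out = out + mask * weights[..., k : k + 1] * expert(x)
        return out


def layerwise_allocation(shape: str = "inverted_triangle", num_layers: int = 32) -> list[int]:
    """Per-layer expert counts for the four shapes; layers are split into four
    equal groups from bottom to top. The counts here are illustrative."""
    per_layer = {
        "triangle":          [8, 6, 4, 2],  # more experts in lower layers
        "inverted_triangle": [2, 4, 6, 8],  # more experts in higher layers
        "hourglass":         [8, 2, 2, 8],
        "rectangle":         [5, 5, 5, 5],
    }[shape]
    group = num_layers // 4
    return [n for n in per_layer for _ in range(group)]
```

Wrapping, say, the projection matrices of layer l with `MoLALayer(base, num_experts=layerwise_allocation()[l])` captures the central design choice: the total adapter budget stays fixed, but higher layers receive more experts, which is the inverted-triangle configuration the paper reports as most effective.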