13 Feb 2024 | Chongyang Gao, Kezhen Chen, Jinmeng Rao, Baochen Sun, Ruibo Liu, Daiyi Peng, Yawen Zhang, Xiaoyuan Guo, Jie Yang, VS Subrahmanian
This paper introduces MoLA, a parameter-efficient tuning method that combines LoRA with a Mixture-of-Experts (MoE) architecture and allocates experts on a per-layer basis: each Transformer layer can host a different number of LoRA experts. Experiments on six NLP and commonsense QA benchmarks show that MoLA matches or outperforms baselines, and configurations that place more experts in higher layers perform best, particularly under tight parameter budgets, while also scaling better. An analysis of expert redundancy explains why: experts in lower layers are more similar to one another (higher redundancy), so additional capacity is better spent on higher layers. MoLA is a plug-and-play approach suitable for a variety of applications, and the code is available at https://github.com/GCYZSL/MoLA.
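A minimal PyTorch sketch of the idea follows: a frozen base projection whose update comes from a small set of LoRA experts selected per token by a router, with the number of experts varying by layer. The class names, rank, top-k routing, and the allocation schedule are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRAExpert(nn.Module):
    """One low-rank adapter: delta(x) = B(A(x)) * scale, with rank r << d."""
    def __init__(self, d_in, d_out, rank=8, alpha=16):
        super().__init__()
        self.A = nn.Linear(d_in, rank, bias=False)
        self.B = nn.Linear(rank, d_out, bias=False)
        nn.init.zeros_(self.B.weight)  # standard LoRA init: adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.B(self.A(x)) * self.scale


class MoLALinear(nn.Module):
    """Frozen base linear plus a top-k routed mixture of LoRA experts.

    `num_experts` is chosen per Transformer layer; the paper's finding suggests
    assigning more experts to higher layers than to lower ones.
    """
    def __init__(self, base_linear: nn.Linear, num_experts: int, top_k: int = 2, rank: int = 8):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad_(False)  # only the adapters and the router are trained
        d_in, d_out = base_linear.in_features, base_linear.out_features
        self.experts = nn.ModuleList(LoRAExpert(d_in, d_out, rank) for _ in range(num_experts))
        self.router = nn.Linear(d_in, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x):                        # x: (batch, seq, d_in)
        logits = self.router(x)                  # (batch, seq, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # normalize over the selected experts
        out = self.base(x)
        # Add the weighted contribution of each expert to the tokens routed to it.
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., k] == e).unsqueeze(-1)
                out = out + mask * weights[..., k:k + 1] * expert(x)
        return out


# Layer-wise allocation for a hypothetical 32-layer model, with more experts
# in higher layers (an illustrative schedule, not the paper's exact numbers).
experts_per_layer = [2] * 8 + [4] * 8 + [6] * 8 + [8] * 8
```

In this sketch, wrapping a layer's attention or feed-forward projections in `MoLALinear` with a per-layer expert count is what distinguishes the layer-wise allocation from a uniform MoE-LoRA setup; only the adapters and routers add trainable parameters.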