24 Jun 2024 | Hongbo Li, Sen Lin, Lingjie Duan, Yingbin Liang, Ness B. Shroff
This paper presents the first theoretical analysis of the Mixture-of-Experts (MoE) model in Continual Learning (CL), focusing on overparameterized linear regression tasks. The analysis shows that MoE diversifies its experts to specialize in different tasks, while its router learns to select the right expert for each task and to balance the load across all experts. It further demonstrates that, unlike in existing MoE studies that do not consider continual task arrival, terminating updates of the gating network after sufficiently many training rounds is necessary for system convergence in CL.

The paper derives explicit expressions for the expected forgetting and the overall generalization error, which characterize the benefit of MoE in CL. Notably, adding more experts does not necessarily improve learning performance and may delay convergence. Experiments on synthetic and real datasets extend these insights from linear models to deep neural networks (DNNs), showing that MoE improves learning performance in CL as well. Together, the theoretical and empirical results indicate that MoE can significantly reduce catastrophic forgetting and improve generalization in CL.
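To make the described mechanism concrete, here is a minimal, illustrative sketch of an MoE of linear-regression experts with a gating network that is frozen after a fixed number of training rounds. This is not the paper's exact algorithm; the top-1 routing rule, the learning rates, and the gating-update form (`t_stop`, the `0.1` scaling) are simplifying assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

d, M = 8, 4                           # feature dimension, number of experts
W = rng.normal(size=(M, d)) * 0.01    # expert weights (one linear regressor per expert)
theta = np.zeros((M, d))              # gating-network parameters

def route(x):
    """Top-1 routing: select the expert with the highest gating score."""
    scores = theta @ x
    return int(np.argmax(scores))

def train_round(x, y, t, lr=0.1, t_stop=50):
    """One CL round: route the sample, update the chosen expert,
    and update the gate only before round t_stop (then freeze it)."""
    m = route(x)
    err = W[m] @ x - y
    W[m] -= lr * err * x              # SGD step on the selected expert only
    if t < t_stop:                    # terminate gating updates after t_stop rounds
        theta[m] -= lr * 0.1 * err * x   # illustrative (assumed) gating update
    return float(err**2)
```

Freezing `theta` after `t_stop` fixes the task-to-expert assignment, so each expert thereafter sees a stable stream of samples and can converge, which mirrors the paper's observation that continued gating updates would keep reshuffling tasks across experts.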