Boosting Continual Learning of Vision-Language Models via Mixture-of-Experts Adapters

3 Jun 2024 | Jiazuo Yu, Yunzhi Zhuge, Lu Zhang, Ping Hu, Dong Wang, Huchuan Lu and You He
This paper proposes a parameter-efficient continual learning framework for vision-language models (VLMs) that addresses long-term forgetting in incremental learning. The framework dynamically expands a pre-trained CLIP model by integrating Mixture-of-Experts (MoE) adapters in response to new tasks. To preserve the zero-shot recognition capability of VLMs, a Distribution Discriminative Auto-Selector (DDAS) is introduced that automatically routes in-distribution inputs to the MoE adapters and out-of-distribution inputs to the original CLIP. The MoE adapters enable efficient adaptation and inter-task collaboration, while the DDAS provides effective predictions for seen data and zero-shot transfer for unseen data within a unified framework.

The method is implemented on a frozen CLIP model, with the MoE adapters built on CLIP's parallel encoders; only the adapters are updated as new tasks arrive.

Extensive experiments across various settings show that the method consistently outperforms previous state-of-the-art approaches while reducing the parameter-training burden by 60%. Evaluated on Multi-domain Task Incremental Learning (MTIL) and Class Incremental Learning (CIL), it achieves superior classification accuracy and training efficiency, shows exceptional resistance to catastrophic forgetting, and outperforms previous methods by 3.6%, 7.0%, and 4.2% in a 5-shot setting.

The method is also computationally efficient, reducing training parameters by roughly 60%, GPU burden by 15%, and iteration time by 60% compared with the state-of-the-art ZSCL method. Overall, the proposed framework is parameter-efficient, scalable, and effective for continual learning with vision-language models.
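To make the adapter design described above concrete, the following is a minimal PyTorch-style sketch of a Mixture-of-Experts adapter attached in parallel to a frozen CLIP transformer block. It illustrates the general technique rather than the authors' released implementation; the class names (`LoRAExpert`, `MoEAdapter`, `FrozenBlockWithAdapter`) and hyperparameters (`rank`, `num_experts`, `top_k`) are assumptions made for this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRAExpert(nn.Module):
    """One low-rank adapter expert: project down to a small rank, then back up."""
    def __init__(self, dim: int, rank: int = 16):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)  # zero-init so the adapter starts as a no-op

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(F.gelu(self.down(x)))


class MoEAdapter(nn.Module):
    """A router picks the top-k experts per token; outputs are mixed by routing weights."""
    def __init__(self, dim: int, num_experts: int = 4, rank: int = 16, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(LoRAExpert(dim, rank) for _ in range(num_experts))
        self.router = nn.Linear(dim, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        logits = self.router(x)                              # (..., num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)       # keep only the top-k experts
        weights = weights.softmax(dim=-1)
        expert_out = [expert(x) for expert in self.experts]  # dense for clarity; real MoE is sparse
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in range(len(self.experts)):
                mask = (idx[..., slot] == e).unsqueeze(-1)   # tokens routed to expert e in this slot
                out = out + mask * weights[..., slot:slot + 1] * expert_out[e]
        return out


class FrozenBlockWithAdapter(nn.Module):
    """Wrap a frozen CLIP transformer block and add the adapter output as a residual."""
    def __init__(self, block: nn.Module, dim: int):
        super().__init__()
        self.block = block
        for p in self.block.parameters():
            p.requires_grad = False                          # CLIP weights stay frozen
        self.adapter = MoEAdapter(dim)                       # only these parameters are trained

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.block(x) + self.adapter(x)
```

Because the frozen block's output is left untouched and the adapter only contributes a residual correction, zero-initializing the up-projection keeps the model identical to the original CLIP before any continual-learning updates; only the small router and expert matrices are trained per task.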
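For the Distribution Discriminative Auto-Selector, the summary above only states that it routes in-distribution inputs to the MoE adapters and out-of-distribution inputs to the original CLIP. The sketch below shows one way such a selector could work at inference time: each input is scored against per-task distribution models (here, tiny autoencoders over frozen CLIP features), and the sample falls back to zero-shot CLIP when no task matches. The class names, the autoencoder-based score, and the `threshold` value are illustrative assumptions, not details taken from the paper's released code.

```python
import torch
import torch.nn as nn


class TaskAutoEncoder(nn.Module):
    """Tiny per-task autoencoder over frozen CLIP image features.

    A low reconstruction error suggests the input comes from this task's
    distribution; a high error for every task suggests unseen (OOD) data.
    """
    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.encode = nn.Linear(dim, hidden)
        self.decode = nn.Linear(hidden, dim)

    def score(self, feat: torch.Tensor) -> torch.Tensor:
        recon = self.decode(torch.relu(self.encode(feat)))
        return (recon - feat).pow(2).mean(dim=-1)             # per-sample reconstruction error


class DistributionAutoSelector(nn.Module):
    """Route each sample to the adapted model (seen data) or zero-shot CLIP (unseen data)."""
    def __init__(self, task_autoencoders: nn.ModuleList, threshold: float = 0.05):
        super().__init__()
        self.task_autoencoders = task_autoencoders
        self.threshold = threshold  # tuned on held-out data in practice

    @torch.no_grad()
    def forward(self, feat, images, adapted_model, zero_shot_clip):
        # Lowest reconstruction error across all tasks seen so far.
        scores = torch.stack([ae.score(feat) for ae in self.task_autoencoders], dim=-1)
        best, _ = scores.min(dim=-1)
        in_dist = best < self.threshold                        # per-sample routing decision

        # Both models are assumed to output logits over the same label set.
        logits = zero_shot_clip(images)                        # default: original CLIP, zero-shot
        if in_dist.any():
            logits[in_dist] = adapted_model(images[in_dist])   # seen data: MoE-adapted CLIP
        return logits
```

In this sketch, `feat` would be the frozen CLIP image feature for each sample, and one `TaskAutoEncoder` is trained as each task arrives; how the actual DDAS computes its distribution score may differ from this simplified version.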