2023 | Zhili Liu, Kai Chen, Jianhua Han, Lanqing Hong, Hang Xu, Zhenguo Li, James T. Kwok
This paper proposes a novel pre-training paradigm called Mixture of Cluster-conditional Experts (MoCE) to address negative transfer in Masked Autoencoder (MAE) self-supervised learning. MAE, a popular self-supervised learning method, can suffer from negative transfer when downstream tasks have data distributions that differ from the pre-training data. MoCE provides customized pre-training models for diverse downstream tasks by clustering the pre-training data and training each expert on semantically relevant images. Unlike a standard Mixture of Experts (MoE), whose gates route individual tokens, MoCE uses cluster-conditional gates so that each expert is trained only on semantically related images; each downstream task can then be allocated to the customized model pre-trained on the data most similar to its own. Experiments on 11 downstream tasks show that MoCE outperforms vanilla MAE by 2.45% on average and achieves state-of-the-art results on detection and segmentation tasks. MoCE also improves training and testing efficiency, with a significant reduction in parameters and computational cost. It is the first work to achieve state-of-the-art transfer performance by training vision MoE models on ImageNet under the SSL setting.
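To make the cluster-conditional gating idea concrete, below is a minimal PyTorch sketch, not the paper's implementation. It assumes images have already been assigned cluster ids (e.g., via k-means over features from a pretrained encoder) and that the gate picks one expert per image from that cluster id, so each expert only ever sees semantically related images. The names ClusterGate and MoCEBlock and all sizes are illustrative assumptions.

```python
# Minimal sketch of cluster-conditional gating (illustrative, not the authors' code).
import torch
import torch.nn as nn


class ClusterGate(nn.Module):
    """Routes each image to an expert from its cluster id, not its tokens."""

    def __init__(self, num_clusters: int, num_experts: int, dim: int = 64):
        super().__init__()
        self.cluster_embed = nn.Embedding(num_clusters, dim)
        self.to_logits = nn.Linear(dim, num_experts)

    def forward(self, cluster_ids: torch.Tensor) -> torch.Tensor:
        # One routing decision per image (all its tokens share it),
        # so every expert is trained on semantically coherent data.
        logits = self.to_logits(self.cluster_embed(cluster_ids))
        return logits.argmax(dim=-1)  # hard top-1 expert assignment


class MoCEBlock(nn.Module):
    """Transformer FFN replaced by a bank of experts with a cluster-conditional gate."""

    def __init__(self, dim: int, num_experts: int, num_clusters: int):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        self.gate = ClusterGate(num_clusters, num_experts)

    def forward(self, tokens: torch.Tensor, cluster_ids: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, dim); cluster_ids: (batch,)
        expert_ids = self.gate(cluster_ids)
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            mask = expert_ids == e
            if mask.any():
                out[mask] = expert(tokens[mask])
        return out


# Usage: in practice cluster_ids would come from k-means over features of a
# pretrained encoder; here they are random placeholders.
block = MoCEBlock(dim=192, num_experts=4, num_clusters=16)
x = torch.randn(8, 196, 192)
cids = torch.randint(0, 16, (8,))
print(block(x, cids).shape)  # torch.Size([8, 196, 192])
```

Because routing is conditioned on the cluster rather than on each token, only the selected expert (plus the shared backbone) is needed for a given downstream task, which is what yields the parameter and compute savings at transfer time.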