Pre-trained Vision and Language Transformers Are Few-Shot Incremental Learners

2 Apr 2024 | Keon-Hee Park¹, Kyungwoo Song²†, Gyeong-Moon Park¹†
This paper presents a novel framework called PriViLege for Few-Shot Class Incremental Learning (FSCIL), which leverages pre-trained vision and language transformers. The framework addresses the challenges of catastrophic forgetting and overfitting in large models through a combination of pre-trained knowledge tuning (PKT), an entropy-based divergence loss, and a semantic knowledge distillation loss. PKT selectively trains specific layers of the pre-trained model to preserve domain knowledge while learning new classes. The entropy-based divergence loss enhances the discriminative power of the model during the base session, while the semantic knowledge distillation loss transfers semantic knowledge from the language space to the visual space. Experimental results show that PriViLege significantly outperforms existing state-of-the-art methods on benchmark datasets such as CUB200, CIFAR-100, and miniImageNet, achieving performance improvements of up to +9.38% on CUB200, +20.58% on CIFAR-100, and +13.36% on miniImageNet. The framework is applicable to various pre-trained models, including Vision Transformers (ViT) and CLIP. The study highlights the potential of large pre-trained models in FSCIL and provides a new direction for research in this area.
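To make the two core ideas above concrete, the sketch below illustrates them in PyTorch: selective tuning of a pre-trained backbone (freezing everything except a few designated blocks plus learnable prompt tokens, in the spirit of PKT) and a cosine-based distillation term that pulls the visual [CLS] feature toward a frozen language embedding of the class name (in the spirit of the semantic knowledge distillation loss). This is a minimal, hypothetical sketch, not the authors' implementation: the class names, the choice of which blocks to unfreeze, and the randomly initialised stand-in encoder (in practice one would load real ViT/CLIP weights) are all assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PKTStyleViT(nn.Module):
    """Stand-in for a pre-trained ViT (hypothetical; real weights would be
    loaded in practice). Only the blocks listed in `trainable_blocks` and the
    prompt tokens are updated; the rest of the backbone stays frozen,
    mirroring the selective-layer idea described for PKT."""

    def __init__(self, embed_dim=768, depth=12, num_heads=12,
                 num_prompts=4, trainable_blocks=(0, 1)):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(embed_dim, num_heads, batch_first=True)
            for _ in range(depth)
        ])
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.prompts = nn.Parameter(torch.randn(1, num_prompts, embed_dim) * 0.02)

        # Freeze the whole backbone, then re-enable only the chosen blocks.
        for p in self.blocks.parameters():
            p.requires_grad = False
        for idx in trainable_blocks:
            for p in self.blocks[idx].parameters():
                p.requires_grad = True

    def forward(self, patch_tokens):              # patch_tokens: (B, N, D)
        b = patch_tokens.size(0)
        x = torch.cat([self.cls_token.expand(b, -1, -1),
                       self.prompts.expand(b, -1, -1),
                       patch_tokens], dim=1)
        for blk in self.blocks:
            x = blk(x)
        return x[:, 0]                            # [CLS] feature


def semantic_kd_loss(visual_feat, text_feat):
    """Hypothetical semantic distillation term: cosine distance between the
    visual feature and a frozen class-name embedding (e.g. from a CLIP text
    encoder), so language-space knowledge guides the visual space."""
    v = F.normalize(visual_feat, dim=-1)
    t = F.normalize(text_feat.detach(), dim=-1)   # language side is not trained
    return (1.0 - (v * t).sum(dim=-1)).mean()


if __name__ == "__main__":
    model = PKTStyleViT()
    patches = torch.randn(2, 196, 768)            # dummy patch embeddings
    text_emb = torch.randn(2, 768)                # dummy class-name embeddings
    feat = model(patches)
    loss = semantic_kd_loss(feat, text_emb)
    loss.backward()
    n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"trainable params: {n_trainable}, kd loss: {loss.item():.4f}")
```

In this sketch only a small fraction of the backbone receives gradients, which is the intended trade-off: enough plasticity to absorb new few-shot classes while the frozen majority of the pre-trained weights protects against catastrophic forgetting and overfitting.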