Pre-trained Vision and Language Transformers Are Few-Shot Incremental Learners

2 Apr 2024 | Keon-Hee Park¹, Kyungwoo Song²†, Gyeong-Moon Park¹†
This paper presents a novel framework called PriViLege for Few-Shot Class Incremental Learning (FSCIL), which leverages pre-trained vision and language transformers. The framework addresses the challenges of catastrophic forgetting and overfitting in large models through a combination of pre-trained knowledge tuning (PKT), an entropy-based divergence loss, and a semantic knowledge distillation loss. PKT selectively trains specific layers of the pre-trained model to preserve domain knowledge while learning new classes. The entropy-based divergence loss enhances the discriminative power of the model during the base session, while the semantic knowledge distillation loss transfers semantic knowledge from the language space to the visual space. Experimental results show that PriViLege significantly outperforms existing state-of-the-art methods on benchmark datasets such as CUB200, CIFAR-100, and miniImageNet, achieving performance improvements of up to +9.38% on CUB200, +20.58% on CIFAR-100, and +13.36% on miniImageNet. The framework is applicable to various pre-trained models, including Vision Transformers (ViT) and CLIP. The study highlights the potential of large pre-trained models in FSCIL and provides a new direction for research in this area.
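To make the two core ideas above concrete, the sketch below illustrates them in PyTorch: selective tuning of a pre-trained backbone (freezing everything except a few designated blocks plus learnable prompt tokens, in the spirit of PKT) and a cosine-based distillation term that pulls the visual [CLS] feature toward a frozen language embedding of the class name (in the spirit of the semantic knowledge distillation loss). This is a minimal, hypothetical sketch, not the authors' implementation: the class names, the choice of which blocks to unfreeze, and the randomly initialised stand-in encoder (in practice one would load real ViT/CLIP weights) are all assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PKTStyleViT(nn.Module):
    """Stand-in for a pre-trained ViT (hypothetical; real weights would be
    loaded in practice). Only the blocks listed in `trainable_blocks` and the
    prompt tokens are updated; the rest of the backbone stays frozen,
    mirroring the selective-layer idea described for PKT."""

    def __init__(self, embed_dim=768, depth=12, num_heads=12,
                 num_prompts=4, trainable_blocks=(0, 1)):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(embed_dim, num_heads, batch_first=True)
            for _ in range(depth)
        ])
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.prompts = nn.Parameter(torch.randn(1, num_prompts, embed_dim) * 0.02)

        # Freeze the whole backbone, then re-enable only the chosen blocks.
        for p in self.blocks.parameters():
            p.requires_grad = False
        for idx in trainable_blocks:
            for p in self.blocks[idx].parameters():
                p.requires_grad = True

    def forward(self, patch_tokens):              # patch_tokens: (B, N, D)
        b = patch_tokens.size(0)
        x = torch.cat([self.cls_token.expand(b, -1, -1),
                       self.prompts.expand(b, -1, -1),
                       patch_tokens], dim=1)
        for blk in self.blocks:
            x = blk(x)
        return x[:, 0]                            # [CLS] feature


def semantic_kd_loss(visual_feat, text_feat):
    """Hypothetical semantic distillation term: cosine distance between the
    visual feature and a frozen class-name embedding (e.g. from a CLIP text
    encoder), so language-space knowledge guides the visual space."""
    v = F.normalize(visual_feat, dim=-1)
    t = F.normalize(text_feat.detach(), dim=-1)   # language side is not trained
    return (1.0 - (v * t).sum(dim=-1)).mean()


if __name__ == "__main__":
    model = PKTStyleViT()
    patches = torch.randn(2, 196, 768)            # dummy patch embeddings
    text_emb = torch.randn(2, 768)                # dummy class-name embeddings
    feat = model(patches)
    loss = semantic_kd_loss(feat, text_emb)
    loss.backward()
    n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"trainable params: {n_trainable}, kd loss: {loss.item():.4f}")
```

In this sketch only a small fraction of the backbone receives gradients, which is the intended trade-off: enough plasticity to absorb new few-shot classes while the frozen majority of the pre-trained weights protects against catastrophic forgetting and overfitting.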