Transductive Zero-Shot and Few-Shot CLIP

8 Apr 2024 | Ségolène Martin*, Yunshi Huang†, Fereshteh Shakeri†, Jean-Christophe Pesquet*, Ismail Ben Ayed†
This paper addresses transductive zero-shot and few-shot classification with the CLIP vision-language model. The authors propose a novel approach that performs inference jointly over a mini-batch of unlabeled query samples, rather than treating each instance independently. They construct informative vision-text probability features and formulate the classification problem on the unit simplex. Inspired by Expectation-Maximization (EM), they model the data distribution of each class with a Dirichlet distribution. The resulting minimization problem is solved with a block Majorization-Minimization (MM) algorithm that estimates the distribution parameters and the class assignments simultaneously. Extensive experiments on 11 datasets demonstrate the effectiveness of the proposed method, showing a near 20% improvement in ImageNet accuracy over CLIP's zero-shot performance and outperforming state-of-the-art methods in the few-shot setting. The code is available at: <https://github.com/SegoleneMartin/transductive-CLIP>.
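To make the pipeline concrete, below is a minimal sketch of transductive classification on the simplex with per-class Dirichlet models. It is a simplification, not the authors' block-MM algorithm: it runs a soft EM loop with Minka-style fixed-point Dirichlet updates, assumes uniform class priors, assumes precomputed L2-normalized CLIP embeddings, and omits refinements from the paper. The names `image_feats`, `text_feats`, and `tau` are hypothetical placeholders; see the linked repository for the actual method.

```python
# Hypothetical sketch: EM-style transductive labeling of a batch of unlabeled
# CLIP queries, with one Dirichlet distribution per class on the simplex of
# vision-text probability features. Not the paper's block-MM algorithm.
import numpy as np
from scipy.special import digamma, polygamma, gammaln, softmax


def dirichlet_log_pdf(z, alpha):
    """Log-density of Dirichlet(alpha) at each row of z (rows lie on the simplex)."""
    return (gammaln(alpha.sum()) - gammaln(alpha).sum()
            + ((alpha - 1.0) * np.log(z + 1e-12)).sum(axis=1))


def inv_digamma(y, n_iter=5):
    """Inverse digamma via Newton iterations (Minka's initialization)."""
    x = np.where(y >= -2.22, np.exp(y) + 0.5, -1.0 / (y - digamma(1.0)))
    for _ in range(n_iter):
        x = x - (digamma(x) - y) / polygamma(1, x)
    return x


def fit_dirichlet(z, weights, n_iter=20):
    """Weighted Dirichlet MLE via the fixed point psi(a_k) = psi(sum a) + E[log z_k]."""
    w = weights / (weights.sum() + 1e-12)
    log_z_bar = (w[:, None] * np.log(z + 1e-12)).sum(axis=0)
    alpha = np.ones(z.shape[1])
    for _ in range(n_iter):
        alpha = inv_digamma(digamma(alpha.sum()) + log_z_bar)
    return alpha


def transductive_em(image_feats, text_feats, tau=100.0, n_em=10):
    """Jointly classify a batch of unlabeled images.

    image_feats: (N, d) L2-normalized CLIP image embeddings (assumed given).
    text_feats:  (K, d) L2-normalized CLIP text embeddings, one per class.
    Returns soft class assignments of shape (N, K).
    """
    # Vision-text probability features: softmax of scaled cosine similarities,
    # i.e. one point on the unit simplex per query image.
    z = softmax(tau * image_feats @ text_feats.T, axis=1)
    n_samples, n_classes = z.shape

    # Initialize responsibilities with the zero-shot probabilities themselves.
    resp = z.copy()
    alphas = np.ones((n_classes, n_classes))

    for _ in range(n_em):
        # M-step: per-class Dirichlet parameters from responsibility-weighted samples.
        for k in range(n_classes):
            alphas[k] = fit_dirichlet(z, resp[:, k])
        # E-step: class posteriors from Dirichlet log-likelihoods (uniform priors).
        log_lik = np.stack([dirichlet_log_pdf(z, alphas[k])
                            for k in range(n_classes)], axis=1)
        resp = softmax(log_lik, axis=1)
    return resp
```

The key design point the sketch tries to convey is the transductive coupling: every query in the mini-batch influences the Dirichlet parameters of every class, so the final assignment of one sample depends on the whole batch rather than only on its own text similarities, as in standard zero-shot CLIP.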