8 Apr 2024 | Ségolène Martin*, Yunshi Huang†, Fereshteh Shakeri†, Jean-Christophe Pesquet*, Ismail Ben Ayed†
This paper proposes a transductive zero-shot and few-shot classification method for the CLIP vision-language model. The method addresses the challenge of performing joint inference across a mini-batch of unlabeled query samples, rather than treating each instance independently. The approach constructs informative vision-text probability features, leading to a classification problem on the unit simplex set. Inspired by Expectation-Maximization (EM), the method models the data probability distribution for each class using a Dirichlet law. A novel block Majorization-Minimization algorithm is proposed to simultaneously estimate the distribution parameters and class assignments. The method is evaluated on 11 datasets, showing significant improvements in accuracy over existing methods. On zero-shot tasks with test batches of 75 samples, the method achieves near 20% improvement in ImageNet accuracy over CLIP's zero-shot performance. Additionally, the method outperforms state-of-the-art methods in the few-shot setting. The code is available at: https://github.com/SegoleneMartin/transductive-CLIP.
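To make the two main ingredients of the abstract concrete, the sketch below illustrates (i) vision-text probability features built as softmax-normalized image-text cosine similarities, a standard CLIP zero-shot construction that yields points on the unit simplex, and (ii) a class assignment based on per-class Dirichlet log-likelihoods. This is a minimal illustration, not the paper's algorithm: the function names `probability_features` and `assign_classes`, the temperature value, and the random toy inputs are assumptions, and the block Majorization-Minimization updates for the Dirichlet parameters are not reproduced here.

```python
import torch
import torch.nn.functional as F
from torch.distributions import Dirichlet


def probability_features(image_feats: torch.Tensor,
                         text_feats: torch.Tensor,
                         temperature: float = 0.01) -> torch.Tensor:
    """Softmax-normalized image-text cosine similarities: one simplex point per image.

    image_feats: (N, d) image embeddings from CLIP's vision encoder.
    text_feats:  (K, d) text embeddings of the K class prompts.
    Returns a (N, K) matrix whose rows sum to one.
    """
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = image_feats @ text_feats.t() / temperature
    return logits.softmax(dim=-1)


def assign_classes(probs: torch.Tensor, alphas: torch.Tensor) -> torch.Tensor:
    """Hard-assign each query to the class whose Dirichlet best explains its feature.

    probs:  (N, K) simplex features for N query samples.
    alphas: (K, K) Dirichlet concentration parameters, one row per class.
    """
    log_lik = Dirichlet(alphas).log_prob(probs.unsqueeze(1))  # shape (N, K)
    return log_lik.argmax(dim=-1)


if __name__ == "__main__":
    N, K, d = 75, 10, 512                   # 75-sample query batch, 10 classes, CLIP embedding dim
    probs = probability_features(torch.randn(N, d), torch.randn(K, d))
    alphas = torch.rand(K, K) * 5 + 0.5     # toy concentrations (the paper estimates these jointly)
    print(assign_classes(probs, alphas)[:10])
```

In practice the embeddings would come from a pretrained CLIP encoder rather than random tensors, and the concentration parameters and assignments would be updated jointly, as described in the paper's block Majorization-Minimization scheme.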