Understanding Do Vision and Language Encoders Represent the World Similarly?

Aligned text-image encoders like CLIP have become the de facto model for vision-language tasks. However, the question remains: do unaligned vision and language encoders represent the world similarly? Using Centered Kernel Alignment (CKA), we analyze the latent spaces of vision and language models on image-caption benchmarks and find that the representation spaces of unaligned and aligned encoders are semantically similar.

We propose two methods for matching unaligned encoders: a Fast Quadratic Assignment Problem (QAP) optimization and a localized CKA metric-based matching/retrieval. These methods are effective for downstream tasks such as cross-lingual, cross-domain caption matching and image classification. Benchmarked on COCO, NoCaps, and ImageNet-100, they show that unaligned encoders can achieve zero-shot communication between their latent spaces. We also demonstrate that the CKA metric is sensitive to data ordering and that the best performance is achieved when the data is in the correct order. The method can further be applied to cross-lingual image retrieval by using sentence transformers trained in various languages and a CLIP vision encoder trained only in English.

The results indicate that the representations of unaligned vision and language encoders are sufficiently high-level and differ only by a linear transformation. However, this linear layer is trained on CC-3M, which consists of three million image-caption pairs. We argue that using an explicit similarity measure is sensitive to the selection of anchors and to noise in the original embeddings. Instead, we propose an implicit measure that captures the similarity of similarities, making the alignment process more robust. We also explore how this similarity can be leveraged for downstream cross-modal tasks in a training-free manner with the aid of CKA and a set of parallel anchors in the image and text latent embedding spaces.
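
The CKA comparison and the matching idea summarized above can be illustrated with a small sketch. It is not the authors' implementation: it assumes linear-kernel CKA computed from double-centered Gram matrices, uses synthetic Gaussian vectors in place of real image and caption embeddings, and replaces the Fast QAP optimization with an exhaustive permutation search that is only feasible for a handful of samples. All function names (`centered_gram`, `linear_cka`, `best_permutation`) are hypothetical.

```python
import itertools
import math
import random


def centered_gram(X):
    """Linear-kernel Gram matrix K = X X^T, double-centered (H K H)."""
    n = len(X)
    K = [[sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]
    row_mean = [sum(row) / n for row in K]
    total_mean = sum(row_mean) / n
    # K is symmetric, so the column means equal the row means.
    return [[K[i][j] - row_mean[i] - row_mean[j] + total_mean
             for j in range(n)] for i in range(n)]


def inner(K, L):
    """Frobenius inner product of two matrices (proportional to biased HSIC)."""
    return sum(k * l for rk, rl in zip(K, L) for k, l in zip(rk, rl))


def linear_cka(X, Y):
    """CKA between two sets of embeddings whose rows are paired samples."""
    K, L = centered_gram(X), centered_gram(Y)
    return inner(K, L) / math.sqrt(inner(K, K) * inner(L, L))


def best_permutation(X, Y):
    """Brute-force stand-in for QAP: the row order of Y that maximizes CKA."""
    return max(itertools.permutations(range(len(Y))),
               key=lambda p: linear_cka(X, [Y[i] for i in p]))


rng = random.Random(0)

# Synthetic "encoder A" embeddings: 40 samples, 5 dimensions.
X = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(40)]
# Synthetic "encoder B": same samples, features reversed and rescaled.
# Linear CKA is invariant to such orthogonal maps and isotropic scaling.
Y = [[3.0 * v for v in reversed(x)] for x in X]

aligned = linear_cka(X, Y)                  # rows correctly paired
mismatched = linear_cka(X, Y[1:] + Y[:1])   # every pair shifted by one
print("CKA, correct order:", aligned)
print("CKA, shifted order:", mismatched)

# Matching: shuffle a tiny subset of Y and recover the pairing via CKA.
X_small, Y_small = X[:6], Y[:6]
perm = list(range(6))
rng.shuffle(perm)
Y_shuffled = [Y_small[i] for i in perm]
recovered = best_permutation(X_small, Y_shuffled)
print("recovered pairing is correct:",
      [perm[i] for i in recovered] == list(range(6)))
```

The first comparison mirrors the observation that CKA depends on data ordering: the score is computed from pairwise similarities within each space, so it only stays high when row i of both matrices describes the same item. The second part uses that dependence as a matching signal by searching for the ordering that maximizes CKA. The search space grows factorially with the number of samples, which is why a practical version needs an approximate QAP solver rather than enumeration.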

Do Vision and Language Encoders Represent the World Similarly?

22 Mar 2024 | Mayug Maniparambil, Raiymbek Akshulakov, Yasser Abdelaziz Dahou Djilali, Sanath Narayan, Mohamed El Amine Seddik, Karttikeya Mangalam, Noel E. O'Connor