22 Mar 2024 | Mayug Maniparambil*1, Raiymbek Akshulakov*2, Yasser Abdelaziz Dahou Djilali3,1, Sanath Narayan3, Mohamed El Amine Seddik3, Karttikeya Mangalam2, Noel E. O'Connor1
The paper investigates whether unaligned vision and language encoders represent the physical world similarly. Using Centered Kernel Alignment (CKA), the authors find that the latent spaces of unaligned encoders are semantically similar, even though aligned encoders like CLIP do not exhibit statistical similarity. They propose two methods, Fast Quadratic Assignment Problem (QAP) optimization and a localized CKA metric, to align unaligned encoders without training. These methods are evaluated on downstream tasks such as cross-lingual caption matching and image classification, where they outperform relative representations. The study shows that vision encoders trained well on large datasets exhibit high semantic similarity with language encoders, regardless of the training paradigm. The authors also analyze how training paradigm, data regime, and encoder size and architecture affect the semantic alignment of vision and language encoders.
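To make the CKA measurement concrete, the following is a minimal sketch of the standard linear-kernel form of CKA, computed between two sets of embeddings whose rows are paired by index. It is an illustration under stated assumptions, not the authors' implementation: the paper may use a different kernel, and the function name `linear_cka` and the synthetic `image_emb` and `text_emb` arrays are hypothetical stand-ins for real encoder outputs.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between x (n, d1) and y (n, d2); row i of x pairs with row i of y."""
    if x.shape[0] != y.shape[0]:
        raise ValueError("x and y must contain the same number of samples")
    # Centre every feature column so the implied Gram matrices are centred.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # With linear kernels, HSIC(K, L) is proportional to ||y^T x||_F^2.
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


# Synthetic stand-ins for encoder outputs (assumption: no real models are used here).
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(200, 32))

# A random orthogonal map plus a rescaling: same geometry, different basis and scale.
q, _ = np.linalg.qr(rng.normal(size=(32, 32)))
text_emb = 3.0 * image_emb @ q

# Shuffling the rows destroys the image-caption pairing.
shuffled_text_emb = text_emb[rng.permutation(text_emb.shape[0])]

paired_score = linear_cka(image_emb, text_emb)
shuffled_score = linear_cka(image_emb, shuffled_text_emb)
print(f"paired CKA:   {paired_score:.3f}")
print(f"shuffled CKA: {shuffled_score:.3f}")
```

Because linear CKA is invariant to orthogonal transformations and isotropic scaling, the paired score here sits at essentially 1 even though the two embedding sets share no coordinates, while the shuffled pairing scores much lower. That sensitivity to the pairing is what makes a CKA-style score usable as an objective when searching for a matching between two embedding sets.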