1 May 2024 | Young Kyun Jang, Dat Huynh, Ashish Shah, Wen-Kai Chen, and Ser-Nam Lim
This paper introduces a novel Zero-Shot Composed Image Retrieval (ZS-CIR) method based on Spherical Linear Interpolation (Slerp) and a Text-Anchored-Tuning (TAT) strategy. The proposed method directly merges image and text representations by identifying an intermediate embedding, avoiding the limitations of previous pseudo-word token-based approaches that distort image representations. TAT fine-tunes the image encoder while keeping the text encoder fixed, reducing the modality gap and enhancing the effectiveness of Slerp. The integration of Slerp and TAT significantly improves ZS-CIR performance across various benchmarks, including natural and fashion image datasets.
The method is efficient, requiring only a single training epoch and demonstrating superior performance even with limited training data. The TAT strategy also serves as an effective initial checkpoint for supervised CIR models. Experimental results show that the proposed method achieves state-of-the-art performance in ZS-CIR tasks, outperforming existing methods in terms of retrieval accuracy and efficiency. The approach is versatile and applicable across diverse domains, highlighting the potential of Slerp and TAT in vision-language tasks.
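To make the idea of an "intermediate embedding" concrete, the following is a minimal sketch of spherical linear interpolation between an image embedding and a text embedding, followed by a cosine-similarity ranking of a gallery. It is an illustration under stated assumptions, not the authors' implementation: the function names (`normalize`, `slerp`, `rank_gallery`), the weight name `alpha`, and the convention that `alpha = 0` returns the image embedding and `alpha = 1` returns the text embedding are all hypothetical, and the toy 2-D vectors stand in for real encoder outputs.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def slerp(image_emb, text_emb, alpha):
    """Interpolate along the unit-sphere arc from image_emb to text_emb.

    alpha = 0 gives the (normalized) image embedding and alpha = 1 the
    (normalized) text embedding; this convention is an assumption.
    """
    v = normalize(image_emb)
    w = normalize(text_emb)
    # Clamp the dot product to guard against rounding just outside [-1, 1].
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(v, w))))
    theta = math.acos(dot)  # angle between the two unit vectors
    sin_theta = math.sin(theta)
    if sin_theta < 1e-8:
        # Nearly parallel inputs: fall back to linear interpolation.
        # Anti-parallel inputs have no unique arc and are not handled specially.
        mixed = [(1 - alpha) * a + alpha * b for a, b in zip(v, w)]
        return normalize(mixed)
    coeff_v = math.sin((1 - alpha) * theta) / sin_theta
    coeff_w = math.sin(alpha * theta) / sin_theta
    return [coeff_v * a + coeff_w * b for a, b in zip(v, w)]


def rank_gallery(query, gallery):
    """Return gallery indices sorted by descending cosine similarity to query."""
    q = normalize(query)
    scores = [sum(a * b for a, b in zip(q, normalize(item))) for item in gallery]
    return sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)


# Toy example: orthogonal "image" and "text" embeddings, equal weighting.
composed = slerp([1.0, 0.0], [0.0, 1.0], 0.5)  # approx. [0.7071, 0.7071]
ranking = rank_gallery(composed, [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
print(composed, ranking)  # the gallery item [1.0, 1.0] (index 1) ranks first
```

Unlike a plain weighted average, the interpolated vector stays on the unit sphere, so it can be compared against gallery embeddings with cosine similarity without re-normalizing. How well the midpoint represents the composed query depends on how close the two modalities already are, which is the gap the summary says TAT is meant to reduce.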