1 May 2024 | Young Kyun Jang, Dat Huynh, Ashish Shah, Wen-Kai Chen, Ser-Nam Lim
The paper introduces a novel approach for Zero-Shot Composed Image Retrieval (ZS-CIR) that combines Spherical Linear Interpolation (Slerp) and Text-Anchored-Tuning (TAT). Slerp directly merges image and text representations by identifying an intermediate embedding, while TAT fine-tunes the image encoder to align image embeddings with text embeddings, reducing the modality gap. This method achieves state-of-the-art performance on various benchmarks, including CIRR, CIRCO, and FashionIQ, with superior training efficiency and broader applicability. The TAT strategy also serves as an effective initial checkpoint for supervised CIR models, demonstrating the potential of the proposed approach in both zero-shot and supervised settings.
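
To make the Slerp step concrete, the following is a minimal NumPy sketch of spherical linear interpolation between one image embedding and one text embedding, followed by cosine-similarity ranking of a gallery. It uses the standard Slerp formula; the function names `slerp` and `rank_gallery`, the `alpha` parameter name, its default of 0.5, and the handling of the degenerate case are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def slerp(image_emb, text_emb, alpha=0.5, eps=1e-7):
    """Spherical linear interpolation between two embeddings.

    alpha=0 returns the (normalized) image embedding, alpha=1 the
    (normalized) text embedding. Names and defaults are illustrative.
    """
    v = np.asarray(image_emb, dtype=np.float64)
    w = np.asarray(text_emb, dtype=np.float64)
    # Project both embeddings onto the unit sphere.
    v = v / np.linalg.norm(v)
    w = w / np.linalg.norm(w)

    # Angle between the two unit vectors.
    cos_theta = np.clip(np.dot(v, w), -1.0, 1.0)
    theta = np.arccos(cos_theta)
    sin_theta = np.sin(theta)

    # Degenerate case (parallel or antipodal vectors): the arc is trivial
    # or not unique, so this sketch simply returns the image direction.
    if sin_theta < eps:
        return v

    coeff_v = np.sin((1.0 - alpha) * theta) / sin_theta
    coeff_w = np.sin(alpha * theta) / sin_theta
    return coeff_v * v + coeff_w * w


def rank_gallery(query_emb, gallery_embs):
    """Return gallery indices sorted by descending cosine similarity."""
    gallery = np.asarray(gallery_embs, dtype=np.float64)
    gallery = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    scores = gallery @ query_emb
    return np.argsort(-scores)


# Toy usage with 2-D vectors standing in for encoder outputs.
image_emb = np.array([1.0, 0.0])
text_emb = np.array([0.0, 1.0])
query = slerp(image_emb, text_emb, alpha=0.5)
gallery = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
ranking = rank_gallery(query, gallery)
```

Because the interpolation follows the arc between the two unit vectors, the composed query stays on the unit sphere, so cosine similarity against normalized gallery embeddings remains directly comparable. In this sketch, `alpha` controls how far the query moves from the image toward the text.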