MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions


2024 | Kai Zhang, Yi Luan, Hexiang Hu, Kenton Lee, Siyuan Qiao, Wenhui Chen, Yu Su, Ming-Wei Chang
**Abstract:** Image retrieval, which involves finding desired images based on a reference image, often requires capturing complex and multi-faceted search intents that are difficult to express through image-based similarity measures alone. Recent approaches leverage text instructions so that users can express more nuanced search intentions, but they focus primarily on visually similar images or a limited set of predefined relations. This paper introduces MagicLens, a series of self-supervised image retrieval models that support open-ended instructions. MagicLens is designed to capture rich, implicit relations between images, such as "inside view of," by synthesizing instructions with foundation models. Trained on 36.7 million (query image, instruction, target image) triplets with rich semantic relations mined from the web, MagicLens achieves results comparable to or better than prior state-of-the-art (SOTA) methods on eight benchmarks covering various image retrieval tasks, while remaining highly parameter-efficient with a significantly smaller model size. Human evaluations on a 1.4 million-image corpus further demonstrate the model's ability to support diverse and complex search intents.

**Introduction:** Image retrieval is a fundamental problem in computer vision, with applications in visual search, object localization, and re-identification. The task has long been constrained by ambiguous definitions of image similarity and by the complexity of image content: users often have multiple search intents for a single query image, which calls for models that capture diverse real-world intents. MagicLens addresses this by incorporating open-ended text instructions that span a wide range of topics and concepts. The model is trained on 36.7 million triplets, each consisting of a query image, an instruction, and a target image; the instructions are generated by large language models (LLMs) to describe the open-ended semantic relation between the query and target images.

**Related Work:** Prior work on image retrieval has focused on pre-training multimodal encoders, composed image retrieval, and retrieval with instructions. MagicLens differs by mining naturally occurring image pairs from the same web pages, which provide rich and diverse semantic relations. The model is trained with a contrastive loss, and its performance is evaluated on multiple benchmarks, including composed image retrieval, domain transfer retrieval, and conditional image similarity. (Minimal sketches of the triplet format, the contrastive objective, and the retrieval procedure are given after this summary.)

**Experiments:** MagicLens outperforms previous SOTA methods on five benchmarks, demonstrating its effectiveness on multimodality-to-image retrieval tasks. It also generalizes strongly, surpassing prior methods on zero-shot sketch-based image retrieval, and it improves the performance of its backbone encoders on text-to-image retrieval.

**Analysis:** Human evaluations on a 1.4 million-image corpus show that MagicLens satisfies diverse search intents, especially complex ones that go beyond visual similarity. Qualitative studies further demonstrate the model's ability to understand and follow open-ended instructions.
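The Introduction describes the training data as (query image, instruction, target image) triplets mined from image pairs on the same web page, with the instruction synthesized by an LLM to verbalize their relation. The record below is a hypothetical illustration of what one such triplet might look like; the field names and example values are assumptions for clarity, not the paper's actual data schema.

```python
from dataclasses import dataclass

@dataclass
class RetrievalTriplet:
    """One self-supervised training example, as described in the summary above."""
    query_image: str    # path or URL of the query image
    instruction: str    # LLM-synthesized text describing the query -> target relation
    target_image: str   # path or URL of the target image from the same web page

# Hypothetical example mirroring the "inside view of" relation mentioned in the abstract.
example = RetrievalTriplet(
    query_image="page_001/exterior.jpg",
    instruction="show the inside view of this building",
    target_image="page_001/interior.jpg",
)
```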
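The summary states that the model is trained with a contrastive loss over such triplets. The snippet below is a minimal sketch of a symmetric InfoNCE objective in PyTorch, assuming a dual-encoder setup in which one encoder fuses the query image with its instruction and another encodes the target image; the embedding dimension, batch construction, and temperature are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(query_emb: torch.Tensor,
                     target_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of (query, target) embedding pairs.

    query_emb:  [B, D] embeddings of (query image + instruction), fused upstream.
    target_emb: [B, D] embeddings of the target images.
    Matching pairs share a row index; all other rows in the batch act as negatives.
    """
    q = F.normalize(query_emb, dim=-1)
    t = F.normalize(target_emb, dim=-1)
    logits = q @ t.T / temperature                  # [B, B] cosine-similarity logits
    labels = torch.arange(q.size(0), device=q.device)
    loss_q2t = F.cross_entropy(logits, labels)      # query -> target direction
    loss_t2q = F.cross_entropy(logits.T, labels)    # target -> query direction
    return 0.5 * (loss_q2t + loss_t2q)

# Illustrative usage with random tensors standing in for encoder outputs.
if __name__ == "__main__":
    batch, dim = 8, 512
    query_emb = torch.randn(batch, dim)   # would come from an image+instruction encoder
    target_emb = torch.randn(batch, dim)  # would come from an image encoder
    print(contrastive_loss(query_emb, target_emb).item())
```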
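At inference time, the summary describes multimodality-to-image retrieval: a query image plus an open-ended instruction is matched against a large image corpus. The sketch below assumes the corpus has already been encoded into a matrix of embeddings and ranks candidates by cosine similarity; it is an illustrative nearest-neighbor search, not the paper's retrieval code.

```python
import torch
import torch.nn.functional as F

def retrieve_top_k(query_emb: torch.Tensor,
                   corpus_emb: torch.Tensor,
                   k: int = 5) -> tuple[torch.Tensor, torch.Tensor]:
    """Rank a pre-encoded image corpus by cosine similarity to one query embedding.

    query_emb:  [D] embedding of the (query image, instruction) pair.
    corpus_emb: [N, D] pre-computed embeddings of the candidate images.
    Returns the top-k similarity scores and the corresponding corpus indices.
    """
    q = F.normalize(query_emb, dim=-1)
    c = F.normalize(corpus_emb, dim=-1)
    scores = c @ q                       # [N] cosine similarities
    top = torch.topk(scores, k)
    return top.values, top.indices

# Illustrative usage with random tensors in place of real embeddings.
if __name__ == "__main__":
    corpus_emb = torch.randn(1_000, 512)  # stand-in for a pre-encoded image corpus
    query_emb = torch.randn(512)          # stand-in for an encoded (image, instruction) query
    scores, indices = retrieve_top_k(query_emb, corpus_emb, k=5)
    print(indices.tolist(), [round(s, 3) for s in scores.tolist()])
```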
**Conclusion:** MagicLens demonstrates that naturally occurring image pairs from the same web pages, combined with LLM-synthesized open-ended instructions, provide effective self-supervision for instruction-following image retrieval: trained on 36.7 million such triplets, it matches or surpasses prior SOTA methods on eight benchmarks with a significantly smaller model and supports diverse, complex search intents.