2024 | Kai Zhang, Yi Luan, Hexiang Hu, Kenton Lee, Siyuan Qiao, Wenhui Chen, Yu Su, Ming-Wei Chang
MagicLens is a series of lightweight, self-supervised dual-encoder models for image retrieval that follow open-ended text instructions. The key insight is that naturally occurring image pairs from the same web pages contain a wide range of implicit relations, often richer than visual similarity, and that foundation models can synthesize these implicit relations into explicit instructions. Trained on 36.7 million (query image, instruction, target image) triplets mined from web pages this way, MagicLens matches or outperforms prior state-of-the-art methods on eight benchmarks spanning composed image retrieval, domain transfer retrieval, and conditional image similarity, including CIRCO, Domain Transfer ImageNet, and GeneCIS, while using a significantly smaller model.

Human analysis on a 1.4 million image corpus further demonstrates the diversity of search intents MagicLens supports: it can retrieve images that align with the deeper meaning and context of an instruction even when they do not visually resemble the query. This ability to understand and follow open-ended, beyond-visual search intents makes it a valuable tool for real-world image search applications.
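The dual-encoder retrieval setup described above can be sketched in a few lines. The following is a minimal toy illustration, not the actual MagicLens implementation: the "encoders" are random linear projections standing in for learned vision-language encoders, and the fusion-by-concatenation, embedding dimension, and weight shapes are assumptions made for the sketch. The essential structure is real, though: the query side embeds (image, instruction) jointly, the target side embeds candidate images alone, and retrieval ranks candidates by cosine similarity in the shared space.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # embedding dimension (assumption for this sketch)

def l2_normalize(x):
    # Project onto the unit sphere so dot product == cosine similarity.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def encode_query(image_feat, instruction_feat, w_q):
    # Query-side encoder: fuse image and instruction features
    # (here by simple concatenation), then project into the
    # shared embedding space. A real model would use learned
    # multimodal fusion instead of a random projection.
    fused = np.concatenate([image_feat, instruction_feat])
    return l2_normalize(fused @ w_q)

def encode_target(image_feat, w_t):
    # Target-side encoder embeds each candidate image alone, so
    # the whole corpus can be pre-encoded and indexed offline.
    return l2_normalize(image_feat @ w_t)

def retrieve(query_emb, target_embs, k=1):
    # Rank candidates by cosine similarity and return top-k indices.
    scores = target_embs @ query_emb
    return np.argsort(-scores)[:k]

# Toy corpus of 5 candidate images, pre-encoded on the target side.
w_q = rng.normal(size=(2 * DIM, DIM))  # query-side projection
w_t = rng.normal(size=(DIM, DIM))      # target-side projection
targets = np.stack(
    [encode_target(rng.normal(size=DIM), w_t) for _ in range(5)]
)

# One query: an image feature plus an instruction feature.
query = encode_query(rng.normal(size=DIM), rng.normal(size=DIM), w_q)
top = retrieve(query, targets, k=3)
```

In training, the two encoders would be optimized contrastively so that each query embedding lands near its paired target image and away from the other images in the batch; at inference time only the cheap dot-product ranking above is needed.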
The research also highlights the importance of self-supervised training with naturally occurring image pairs and the potential of such approaches for other vision-language tasks.