24 Mar 2024 | Yucheng Suo, Fan Ma, Linchao Zhu†, Yi Yang
This paper proposes a Knowledge-Enhanced Dual-stream framework (KEDs) for zero-shot composed image retrieval. KEDs incorporates a Bi-modality Knowledge-guided Projection network (BKP) that generates pseudo-word tokens grounded in external knowledge: BKP retrieves relevant images and captions from an external database and uses them to enrich the pseudo-word tokens with shared attribute information. Additionally, KEDs introduces an extra stream that generates pseudo-word tokens aligned with textual concepts, leveraging pseudo-triplets mined from image-text pairs. The framework is evaluated on four widely used benchmarks: ImageNet-R, COCO, Fashion-IQ, and CIRR. The results show that KEDs outperforms previous zero-shot composed image retrieval methods, achieving significant improvements in recall metrics. The framework demonstrates strong generalization across domains and tasks, including domain conversion, object composition, scene manipulation, and fashion attribute manipulation. The proposed method effectively captures fine-grained object information and leverages multi-modal knowledge from the database to recognize objects and deduce scene layouts. The framework is also robust to varying loss weights during training and handles diverse compositional datasets. These results indicate that KEDs has strong real-world application potential in zero-shot composed image retrieval.
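To make the dual-stream design concrete, below is a minimal sketch (not the authors' code) of how two knowledge-guided projection heads could turn a reference image feature into a pseudo-word token: one head attends over retrieved image features and the other over the corresponding caption features, and their outputs are fused. The module names, feature dimension, neighbor count `k`, and fusion weight `alpha` are all illustrative assumptions, and the random "database" stands in for the paper's external image-text corpus.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KnowledgeGuidedProjection(nn.Module):
    """Cross-attend a query image feature over retrieved knowledge features,
    then project the enriched feature to a pseudo-word token."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.proj = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, query: torch.Tensor, knowledge: torch.Tensor) -> torch.Tensor:
        # query: (B, 1, D) image feature; knowledge: (B, K, D) retrieved features.
        enriched, _ = self.attn(query, knowledge, knowledge)
        return self.proj(query + enriched).squeeze(1)  # (B, D) pseudo-word token

class DualStreamTokenizer(nn.Module):
    """Hypothetical dual-stream tokenizer: fuses a visual-knowledge stream and
    a textual-concept stream into one pseudo-word token."""
    def __init__(self, dim: int = 512, db_size: int = 10000, k: int = 16):
        super().__init__()
        self.k = k
        # Frozen external "database" of paired image/caption features; random
        # stand-ins here, where the paper would use a large image-text corpus.
        self.register_buffer("db_img", F.normalize(torch.randn(db_size, dim), dim=-1))
        self.register_buffer("db_txt", F.normalize(torch.randn(db_size, dim), dim=-1))
        self.img_stream = KnowledgeGuidedProjection(dim)  # visual-knowledge stream
        self.txt_stream = KnowledgeGuidedProjection(dim)  # textual-concept stream
        self.alpha = 0.5  # illustrative fusion weight between the two streams

    def forward(self, img_feat: torch.Tensor) -> torch.Tensor:
        q = F.normalize(img_feat, dim=-1)                        # (B, D)
        # Nearest neighbors in the database by cosine similarity.
        idx = (q @ self.db_img.T).topk(self.k, dim=-1).indices   # (B, K)
        img_knowledge = self.db_img[idx]                         # (B, K, D)
        txt_knowledge = self.db_txt[idx]                         # captions of neighbors
        q = q.unsqueeze(1)                                       # (B, 1, D)
        tok_v = self.img_stream(q, img_knowledge)
        tok_t = self.txt_stream(q, txt_knowledge)
        return self.alpha * tok_v + (1 - self.alpha) * tok_t     # fused pseudo-token

if __name__ == "__main__":
    model = DualStreamTokenizer()
    pseudo_token = model(torch.randn(4, 512))
    print(pseudo_token.shape)  # torch.Size([4, 512])
```

In a full pipeline, the fused pseudo-word token would be inserted into a text template (e.g., "a photo of [*] ...") alongside the relative caption and encoded by a frozen CLIP text encoder for retrieval; that integration is omitted here for brevity.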