24 Mar 2024 | Yucheng Suo, Fan Ma, Linchao Zhu†, Yi Yang
The paper introduces a Knowledge-Enhanced Dual-stream zero-shot composed image retrieval framework (KEDs) to address the challenge of retrieving a target image given a reference image and a relative description, without training on triplet datasets. Previous methods generate pseudo-word tokens by projecting reference image features into the text embedding space, but they often overlook detailed attributes such as color, object count, and layout. KEDs addresses this by incorporating a database that enriches the pseudo-word tokens with relevant images and captions, emphasizing shared attribute information. An additional stream aligns the pseudo-word tokens with textual concepts using pseudo-triplets mined from image-text pairs. Extensive experiments on the ImageNet-R, COCO, Fashion-IQ, and CIRR benchmarks show that KEDs outperforms previous methods, particularly on domain-conversion tasks. Ablation studies and qualitative examples demonstrate the method's effectiveness, highlighting its ability to capture fine-grained object information and to generalize across different compositional tasks.
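The core mechanism the summary describes, projecting a reference image feature into the text embedding space to obtain a "pseudo-word" token, can be sketched as follows. This is a minimal illustration, not the KEDs implementation: the dimensions, the random linear projection standing in for a learned mapping network, and the toy feature vector are all assumptions for demonstration.

```python
import numpy as np

# Hypothetical dimensions: d_img for the image-feature space, d_txt for
# the text token-embedding space. Values are illustrative, not from KEDs.
d_img, d_txt = 512, 768
rng = np.random.default_rng(0)

# A projection matrix standing in for a learned mapping network that
# turns a reference image feature into a pseudo-word token living in
# the text encoder's token-embedding space.
W = rng.standard_normal((d_img, d_txt)) / np.sqrt(d_img)

def image_to_pseudo_token(image_feat: np.ndarray) -> np.ndarray:
    """Project an image feature into the text token-embedding space."""
    return image_feat @ W

# Toy reference-image feature standing in for a vision-encoder embedding.
image_feat = rng.standard_normal(d_img)
pseudo_token = image_to_pseudo_token(image_feat)

# In zero-shot composed retrieval, this pseudo token is spliced into a
# caption template (e.g. "a photo of <pseudo> that <description>"),
# and the resulting text embedding is matched against candidate images.
assert pseudo_token.shape == (d_txt,)
```

In KEDs, per the summary, this projection is further enriched with attribute information retrieved from a database of relevant images and captions, rather than relying on the raw image feature alone.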