Stacked Cross Attention for Image-Text Matching

This paper introduces Stacked Cross Attention (SCAN), a method for image-text matching that captures fine-grained visual-semantic alignments by using both the regions of an image and the words of a sentence as context when inferring image-text similarity. Prior approaches either aggregate region-word similarities without attention or use a multi-step attention process that captures only a limited number of alignments. SCAN instead discovers all possible alignments simultaneously, which makes its matching decisions more interpretable.

SCAN uses bottom-up attention to detect and encode image regions, and an RNN to represent sentences. The model has two complementary formulations: Image-Text, which attends to words with respect to each image region, and Text-Image, which attends to image regions with respect to each word. In both formulations, the resulting scores are pooled into a single image-text similarity score using either LogSumExp or average pooling.

On the MS-COCO and Flickr30K datasets, SCAN achieves state-of-the-art performance, outperforming existing methods by significant margins in both text retrieval and image retrieval. The paper also presents ablation studies that validate the effectiveness of the model and its components, and the results show that SCAN produces more interpretable alignments between image regions and words.
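The two formulations and the pooling step can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes cosine similarity between region and word vectors, a plain softmax for the attention weights, and hypothetical temperature values (`lam`), and it skips any similarity normalization the paper may apply. The function names are invented for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b, eps=1e-8):
    # Cosine similarity; eps guards against division by zero.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)) + eps)

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, contexts, lam=9.0):
    """For each query vector, build a softmax-weighted sum of the context vectors.

    lam is an assumed temperature: larger values sharpen the attention.
    """
    dim = len(contexts[0])
    attended = []
    for q in queries:
        weights = softmax([lam * cosine(q, c) for c in contexts])
        attended.append(
            [sum(w * c[d] for w, c in zip(weights, contexts)) for d in range(dim)]
        )
    return attended

def pool(relevances, mode="avg", lam=6.0):
    """Pool per-region (or per-word) relevance scores into one similarity."""
    if mode == "avg":
        return sum(relevances) / len(relevances)
    if mode == "lse":
        # LogSumExp pooling, written as (1/lam) * log(sum(exp(lam * r))).
        m = max(lam * r for r in relevances)
        return (m + math.log(sum(math.exp(lam * r - m) for r in relevances))) / lam
    raise ValueError("mode must be 'avg' or 'lse'")

def image_text_similarity(regions, words, mode="avg"):
    # Image-Text: each region attends to the words of the sentence.
    attended = attend(regions, words)
    return pool([cosine(v, a) for v, a in zip(regions, attended)], mode)

def text_image_similarity(regions, words, mode="avg"):
    # Text-Image: each word attends to the image regions.
    attended = attend(words, regions)
    return pool([cosine(e, a) for e, a in zip(words, attended)], mode)

# Toy usage with made-up 3-dimensional features.
regions = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
words = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]
print(image_text_similarity(regions, words, mode="avg"))
print(text_image_similarity(regions, words, mode="lse"))
```

In this sketch the only difference between the two formulations is which side supplies the queries and which supplies the context, which is one way to read "attend to words with respect to image regions and vice versa".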

23 Jul 2018 | Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, and Xiaodong He