This paper proposes a hierarchical co-attention model for Visual Question Answering (VQA) that jointly reasons about image and question attention, addressing both "where to look" (visual attention) and "what words to listen to" (question attention). The question is represented hierarchically at three levels: word, phrase, and question. Phrase-level features are built with a 1-dimensional convolutional neural network (CNN) that captures unigram, bigram, and trigram information, and co-attention between the image and the question is computed at every level using two mechanisms, parallel and alternating co-attention, which produce attention maps over image regions and question words.

The model is evaluated on two large datasets, VQA and COCO-QA, where it outperforms prior methods and achieves state-of-the-art accuracy. Ablation studies quantify the contribution of each component: question-level attention matters most for accurate answer prediction, followed by phrase-level and then word-level attention. Visualizations show that the model attends to the image regions and question words relevant to each answer, and the approach is also applicable to other tasks involving vision and language.
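
To make the hierarchical question encoding concrete, the sketch below shows one plausible way the phrase level could be computed with 1-D convolutions over word embeddings, following the summary's description of unigram, bigram, and trigram filters. The layer names, dimensions, and the max-pooling across n-gram scales are assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn


class PhraseEncoder(nn.Module):
    """Sketch: phrase-level question features from word embeddings using
    1-D convolutions with unigram, bigram, and trigram windows."""

    def __init__(self, embed_dim=512):
        super().__init__()
        # One convolution per n-gram size; padding keeps the sequence length fixed.
        self.unigram = nn.Conv1d(embed_dim, embed_dim, kernel_size=1)
        self.bigram = nn.Conv1d(embed_dim, embed_dim, kernel_size=2, padding=1)
        self.trigram = nn.Conv1d(embed_dim, embed_dim, kernel_size=3, padding=1)

    def forward(self, word_feats):
        # word_feats: (batch, seq_len, embed_dim); Conv1d expects channels first.
        x = word_feats.transpose(1, 2)
        uni = torch.tanh(self.unigram(x))
        bi = torch.tanh(self.bigram(x))[:, :, :-1]  # drop the extra step from padding
        tri = torch.tanh(self.trigram(x))
        # Assumed pooling: take the max across the three n-gram scales per position.
        phrase = torch.max(torch.stack([uni, bi, tri], dim=-1), dim=-1).values
        return phrase.transpose(1, 2)  # (batch, seq_len, embed_dim)
```

The question-level representation can then be obtained by running a recurrent encoder over these phrase features, giving the three levels at which co-attention is applied.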
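
The sketch below illustrates the parallel co-attention idea: image and question features attend to each other through a shared affinity matrix, yielding one attended image vector and one attended question vector per level. The projection sizes, parameter names (`W_b`, `W_v`, `W_q`), and softmax placement are assumptions made for a minimal, self-contained example rather than the paper's exact equations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ParallelCoAttention(nn.Module):
    """Sketch of parallel co-attention between image regions and question words."""

    def __init__(self, dim=512, hidden=512):
        super().__init__()
        self.W_b = nn.Linear(dim, dim, bias=False)      # affinity transform
        self.W_v = nn.Linear(dim, hidden, bias=False)   # image projection
        self.W_q = nn.Linear(dim, hidden, bias=False)   # question projection
        self.w_hv = nn.Linear(hidden, 1)                # image attention scores
        self.w_hq = nn.Linear(hidden, 1)                # question attention scores

    def forward(self, V, Q):
        # V: (batch, num_regions, dim) image features
        # Q: (batch, seq_len, dim) question features at one level of the hierarchy
        C = torch.tanh(Q @ self.W_b(V).transpose(1, 2))          # (batch, seq_len, regions)
        H_v = torch.tanh(self.W_v(V) + C.transpose(1, 2) @ self.W_q(Q))
        H_q = torch.tanh(self.W_q(Q) + C @ self.W_v(V))
        a_v = F.softmax(self.w_hv(H_v), dim=1)                   # attention over image regions
        a_q = F.softmax(self.w_hq(H_q), dim=1)                   # attention over question words
        v_att = (a_v * V).sum(dim=1)                             # attended image feature
        q_att = (a_q * Q).sum(dim=1)                             # attended question feature
        return v_att, q_att
```

Alternating co-attention, the second mechanism mentioned in the summary, would instead attend to the image and the question sequentially, conditioning each step on the output of the previous one; the attended vectors from the word, phrase, and question levels are then combined recursively to predict the answer.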