Hierarchical Question-Image Co-Attention for Visual Question Answering

2 Jun 2016 | Jiasen Lu, Jianwei Yang, Dhruv Batra, Devi Parikh
This paper introduces a hierarchical co-attention model for Visual Question Answering (VQA) that jointly reasons about visual attention and question attention. In addition to the usual problem of "where to look" in the image, the model addresses "what words to listen to" in the question, using a co-attention mechanism that ties the visual and question representations together. The question is represented hierarchically at three levels (word, phrase, and question), with one-dimensional convolutional neural networks (CNNs) used to build this hierarchy. The final answer is predicted by recursively combining the co-attended features from all three levels. The proposed model outperforms existing methods on the VQA and COCO-QA datasets, improving on the prior state of the art by 2% and 4%, respectively. The paper also includes ablation studies that validate the contribution of individual model components, and visualizations of the co-attention maps.
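To make the idea of co-attention concrete, here is a minimal NumPy sketch of one simplified variant. It is not the authors' implementation: the function name `co_attention`, the bilinear weight `W_b`, the max-pooling of the affinity matrix, and all shapes are assumptions chosen for illustration. It computes an affinity score between every question position and every image region, then derives one attention distribution over words and one over regions from that shared matrix.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def co_attention(Q, V, W_b):
    """Simplified co-attention (illustrative assumption, not the paper's code).

    Q:   (T, d) question features, one row per word or phrase position
    V:   (N, d) image features, one row per spatial region
    W_b: (d, d) bilinear weight relating the two modalities
    """
    # Affinity between every question position and every image region.
    C = np.tanh(Q @ W_b @ V.T)            # shape (T, N)

    # Question attention: score each word by its best-matching region.
    a_q = softmax(C.max(axis=1))          # shape (T,)
    # Image attention: score each region by its best-matching word.
    a_v = softmax(C.max(axis=0))          # shape (N,)

    # Attended feature vectors are attention-weighted sums.
    q_hat = a_q @ Q                       # shape (d,)
    v_hat = a_v @ V                       # shape (d,)
    return a_q, a_v, q_hat, v_hat


# Toy usage with random features (sizes are arbitrary).
rng = np.random.default_rng(0)
T, N, d = 5, 9, 8
Q = rng.standard_normal((T, d))
V = rng.standard_normal((N, d))
W_b = rng.standard_normal((d, d)) * 0.1

a_q, a_v, q_hat, v_hat = co_attention(Q, V, W_b)
print(a_q.shape, a_v.shape, q_hat.shape, v_hat.shape)
```

In a hierarchical setup like the one summarized above, a function of this kind would be applied separately to the word-level, phrase-level, and question-level features, and the resulting attended vectors combined to predict the answer.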