9 Apr 2016 | Yuke Zhu, Oliver Groth, Michael Bernstein, Li Fei-Fei
The paper "Visual7W: Grounded Question Answering in Images" by Yuke Zhu, Oliver Groth, Michael Bernstein, and Li Fei-Fei addresses the challenge of visual question answering (QA) by introducing a new dataset and a novel LSTM model. The authors highlight the limitations of previous QA tasks, which often lack a tight semantic link between textual descriptions and image regions, leading to a gap between human performance and machine performance. To bridge this gap, they propose object-level grounding, which establishes a direct correspondence between textual mentions of objects and their visual appearances in images. This approach enables a new type of QA task where answers can be both textual and visual.
The Visual7W dataset is constructed using 7W questions (what, where, when, who, why, how, which) and includes 327,939 QA pairs on 47,300 COCO images. Each QA pair is associated with multiple-choice answers and object groundings, providing dense annotations and a flexible evaluation environment. The authors evaluate human performance and several baseline models on the QA tasks, finding a significant performance gap between humans (96.6%) and state-of-the-art LSTM models (52.1%).
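To make the multiple-choice setup concrete, here is a minimal Python sketch of what a Visual7W-style QA record and its accuracy metric might look like. The field names, the `QARecord` class, and the `predict` callable are illustrative assumptions for this summary, not the released dataset schema or the paper's evaluation code.

```python
# Hypothetical sketch of a Visual7W-style multiple-choice record and of
# multiple-choice accuracy; field names are assumptions, not the official
# JSON schema of the released dataset.

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class QARecord:
    image_id: int                      # COCO image identifier
    question: str                      # e.g. "What is the man holding?"
    choices: List[str]                 # candidate answers, one of which is correct
    answer_idx: int                    # index of the correct choice
    groundings: List[Tuple[int, int, int, int]] = field(default_factory=list)
    # bounding boxes (x, y, w, h) linking mentioned objects to image regions

def multiple_choice_accuracy(records: List[QARecord], predict) -> float:
    """Fraction of records where the model picks the correct choice.

    `predict` is any callable mapping (image_id, question, choices)
    to the index of the chosen answer.
    """
    correct = sum(
        1 for r in records
        if predict(r.image_id, r.question, r.choices) == r.answer_idx
    )
    return correct / max(len(records), 1)
```

Under this setup, the reported human (96.6%) and LSTM (52.1%) numbers are simply this accuracy computed over the test split.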
To address the visually grounded QA tasks, the authors propose an attention-based LSTM model that captures the intuition that answers to image-related questions correspond to specific image regions. The model learns to attend to relevant regions as it processes the question sequence, reaching a state-of-the-art accuracy of 55.6%. The paper also includes a detailed analysis of the dataset, comparisons with existing datasets, and qualitative results demonstrating the effectiveness of the proposed model.
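The core mechanism is easiest to see in code. Below is a minimal PyTorch sketch of the general idea: at each question-word step the model computes soft attention over a grid of convolutional region features, and the attended visual vector is fed into the LSTM together with the word embedding; the final hidden state is then used to score candidate answers. The class name, layer sizes, and the answer-scoring step are assumptions made for illustration, not the paper's exact architecture.

```python
# Minimal sketch of an attention-based LSTM for multiple-choice visual QA.
# Dimensions and the dot-product answer scoring are illustrative choices.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionLSTMQA(nn.Module):
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512, region_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTMCell(embed_dim + region_dim, hidden_dim)
        self.att_region = nn.Linear(region_dim, hidden_dim, bias=False)
        self.att_hidden = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.att_score = nn.Linear(hidden_dim, 1, bias=False)
        self.answer_proj = nn.Linear(hidden_dim, embed_dim)

    def attend(self, regions, h):
        # regions: (B, R, region_dim), h: (B, hidden_dim)
        scores = self.att_score(torch.tanh(
            self.att_region(regions) + self.att_hidden(h).unsqueeze(1)))  # (B, R, 1)
        alpha = F.softmax(scores, dim=1)                                  # attention weights
        return (alpha * regions).sum(dim=1)                               # (B, region_dim)

    def forward(self, question_tokens, regions, choice_embeddings):
        # question_tokens: (B, T) word ids, regions: (B, R, region_dim)
        # choice_embeddings: (B, C, embed_dim) pooled embeddings of candidate answers
        B = question_tokens.size(0)
        h = regions.new_zeros(B, self.lstm.hidden_size)
        c = regions.new_zeros(B, self.lstm.hidden_size)
        for t in range(question_tokens.size(1)):
            attended = self.attend(regions, h)       # re-attend at every word
            x = torch.cat([self.embed(question_tokens[:, t]), attended], dim=1)
            h, c = self.lstm(x, (h, c))
        # score each candidate answer against the final question/image state
        query = self.answer_proj(h)                                        # (B, embed_dim)
        return torch.bmm(choice_embeddings, query.unsqueeze(2)).squeeze(2) # (B, C)
```

The key design point captured here is that attention is recomputed at every word of the question, so the attended region can shift as new words arrive, which is the intuition the paper describes for grounding answers in specific image regions.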