9 Apr 2016 | Yuke Zhu, Oliver Groth, Michael Bernstein, Li Fei-Fei
Visual7W: Grounded Question Answering in Images
This paper introduces the Visual7W dataset, which extends previous visual question answering (QA) tasks by incorporating object-level grounding. The dataset includes 327,939 QA pairs from 47,300 COCO images, along with 1,311,756 human-generated multiple-choice answers and 561,459 object groundings. The 7W questions cover what, where, when, who, why, how, and which, enabling both textual and visual answers. The dataset provides detailed annotations linking object mentions in QA sentences to their bounding boxes in images, allowing for a new type of QA with visually grounded answers.
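To make the grounded QA format concrete, here is a minimal sketch of what a single record in such a dataset might look like. The field names and values are illustrative assumptions for a "pointing" (which) question, not the released Visual7W file format.

```python
# Hypothetical record structure for a grounded, multiple-choice QA pair.
# Field names are illustrative; consult the released Visual7W files for the actual schema.
qa_record = {
    "image_id": 123456,                       # COCO image the question refers to
    "question": "Which dog is chasing the ball?",
    "type": "which",                          # one of the 7W question types
    "answer": {"box": [48, 102, 230, 310]},   # a pointing answer is a bounding box [x, y, w, h]
    "multiple_choices": [                     # human-generated distractor boxes
        {"box": [300, 90, 120, 180]},
        {"box": [10, 20, 80, 60]},
        {"box": [150, 200, 90, 140]},
    ],
    "groundings": [                           # object mentions in the QA linked to boxes
        {"phrase": "the ball", "box": [260, 280, 40, 40]},
    ],
}
```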
The paper proposes an attention-based LSTM model with spatial attention to tackle the 7W QA tasks. The model captures the intuition that answers to image-related questions usually correspond to specific image regions: it learns to attend to the pertinent regions as it reads the question tokens in sequence. The model achieves state-of-the-art performance of 55.6% accuracy, and the authors find correlations between its attention heat maps and the human-annotated object groundings.
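As a rough illustration of this kind of spatial attention, the PyTorch sketch below scores a grid of convolutional image features against the LSTM hidden state at each question token and feeds the attention-weighted image vector into the next step. The layer sizes, module names, and the exact way the context vector is combined with the word embedding are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttentionLSTM(nn.Module):
    """Minimal sketch of an attention-based LSTM for visual QA.

    At each question token, the hidden state scores a grid of image
    features; the attention-weighted image vector is fed into the next
    LSTM step together with the token embedding.
    """
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512, feat_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.cell = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.att_score = nn.Linear(hidden_dim + feat_dim, 1)

    def forward(self, question_tokens, image_feats):
        # question_tokens: (batch, seq_len) word indices
        # image_feats:     (batch, regions, feat_dim), e.g. a 14x14 conv grid flattened to 196 regions
        batch, regions, _ = image_feats.shape
        h = image_feats.new_zeros(batch, self.cell.hidden_size)
        c = image_feats.new_zeros(batch, self.cell.hidden_size)
        for t in range(question_tokens.size(1)):
            # score each region against the current hidden state
            h_exp = h.unsqueeze(1).expand(-1, regions, -1)
            scores = self.att_score(torch.cat([h_exp, image_feats], dim=-1)).squeeze(-1)
            alpha = F.softmax(scores, dim=-1)                          # (batch, regions) attention map
            context = (alpha.unsqueeze(-1) * image_feats).sum(dim=1)   # attended image vector
            x = torch.cat([self.embed(question_tokens[:, t]), context], dim=-1)
            h, c = self.cell(x, (h, c))
        return h, alpha  # final state and the last attention map
```

The per-region softmax is what produces the attention heat maps compared against the object groundings in the paper's analysis.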
The Visual7W dataset constitutes a part of the Visual Genome project. It features richer questions and longer answers than VQA, and it includes additional resources such as object groundings, human-generated multiple-choice answers, and human performance evaluations, making it a clean and complete benchmark for evaluation and analysis.
The paper evaluates human and model performance on the QA tasks. Human performance on the 7W QA tasks is 96.6%, while the state-of-the-art LSTM model achieves 55.6%. The results show that the attention-based model outperforms the baseline models on most question types, but humans still outperform it by a wide margin on all of them. The model's attention heat maps show that it focuses on answer-related regions, indicating a tendency to attend to the objects mentioned in the questions and answers.
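The multiple-choice evaluation itself is straightforward: the model assigns a score to each candidate answer, and the highest-scoring candidate counts as its prediction. A minimal sketch, assuming a user-supplied score_fn and a simple list-of-dicts data layout (both assumptions, not the paper's code):

```python
def multiple_choice_accuracy(score_fn, qa_items):
    """Fraction of questions where the top-scoring candidate is the correct answer.

    score_fn(image, question, candidate) -> float is assumed to be provided by the model.
    qa_items is assumed to be an iterable of dicts with keys
    'image', 'question', 'choices' (candidate answers), and 'answer' (the correct one).
    """
    correct = 0
    for item in qa_items:
        scores = [score_fn(item["image"], item["question"], c) for c in item["choices"]]
        prediction = item["choices"][scores.index(max(scores))]
        correct += int(prediction == item["answer"])
    return correct / len(qa_items)
```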
The paper also analyzes the impact of object category frequency on model accuracy in the pointing QA task. Accuracy increases gradually as the model sees more instances from the same category, and the model is able to transfer knowledge from common categories to rare ones, achieving adequate performance on object categories with only a few instances.
The paper concludes that the Visual7W dataset enables a new type of QA with visually grounded answers, and that future research directions include exploring ways of utilizing common-sense knowledge to improve performance on QA tasks that require complex reasoning.